Grep and Extract Data in Perl?


Since it's HTML, you probably want the XPath module made for working with HTML, HTML::TreeBuilder::XPath. First you'll need to parse your string into a tree. HTML::TreeBuilder::XPath is a subclass of HTML::TreeBuilder, so it uses the same parsing methods. Assuming your webpage's content is in a variable named $content, do it like this:

    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse($content);
    $tree->eof;

(If the page is in a file instead, call $tree->parse_file($file_name).) Now you can use XPath expressions to get the nodes you care about.

This first expression gets all td nodes that are in a tr in a table in the body in the html element:

    my $tdNodes = $tree->findnodes('/html/body/table/tr/td');

Finally you can just iterate over all the nodes in a loop to find what you want:

    foreach my $node ($tdNodes->get_nodelist) {
        my $data = $node->findvalue('.');  # the content of the node
        print "$data\n";
    }

See the HTML::TreeBuilder documentation for more on its methods and the NodeSet documentation for how to use the NodeSet result object. W3Schools has a passable XPath tutorial.
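Putting the steps above together, here is a minimal self-contained sketch. The sample page in $content is made up for illustration, and using as_trimmed_text to read each cell is my choice rather than something the answer prescribes. It assumes HTML::TreeBuilder::XPath is installed from CPAN.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical sample page standing in for the real $content.
my $content = <<'HTML';
<html><body>
<table>
  <tr><td>alpha</td><td>beta</td></tr>
  <tr><td>gamma</td><td>delta</td></tr>
</table>
</body></html>
HTML

# Parse the string into a tree that understands XPath queries.
my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($content);
$tree->eof;

# In list context findnodes returns the matching elements directly.
my @cells = map { $_->as_trimmed_text }
            $tree->findnodes('/html/body/table/tr/td');

print "$_\n" for @cells;

# Free the tree explicitly once the data has been copied out.
$tree->delete;
```

Copying the text into a plain array before calling delete keeps the extracted values usable after the tree is gone.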

With all this, you should be able to do pretty robust HTML parsing to grab out any element you want. You can even specify classes, ids, and more in your XPath queries to be really specific about which nodes you want. In my opinion, parsing HTML using this modified XPath library is a lot faster and more maintainable than dealing with a bunch of one-off regexes.
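As a rough illustration of narrowing a query by id and class, the sketch below selects only the p elements with class "name" inside the div whose id is "stores". The markup, the id, and the class names are all invented for the example. Note that @class="name" is an exact string match, so it would not match an element carrying several classes.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup; the id and class names are made up.
my $content = <<'HTML';
<html><body>
<div id="stores">
  <p class="name">First Shop</p>
  <p class="addr">1 Main St</p>
  <p class="name">Second Shop</p>
</div>
<p class="name">Outside</p>
</body></html>
HTML

my $tree = HTML::TreeBuilder::XPath->new;
$tree->parse($content);
$tree->eof;

# Attribute predicates restrict the match to one container and one class.
my @names = map { $_->as_trimmed_text }
            $tree->findnodes('//div[@id="stores"]/p[@class="name"]');

print "$_\n" for @names;

$tree->delete;
```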

+1 for having example code. – user181548 May 21 '10 at 23:53

search.cpan.org/~msergeant/XML-XPath-1.1... XPath is the way.

I have very limited experience with it, but never saw it used to handle non-X HTML – DVK May 21 '10 at 23:31

@DVK: I wouldn't put it past an XPath module developed in Perl to try to be a little more clever. – Axeman May 21 '10 at 23:33

@Axeman - touche :) – DVK May 21 '10 at 23:34

I've always used the HTML::TreeBuilder::XPath library when using XPath to query HTML documents (search.cpan.org/~mirod/HTML-TreeBuilder-XPath-0.11/lib/HTML/…). It's been pretty robust as far as I can tell (I've scraped tens of thousands of business locations from certain sites using it). – jasonmp85 May 21 '10 at 23:44

I wanted to link you the HTML::TreeBuilder::XPath but I got it wrong when copying the link from google. I'm sorry. – dierre May 21 '10 at 23:57

Use HTML parsing modules as described in the answers to this question: HTML::TreeBuilder or HTML::Parser. Purely theoretically, you could try doing this with regular expressions, but as noted in the linked question's answers and countless other times on SO, parsing HTML with a regex is a Bad Idea with capital letters: too easy to get wrong, too hard to get right, and impossible to get 100% right, since HTML is not a regular language.

It might be theoretically impossible - HTML isn't a regular language. If his query is "regular", it would be possible. – Paul Nathan May 21 '10 at 23:48.

You might try this module: HTML::TreeBuilder::XPath. The doc says: This module adds typical XPath methods to HTML::TreeBuilder, to make it easy to query a document.

