HTML::TreeBuilder - Parse between parents?


Folks,

There is so much info out there on HTML::TreeBuilder that I'm surprised I can't find the answer; hopefully I'm not just missing it. What I'm trying to do is simply parse between parent nodes, so given an HTML doc like this:

    something something something something something something something ....

I want to be able to get the info about that 1st anchor tag (111), then process the 3 p tags, then get the next anchor tag (222) and process those p tags, and so on. It's easy to get to each anchor tag:

    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new();
    $tree->parse_file("index-01.htm");

    foreach my $atag ( $tree->look_down( '_tag', 'a' ) ) {
        if ($atag->attr('id')) {
            # Found 'a' tag, now process the p tags until the next 'a'
        }
    }

But once I find that tag, how do I then get all the p tags until the next anchor?

TIA!

Tags: perl, html-parsing

Asked Oct 6 '10 at 17:00 by Chris; edited Oct 6 '10 at 22:13 by Sinan Ünür.

HTML::TreeBuilder version

    #!/usr/bin/perl

    use strict; use warnings;

    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file(\*DATA);
    $tree->elementify;
    # Wrap bare text in pseudo-elements so every sibling is an object
    $tree->objectify_text;

    foreach my $atag ( $tree->look_down( '_tag', 'a' ) ) {
        if ($atag->attr('id')) {
            printf "Found %s\n", $atag->as_XML;
            process_p( $atag );
        }
    }

    # Walk the right-hand siblings of $tag until the next 'a'
    sub process_p {
        my ($tag) = @_;

        while ( defined( $tag ) and defined( my $next = $tag->right ) ) {
            last if lc $next->tag eq 'a';
            if ( lc $next->tag eq 'p') {
                # Turn the text back into plain strings before printing
                $next->deobjectify_text;
                print $next->as_text, "\n";
            }
            $tag = $next;
        }
    }

    __DATA__
    something something somethingsometext something something something something

Output:

    Found something something something
    Found something something something

HTML::TokeParser::Simple version

    #!/usr/bin/perl

    use strict; use warnings;

    use HTML::TokeParser::Simple;

    my $parser = HTML::TokeParser::Simple->new(\*DATA);

    while ( my $tag = $parser->get_tag('a') ) {
        next unless $tag->get_attr('id');
        printf "Found %s\n", $tag->as_is;
        process_p($parser);
    }

    # Consume tokens until the next 'a' start tag, printing each p's text
    sub process_p {
        my ($parser) = @_;

        while ( my $next = $parser->get_token ) {
            if ( $next->is_start_tag('a') ) {
                # Put the 'a' back so the outer loop sees it
                $parser->unget_token($next);
                return;
            }
            elsif ( $next->is_start_tag('p') ) {
                print $parser->get_text('/p'), "\n";
            }
        }
        return;
    }

Output:

    Found something something something
    Found something something something

Thanks Sinan, this works almost perfectly. I just noticed an issue in the HTML though: some of the "something" tags in the HTML actually look like "somethingsometext". When I try to run the above I get "Can't locate object method "tag" via package sometext". – Chris Oct 6 '10 at 20:54

Is there some way to examine the "sometext" that appears and also be able to continue on without an error? TIA! – Chris Oct 6 '10 at 20:55

That throws a monkey wrench in things. That's because the string is not wrapped in an HTML::Element. I'll post a solution in a few minutes. – Sinan Ünür Oct 6 '10 at 21:09

That would be awesome Sinan, thanks! – Chris Oct 6 '10 at 21:26

@Chris Done! However, I am beginning to think HTML::TokeParser::Simple might be more appropriate for this task. – Sinan Ünür Oct 6 '10 at 21:34
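
To illustrate the error discussed in the comments above: when text has not been objectified, HTML::Element's right() can hand back plain strings for text nodes, and calling ->tag on a string dies with "Can't locate object method". The following is a minimal sketch of one way to guard against that with a ref check. It assumes a made-up sample document (the original markup is not shown in this thread) and a hypothetical helper named collect_until_next_anchor, and it is not necessarily the fix Sinan posted.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample input; the markup from the original question is not shown.
my $html = '<a id="111">first</a><p>one</p>sometext<p>two</p>'
         . '<a id="222">second</a><p>three</p>';

my $tree = HTML::TreeBuilder->new_from_content($html);

# Hypothetical helper: collect what follows $atag up to the next <a> sibling.
sub collect_until_next_anchor {
    my ($atag) = @_;
    my @found;

    # In list context, right() returns all right-hand siblings of the element.
    for my $node ( $atag->right ) {
        if ( !ref $node ) {
            # Bare text node: a plain string, so it has no ->tag method.
            push @found, "text: $node" if $node =~ /\S/;
            next;
        }
        last if $node->tag eq 'a';
        push @found, 'p: ' . $node->as_text if $node->tag eq 'p';
    }
    return @found;
}

# Keyed by anchor id, so the collected items can be inspected afterwards.
my %results;
for my $atag ( $tree->look_down( _tag => 'a', sub { defined $_[0]->attr('id') } ) ) {
    my $id = $atag->attr('id');
    $results{$id} = [ collect_until_next_anchor($atag) ];
    print "Found a#$id\n";
    print "  $_\n" for @{ $results{$id} };
}

$tree->delete;
```

The ref check lets the loop examine the stray text and keep going, which is what the second comment asks for. Objectifying the text, as in the answer, is the alternative: every sibling then becomes an object and the check is unnecessary.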

