Folks,

There is so much info out there on HTML::TreeBuilder that I'm surprised I can't find the answer; hopefully I'm not just missing it. What I'm trying to do is simply parse between parent nodes, so given an HTML doc like this

    something something something something something something something ....

I want to be able to get the info about that 1st anchor tag (111), then process the 3 p tags, then get the next anchor tag (222) and process those p tags, and so on. It's easy to get to each anchor tag:

    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new();
    $tree->parse_file("index-01.htm");

    foreach my $atag ( $tree->look_down( '_tag', 'a' ) ) {
        if ($atag->attr('id')) {
            # Found 'a' tag, now process the p tags until the next 'a'
        }
    }

But once I find that tag, how do I then get all the p tags until the next anchor?
TIA!

perl html-parsing

asked Oct 6 '10 at 17:00 by Chris, edited Oct 6 '10 at 22:13 by Sinan Ünür
HTML::TreeBuilder version

    #!/usr/bin/perl

    use strict; use warnings;
    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file(\*DATA);
    $tree->elementify;
    $tree->objectify_text;    # wrap bare text in ~text pseudo-elements

    foreach my $atag ( $tree->look_down( '_tag', 'a' ) ) {
        if ($atag->attr('id')) {
            printf "Found %s\n", $atag->as_XML;
            process_p( $atag );
        }
    }

    # Walk the right siblings of $tag, printing each p until the next a.
    sub process_p {
        my ($tag) = @_;

        while ( defined( $tag ) and defined( my $next = $tag->right ) ) {
            last if lc $next->tag eq 'a';
            if ( lc $next->tag eq 'p') {
                $next->deobjectify_text;
                print $next->as_text, "\n";
            }
            $tag = $next;
        }
    }

    __DATA__
    something something somethingsometext something something something something

Output:

    Found something something something
    Found something something something

HTML::TokeParser::Simple version

    #!/usr/bin/perl

    use strict; use warnings;
    use HTML::TokeParser::Simple;

    my $parser = HTML::TokeParser::Simple->new(\*DATA);

    while ( my $tag = $parser->get_tag('a') ) {
        next unless $tag->get_attr('id');
        printf "Found %s\n", $tag->as_is;
        process_p($parser);
    }

    # Read tokens, printing the text of each p, until the next a starts.
    sub process_p {
        my ($parser) = @_;

        while ( my $next = $parser->get_token ) {
            if ( $next->is_start_tag('a') ) {
                # Push the token back so the outer loop sees this anchor.
                $parser->unget_token($next);
                return;
            }
            elsif ( $next->is_start_tag('p') ) {
                print $parser->get_text('/p'), "\n";
            }
        }
        return;
    }

Output:

    Found something something something
    Found something something something
Thanks Sinan, this works almost perfectly. I just noticed an issue in the HTML though: some of the "something" tags in the HTML actually look like "somethingsometext". When I try to run the above I get "Can't locate object method "tag" via package sometext". – Chris Oct 6 '10 at 20:54

Is there some way to examine the "sometext" that appears and also continue on without an error? TIA! – Chris Oct 6 '10 at 20:55

That throws a monkey wrench in things. That's because the string is not wrapped in an HTML::Element. I'll post a solution in a few minutes. – Sinan Ünür Oct 6 '10 at 21:09

That would be awesome Sinan, thanks! – Chris Oct 6 '10 at 21:26

@Chris Done! However, I am beginning to think HTML::TokeParser::Simple might be more appropriate for this task. – Sinan Ünür Oct 6 '10 at 21:34
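The error discussed in the comments comes from calling a method on a sibling that is a plain string rather than an HTML::Element. As a rough sketch of one other way to sidestep it (not necessarily what the answer does), the sibling walk can skip anything that is not a reference. The sample markup and the helper name `paragraphs_after` are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical sample markup: a bare string ("sometext") sits between
# the elements, which is the situation that triggers the error.
my $html = <<'HTML';
<a id="111">first</a>
<p>one</p>sometext<p>two</p>
<a id="222">second</a>
<p>three</p>
HTML

# Collect the text of every <p> sibling that follows $atag, stopping at
# the next <a>. Bare strings are skipped instead of being dereferenced.
sub paragraphs_after {
    my ($atag) = @_;
    my @texts;
    for my $node ( $atag->right ) {    # list context: all right siblings
        next unless ref $node;         # plain text, not an HTML::Element
        last if $node->tag eq 'a';
        push @texts, $node->as_text if $node->tag eq 'p';
    }
    return @texts;
}

my $tree = HTML::TreeBuilder->new_from_content($html);
for my $atag ( $tree->look_down( _tag => 'a' ) ) {
    next unless defined $atag->attr('id');
    printf "%s: %s\n", $atag->attr('id'), join ', ', paragraphs_after($atag);
}
$tree->delete;    # free the tree explicitly
```

Checking `ref $node` leaves the tree untouched, at the cost of ignoring the stray text entirely; if that text needs to be examined, it is available as `$node` in the branch that is skipped here.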
I can't really give you an answer, but I can give you a way to a solution: you have to find the angle that you relate to or that piques your interest. A good paper is one that people get drawn into because it reaches them in some way. As for me, when I think of WWII, I think of the Holocaust and the effect it had on the survivors, their families, and those who stood by and did nothing until it was too late.