Ever since I asked how to parse HTML with regex and got bashed a bit (rightfully so), I've been studying the HTML::TreeBuilder, HTML::Parser, HTML::TokeParser, and HTML::Element Perl modules. I have HTML like this:

    <div id="listSubtitlesFilm">
      <dt id="a1">
        <a href="/45/subtitles-67624.aspx">.45 (2006)</a>
      </dt>
    </div>

I want to parse out the /45/subtitles-67624.aspx, but more importantly I want to know how to parse out the contents of the div.

I was given this example on a previous question:

    while ( my $anchor = $parser->get_tag('a') ) {
        if ( my $href = $anchor->get_attr('href') ) {
            # http://subscene.com/english/Sit-Down-Shut-Up-First-Season/subtitles-272112.aspx
            push @dnldLinks, $1 if $href =~ m!/subtitle-(\d{2,8})\.aspx!;
        }
    }

This worked perfectly for that, but when I tried to edit it a bit and use it on a div, it didn't work. Here is the code I tried:

    while ( my $anchor = $p->get_tag("dt") ) {
        if ( $stuff = $anchor->get_attr('a1') ) {
            print $stuff . "\n";
        }
    }

Tags: html, perl, html-parsing. Asked Nov 7 '09 at 7:53 by Codygman; edited Nov 7 '09 at 12:13 by Sinan Ünür.
Sorry! Updated it! – Codygman Nov 7 '09 at 8:03

What module are you actually using? You mention like five in your question, there's no such thing as HTML::TreeParser, and your code doesn't look like it's for HTML::TreeBuilder... – hobbs Nov 7 '09 at 8:04

I'm using HTML::TokeParser::Simple. Sorry for the confusion. – Codygman Nov 7 '09 at 8:07

I think the previous question mentioned is this: stackoverflow.com/questions/1683555/… – user181548 Nov 7 '09 at 8:08
To address your specific question, given the HTML:

    <div id="listSubtitlesFilm">
      <dt id="a1">
        <a href="/45/subtitles-67624.aspx">.45 (2006)</a>
      </dt>
    </div>

I am assuming you are interested in the anchor text, i.e. ".45 (2006)" in this case, but only if the anchor occurs in a div with id listSubtitlesFilm.

    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTML::TokeParser::Simple;

    my $parser = HTML::TokeParser::Simple->new(handle => \*DATA);

    my @dnldLinks;

    while ( my $div = $parser->get_tag('div') ) {
        # Skip every div except the one we care about
        my $id = $div->get_attr('id');
        next unless defined($id) and $id eq 'listSubtitlesFilm';

        # First anchor after the opening div tag; stop at end of input
        my $anchor = $parser->get_tag('a') or last;
        my $href   = $anchor->get_attr('href');
        next unless defined($href)
            and $href =~ m!/subtitles-(\d{2,8})\.aspx\z!;

        # Store the anchor text together with the numeric id from the href
        push @dnldLinks, [ $parser->get_trimmed_text('/a'), $1 ];
    }

    use Data::Dumper;
    print Dumper \@dnldLinks;

    __DATA__
    <div id="listSubtitlesFilm">
      <dt id="a1">
        <a href="/45/subtitles-67624.aspx">.45 (2006)</a>
      </dt>
    </div>

Output:

    $VAR1 = [
              [
                '.45 (2006)',
                '67624'
              ]
            ];
Thanks SO much for the detailed explanation, Sinan! You're making me love Perl! :P – Codygman Nov 8 '09 at 6:54
You could use (yet another module!) HTML::TreeBuilder::XPath, which, as per its name, will let you use XPath on HTML::TreeBuilder objects.

    #!/usr/bin/perl

    use strict;
    use warnings;

    use HTML::TreeBuilder::XPath;

    my $root = HTML::TreeBuilder::XPath->new_from_file("my.html");

    # print $root->as_HTML; # useful to see how HTML::TreeBuilder
    #                       # understands your HTML. For example it will wrap the implied
    #                       # dl element around dt, which you need to take into account
    #                       # when writing the XPath query below

    my $id = "a1";

    # you need the .//dt because of the extra dl
    my @divs = $root->findnodes(qq{//div[.//dt[\@id="$id"]]});

    print $divs[0]->as_HTML; # or as_text
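As a rough sketch of the same XPath approach applied to the asker's stated goal (getting the link out of the named div), the query can select the href attribute directly. The sample markup is reconstructed from the ids mentioned in this thread and the variable names are illustrative assumptions, not part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Sample markup assumed for illustration only.
my $html = <<'HTML';
<div id="listSubtitlesFilm">
<dt id="a1"><a href="/45/subtitles-67624.aspx">.45 (2006)</a></dt>
</div>
HTML

my $root = HTML::TreeBuilder::XPath->new_from_content($html);

# Select the href attribute of every anchor below the div with this id.
# The // step means the implied dl wrapper does not matter here.
my @hrefs = $root->findvalues('//div[@id="listSubtitlesFilm"]//a/@href');

# Keep only the numeric part of links such as /45/subtitles-67624.aspx.
my @ids = map { m!/subtitles-(\d{2,8})\.aspx\z! ? $1 : () } @hrefs;

print "$_\n" for @ids;

# Free the tree explicitly, as HTML::TreeBuilder recommends.
$root->delete;
```

Selecting the attribute in the query keeps the Perl side down to a single regex over plain strings; whether that is clearer than walking the nodes is a matter of taste.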
Thanks mirod, using xpath seems like it will really help my RAD :) The comments were really helpful too, knowing how it understands my html is very important. – Codygman Nov 11 '09 at 21:03.
Code using HTML::TreeBuilder:

    use HTML::TreeBuilder;

    my $tree = HTML::TreeBuilder->new_from_content($html);

    # The pattern accepts both /subtitle- and /subtitles- links
    for my $link ( $tree->look_down( _tag => 'a', href => qr{/subtitles?-\d{2,8}\.aspx} ) ) {
        # List context assigns the captured digits to $linkid
        my ($linkid) = $link->attr('href') =~ m!/subtitles?-(\d{2,8})\.aspx!;

        # Scalar context gets the first, and the first is the nearest parent
        my $parent_div = $link->look_up( _tag => 'div' );

        # Now the interesting bit of the link is in $linkid, the parent div ID
        # is $parent_div->id or $parent_div->attr('id'), and its text is e.g.
        # $parent_div->as_trimmed_text or you can do other stuff with its content.
    }
I wish I could vote up! :) Thanks, I try not to bother you guys too much, but after an hour of trying to figure this out I was so frustrated! – Codygman Nov 7 '09 at 8:21

The different parser subclasses are all good for different kinds of work. TokeParser is one of the simplest and fastest, but when you want to move up and down in the tag structure, TreeBuilder should be on your mind instead. – hobbs Nov 7 '09 at 8:51

And I'm emphatically not begging for votes, but you now have 21 rep and can upvote me if you so choose, and you should also "accept" one of the answers to your question if you're satisfied. – hobbs Nov 7 '09 at 8:53

Alrighty! Will do, I didn't notice that :) – Codygman Nov 7 '09 at 21:20
You need to change the get_attr("a1") to get_attr("id") here. get_attr(x) looks for an attribute with the name x, but you are giving it the value of the attribute, not its name. Incidentally, the <dt> tag is not a <div>; it is the item tag for a <dl> (definition list).
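The name-versus-value distinction can be sketched as follows, assuming HTML::TokeParser::Simple (the module the asker says he is using) and a sample fragment reconstructed from the ids mentioned in this thread. The variable names are illustrative only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TokeParser::Simple;

# Sample markup assumed for illustration: a dt with id "a1" inside the div.
my $html = <<'HTML';
<div id="listSubtitlesFilm">
<dt id="a1"><a href="/45/subtitles-67624.aspx">.45 (2006)</a></dt>
</div>
HTML

my $p = HTML::TokeParser::Simple->new( string => $html );

my ( @by_value, @ids );

while ( my $dt = $p->get_tag('dt') ) {
    # 'a1' is a value, not an attribute name, so this lookup finds nothing.
    push @by_value, $dt->get_attr('a1');

    # 'id' is the attribute name, so the lookup returns its value.
    push @ids, $dt->get_attr('id');
}

print defined $_ ? $_ : '(undef)', "\n" for @by_value;
print "$_\n" for @ids;
```

Because the first lookup comes back undefined, the `if` in the question's loop is never true and nothing is printed, which matches the "it didn't work" symptom.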
get_attr('a1') should probably have read get_attr('id'), and it would then print "a1".

I think getting the text content would look like:

    while ( my $div = $parser->get_tag('div') ) {
        my $content = $parser->get_text('/div');
    }

Or if you meant the text content of the link it would be:

    while ( my $anchor = $parser->get_tag('a') ) {
        if ( my $href = $anchor->get_attr('href') ) {
            my $content = $parser->get_text('/a');
            #subscene.com/english/Sit-Down-Shut-Up-Fi...
            push @dnldLinks, $1 if $href =~ m!/subtitle-(\d{2,8})\.aspx!;
        }
    }
Thank you, that helped. The other part of the question is how to get the text of what's between GETTHISCONTENT. Can you help with that? Thanks! – Codygman Nov 7 '09 at 8:01

Thanks for the help, sorry for the confusion, I guess less is more on here. My overall goal is to get the a href link out of the tags in that specified div container. – Codygman Nov 7 '09 at 8:11