How can I extract XML from a website and save it to a file using Perl's LWP?

Of course. The easiest way would be the Web::Scraper module. It lets you define scraper objects that consist of hash key names, XPath expressions that locate elements of interest, and code to extract bits of data from them.

Scraper objects take a URL and return a hash of the extracted data. The extractor code for each key can itself be another scraper object, if necessary, so that you can define how to scrape repeated compound page elements: provide the XPath to find the compound element in an outer scraper, then provide a bunch more XPaths to pull out its individual bits in an inner scraper. The result is then automatically a nested data structure. In short, you can very elegantly suck data from all over a page into a Perl data structure.

In the process, the full power of XPath and Perl is available for use against any page. Since the page is parsed with HTML::TreeBuilder, it does not matter how nasty a tag soup it is. The resulting scraper scripts are much easier to maintain and far more tolerant of minor markup variations than regex-based scrapers.
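For instance, a minimal sketch (the URL and the selectors here are made up for illustration; Web::Scraper accepts both CSS selectors and XPath expressions):

    use Web::Scraper;
    use URI;

    # Outer scraper: collect each repeated compound element into shows[].
    my $listings = scraper {
        process 'li.show', 'shows[]' => scraper {
            # Inner scraper: pull the individual bits out of each element.
            process '.title', title => 'TEXT';
            process '.time',  time  => 'TEXT';
        };
    };

    my $res = $listings->scrape( URI->new('http://tv.yahoo.com/listings') );

    for my $show ( @{ $res->{shows} } ) {
        print "$show->{title} at $show->{time}\n";
    }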

Bad news: as yet, its documentation is almost non-existent, so you have to get by with googling for something like miyagawa web::scraper to find example scripts posted by the module’s author.

Also see datenzoo.de/pub/gpw2008/web-scraper/…, my talk (in German) about Web::Scraper. Automatic translation: 66.196.80.202/babelfish/… – Corion Oct 21 '08 at 8:11

Do you really want to recommend this kind of beta module? – Account deleted Oct 21 '08 at 8:43

Beta, really? It's glue for a combo of LWP, HTML::TreeBuilder and HTML::Selector::XPath, all battle-tested production-quality modules. If you enjoy writing boilerplate, though, suit yourself… – Aristotle Pagaltzis Oct 21 '08 at 8:52

I haven't tried it, so perhaps I jumped to conclusions. But the author notes "THIS MODULE IS IN ITS BETA QUALITY. THE API IS STOLEN FROM SCRAPI BUT MAY CHANGE IN THE FUTURE" – Account deleted Oct 21 '08 at 11:13

While in general LWP::Simple or WWW::Mechanize and HTML::Tree are good ways to extract data from web pages, in this particular case (TV listings) there's a much easier way: use XMLTV with data from Schedules Direct. There is a small fee (US$20/year), but there are advantages:

1. The parsing code is already written for you (just use XMLTV;).
2. You won't be violating Yahoo's terms of service.
3. You won't have to deal with Yahoo actively trying to break your script. (They don't like automated scripts pulling down TV listings; see #2.)
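A minimal sketch of the parsing side (this assumes you have already downloaded a listings file with one of the XMLTV grabbers; see the XMLTV documentation for the exact shape of the returned data):

    use XMLTV;

    # Parse a previously grabbed listings file.
    my $data = XMLTV::parsefiles('listings.xml');
    my ( $encoding, $credits, $channels, $programmes ) = @$data;

    for my $p (@$programmes) {
        # Each programme is a hashref; 'title' is a list of
        # [ text, language ] pairs.
        print "$p->{title}[0][0] starts at $p->{start}\n";
    }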

If you want to pass the information to JavaScript, use JavaScript Object Notation (JSON) instead of XML. There are plenty of Perl libraries, such as JSON::Any, that can handle that for you.
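A minimal sketch (JSON::Any delegates to whichever JSON backend is installed; the $listings structure here is a made-up example):

    use JSON::Any;

    my $j = JSON::Any->new;

    # A hypothetical Perl structure to hand over to the browser.
    my $listings = {
        channel => 'ABC',
        shows   => [ { name => 'Local Programming', time => '4:00pm - 6:30pm' } ],
    };

    my $json = $j->encode($listings);   # Perl structure -> JSON string
    my $back = $j->decode($json);       # JSON string -> Perl structure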

tv.yahoo.com is not very semantic and not very easy to scrape! Maybe there are better alternatives or feeds?

Using pQuery I can quickly get times & shows....

    use pQuery;
    use feature qw(say);   # for say()

    pQuery( 'http://tv.yahoo.com/listings' )
        ->find( '.show' )->each( sub {
            my $n  = shift;
            my $pQ = pQuery( $_ );
            say $pQ->text;
        });

    # => 4:00pm - 6:30pm Local Programming

To scrape details a bit more you can try this....

    use pQuery;
    use feature qw(say);

    my @tv_progs;

    pQuery( 'http://tv.yahoo.com/listings' )
        ->find( 'li div strong' )->each( sub {
            my $n  = shift;
            my $pQ = pQuery( $_ );
            $tv_progs[$n]->{time} = $pQ->text;
        })
        ->end
        ->find( '.showtitle' )->each( sub {
            my $n  = shift;
            my $pQ = pQuery( $_ );
            $tv_progs[$n]->{name} = $pQ->text;
        });

    for my $prog (@tv_progs) {
        say $prog->{name} . " @ " . $prog->{time};
    }

    # => Local Programming @ 4:00pm - 6:30pm

And to get the channel....

    use pQuery;
    use feature qw(say);

    pQuery( 'http://tv.yahoo.com/listings' )
        ->find( '.chhdr a' )->each( sub {
            my $n  = shift;
            my $pQ = pQuery( $_ );
            say $pQ->text;
        });

    # => ABC

However, matching the channel back to the programme info will require a bit of work ;-)

/I3az

