There's no way to do this that's guaranteed to work, but one strategy you might use is to look for the element that contains the most visible text.
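Here is a minimal sketch of that idea, assuming requests and BeautifulSoup are available (the function name and the per-div paragraph scoring are my own choices for illustration, not a standard recipe):

```python
import requests
from bs4 import BeautifulSoup

def element_with_most_text(url):
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # text in these is never rendered
    best, best_len = None, 0
    for div in soup.find_all("div"):
        # Count only text in this div's immediate paragraphs, so a page-wide
        # wrapper div doesn't trivially contain "the most" text.
        text = " ".join(p.get_text(" ", strip=True)
                        for p in div.find_all("p", recursive=False))
        if len(text) > best_len:
            best, best_len = div, len(text)
    return best
```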
Given a news article webpage (from any major news source such as The Times or Bloomberg), I want to identify the main article content on that page and throw out the miscellaneous elements such as ads, menus, sidebars, and user comments. What's a generic way of doing this that will work on most major news sites, and what are some good tools or libraries for data mining (preferably Python-based)?
See how the Readability bookmarklet is implemented: lab.arc90.com/experiments/readability – J.F. Sebastian Jan 12 '11 at 18:07
A browser that does this would be a huge threat to online ads. – Emilio M Bumachar Jan 12 '11 at 18:29
There are a number of ways to do it, but none will always work. Here are the two easiest:

- If it's a known, finite set of websites: have your scraper convert each URL from the normal URL to the print URL for each site (this cannot really be generalized across sites).
- Use the arc90 Readability algorithm (the reference implementation is in JavaScript): code.google.com/p/arc90labs-readability/. The short version of this algorithm is that it looks for divs with p tags inside them. It will not work for some websites, but it is generally pretty good; a toy sketch of the scoring idea follows.
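This is not the real arc90 code, just a rough illustration of its core move under my own simplifications: give each parent a score for every substantial <p> it contains, weighted by how "article-like" the paragraph looks (length, commas), and return the highest-scoring container:

```python
from bs4 import BeautifulSoup

def readability_style_candidate(html):
    soup = BeautifulSoup(html, "html.parser")
    scores, nodes = {}, {}
    for p in soup.find_all("p"):
        text = p.get_text(" ", strip=True)
        if len(text) < 25:  # skip boilerplate-sized paragraphs
            continue
        # Longer, comma-rich paragraphs look more like prose.
        score = 1 + text.count(",") + min(len(text) // 100, 3)
        key = id(p.parent)  # id() keeps distinct tags distinct as dict keys
        scores[key] = scores.get(key, 0) + score
        nodes[key] = p.parent
    if not scores:
        return None
    return nodes[max(scores, key=scores.get)]
```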
+1 for Readability. Since Readability works best for article pages as opposed to homepages, it would work best when an RSS feed is parsed for article URLs. – nedk Jan 12 '11 at 18:21
I should've added links to the Python ports of the Readability algorithm: github.com/… – gte525u Jan 12 '11 at 20:49
It might be more useful to extract the RSS feeds advertised on that page (via <link rel="alternate" type="application/rss+xml"> tags) and parse the data in the feed to get the main content.
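Feed discovery is mechanical enough to sketch. Assuming BeautifulSoup (the function name and the list of MIME types are mine, covering the RSS and ATOM types mentioned here and in the comment below):

```python
from bs4 import BeautifulSoup

FEED_TYPES = ("application/rss+xml", "application/atom+xml")

def find_feed_urls(html):
    """Return the feed URLs a page advertises in its <head>."""
    soup = BeautifulSoup(html, "html.parser")
    return [link.get("href")
            for link in soup.find_all("link", rel="alternate")
            if link.get("type") in FEED_TYPES]
```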
NB: for ATOM feeds, type="application/atom+xml" – nedk Jan 12 '11 at 17:59
A good idea, but this can be hit or miss, since a lot of feeds only include an article summary. Which is understandable, since the point of most news sites is to get you to view ads, which you generally won't inside an RSS reader. – Cerin Jan 13 '11 at 2:23
A while ago I wrote a simple Python script for just this task. It uses a heuristic to group text blocks together based on their depth in the DOM. The group with the most text is then assumed to be the main content.
It's not perfect, but it generally works well for news sites, where the article is usually the biggest grouping of text, even if broken up into multiple div/p tags. You'd use the script like: python html2text.py <url>.
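A sketch of that depth-grouping heuristic (my own reconstruction, not the author's html2text.py): collect text blocks, bucket them by how deep they sit in the tree, and keep the bucket with the most total text:

```python
from collections import defaultdict
from bs4 import BeautifulSoup

def biggest_depth_group(html):
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    groups = defaultdict(list)
    for p in soup.find_all(["p", "pre", "blockquote"]):
        text = p.get_text(" ", strip=True)
        if text:
            depth = len(list(p.parents))  # how deep in the DOM it sits
            groups[depth].append(text)
    if not groups:
        return ""
    # The depth level with the most total text is assumed to be the article.
    best = max(groups.values(), key=lambda blocks: sum(map(len, blocks)))
    return "\n\n".join(best)
```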
Another possibility for separating "real" content from noise is measuring the HTML density of the parts of an HTML page. You will need to experiment a bit with the thresholds to extract the "real" content, and I guess you could improve the algorithm by applying heuristics to determine the exact bounds of the HTML segment after having identified the interesting content. Update: I just found that the URL above no longer works; here is an alternative link to a cached version on archive.org.
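One crude take on the density idea: for each candidate element, compare the length of its rendered text to the length of its raw markup. The 0.5 threshold and 200-character floor below are guesses you would tune, per the answer's advice:

```python
from bs4 import BeautifulSoup

def dense_candidates(html, threshold=0.5, min_text=200):
    """Return elements whose text-to-markup ratio exceeds `threshold`."""
    soup = BeautifulSoup(html, "html.parser")
    hits = []
    for tag in soup.find_all(["div", "article", "section"]):
        markup_len = len(str(tag))
        text_len = len(tag.get_text(" ", strip=True))
        if markup_len and text_len > min_text and text_len / markup_len > threshold:
            hits.append(tag)
    return hits
```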
I wouldn't try to scrape it from the web page - too many things could mess it up - but instead see which web sites publish RSS feeds. For example, the Guardian's RSS feed has most of the text from their leading articles: feeds.guardian.co.uk/theguardian/rss. I don't know if The Times (the London Times, not NY) has one, because it's behind a paywall. Good luck with that...
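Reading the feed itself is a one-liner with feedparser (assumed installed via pip install feedparser); whether entries carry full text or only summaries varies by site, as the comment below notes:

```python
import feedparser

feed = feedparser.parse("http://feeds.guardian.co.uk/theguardian/rss")
for entry in feed.entries:
    # Full content when the feed provides it, otherwise the summary/abstract.
    body = entry.get("content", [{}])[0].get("value") or entry.get("summary", "")
    print(entry.get("title", ""), "->", len(body), "chars")
```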
Most of the RSS feeds I've seen only have short abstracts of the full articles. – kefeizhou Jan 12 '11 at 18:02.