Python lxml screen scraping?

No screen-scraping library I know "does well with Javascript" -- it's just too hard to anticipate all ways in which JS could alter the HTML DOM dynamically, conditionally &c.

scrape.py can do this for you. It's as simple as:

    import scrape
    s = scrape.Session()
    s.go('yoursite.com')
    print s.doc.text

Jump to about 2:40 in this video for an awesome overview from the creator of scrape.py: pycon.blip.tv/file/3261277.

BeautifulSoup (crummy.com/software/BeautifulSoup/) is often the right answer to python html scraping questions.

I know of no Python HTML parsing libraries that handle running javascript in the page being parsed. It's not "simple enough" for the reasons given by Alex Martelli and more. For this task you may need to think about going to a higher level than just parsing HTML and look at web application testing frameworks.
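To make that concrete, here is a minimal sketch using only the standard library's html.parser (Python 3). The page content, the TextByTag class and the text_by_tag helper are all made up for illustration. It shows that a pure HTML parser sees the script as inert text, so text that the script would insert into the div never shows up in the parse.

```python
from html.parser import HTMLParser

# Hypothetical page: the greeting only exists after a browser runs the script.
PAGE = """
<html><body>
<div id="greeting"></div>
<script>
document.getElementById("greeting").textContent = "Hello from JS";
</script>
</body></html>
"""


class TextByTag(HTMLParser):
    """Record the text found directly inside each kind of tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags
        self.text = {}    # tag name -> list of non-empty text chunks

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, tolerating sloppy markup.
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.stack and data.strip():
            self.text.setdefault(self.stack[-1], []).append(data.strip())


def text_by_tag(source):
    parser = TextByTag()
    parser.feed(source)
    parser.close()
    return parser.text


found = text_by_tag(PAGE)
print(found.get("div"))     # the div's text as the parser sees it
print(found.get("script"))  # the JavaScript source, unexecuted
```

The parser records no text for the div at all; the greeting appears only as part of the script's source code. Getting the rendered text needs something that actually executes the script, which is why the browser-driving tools come up.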

Two that can execute JavaScript and are either Python-based or can interface with Python are PAMIE and Selenium. Unfortunately, I'm not sure if the "unit testing" orientation of these frameworks will actually let you scrape out visible text. So the only other solution would be to do it yourself, say by integrating python-spidermonkey into your app.

Your code is smart and flexible to an extent, I think. How about simply adding handle_starttag() and handle_endtag() to suppress the script blocks?

    import HTMLParser
    import cStringIO

    class HTML2Text(HTMLParser.HTMLParser):
        def __init__(self):
            HTMLParser.HTMLParser.__init__(self)
            self.output = cStringIO.StringIO()
            self.is_in_script = False  # True while inside <script>...</script>

        def get_text(self):
            return self.output.getvalue()

        def handle_data(self, data):
            # Keep text only when we are outside a script block.
            if not self.is_in_script:
                self.output.write(data)

        def handle_starttag(self, tag, attrs):
            if tag == "script":
                self.is_in_script = True

        def handle_endtag(self, tag):
            if tag == "script":
                self.is_in_script = False

    def ParseHTML(source):
        p = HTML2Text()
        p.feed(source)
        text = p.get_text()
        return text
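The parser above targets Python 2 (the HTMLParser and cStringIO modules). For Python 3, a rough equivalent might look like the sketch below. The names parse_html, SKIPPED_TAGS and skip_depth are my own, and skipping style blocks as well as script blocks is an added assumption, not part of the original suggestion.

```python
from html.parser import HTMLParser
from io import StringIO


class HTML2Text(HTMLParser):
    """Collect page text, ignoring anything inside the skipped tags."""

    # Assumption: style blocks are dropped too, since CSS is not visible text.
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.output = StringIO()
        self.skip_depth = 0  # greater than zero while inside a skipped tag

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.output.write(data)

    def get_text(self):
        return self.output.getvalue()


def parse_html(source):
    parser = HTML2Text()
    parser.feed(source)
    parser.close()  # flush any text still buffered by the parser
    return parser.get_text()


sample = "<p>Hello</p><script>var x = 1;</script><style>p {color: red}</style><p>world</p>"
print(parse_html(sample))
```

Note that this only strips the script source from the output. It still does not run any JavaScript, so text generated at load time will be missing.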

