Please help parse this html table using BeautifulSoup and lxml the pythonic way?

Here's a version that uses HTMLParser. I tried it against the contents of pastebin.com/tu7dfeRJ. It copes with the meta tag and doctype declaration, both of which foiled the ElementTree version.

    from HTMLParser import HTMLParser

    class MyParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.line = ""
            self.in_tr = False
            self.in_table = False

        def handle_starttag(self, tag, attrs):
            if self.in_table and tag == "tr":
                self.line = ""
                self.in_tr = True
            if tag == 'a':
                for attr in attrs:
                    if attr[0] == 'href':
                        self.line += attr[1] + " "

        def handle_endtag(self, tag):
            if tag == 'tr':
                self.in_tr = False
                if len(self.line):
                    print self.line
            elif tag == "table":
                self.in_table = False

        def handle_data(self, data):
            if data == "Website":
                self.in_table = 1
            elif self.in_tr:
                data = data.strip()
                if data:
                    self.line += data.strip() + " "

    if __name__ == '__main__':
        myp = MyParser()
        myp.feed(open('table.html').read())

Hopefully this addresses everything you need and you can accept this as the answer. Updated as requested.
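For anyone on Python 3, the same idea can be sketched with the standard library's html.parser module. This is only a rough port under a few assumptions: the class name TableRowParser, the helper parse_rows, and the SAMPLE_HTML markup are made up for illustration (the asker's real file is not reproduced here), and rows are collected into lists instead of being printed.

```python
from html.parser import HTMLParser

# Hypothetical sample markup, loosely modelled on the table described in the question.
SAMPLE_HTML = """<!DOCTYPE html>
<html><head><meta charset="utf-8"></head><body>
<table border="2" width="100%">
<tr><td>Website</td><td>Last Visited</td><td>Last Loaded</td></tr>
<tr><td><a href="http://google.com">Google</a></td><td>01/14/2011</td><td></td></tr>
<tr><td><a href="http://stackoverflow.com">Stack Overflow</a></td><td>01/10/2011</td><td></td></tr>
</table>
</body></html>"""


class TableRowParser(HTMLParser):
    """Collect link targets and cell text for each data row of the 'Website' table."""

    def __init__(self):
        super().__init__()
        self.rows = []          # one list of strings per data row
        self.cells = []         # values gathered for the row being parsed
        self.in_tr = False
        self.in_table = False

    def handle_starttag(self, tag, attrs):
        if self.in_table and tag == "tr":
            self.cells = []
            self.in_tr = True
        # Unlike the Python 2 version, hrefs are only recorded inside a data row.
        if tag == "a" and self.in_tr:
            for name, value in attrs:
                if name == "href":
                    self.cells.append(value)

    def handle_endtag(self, tag):
        if tag == "tr":
            if self.in_tr and self.cells:
                self.rows.append(self.cells)
            self.in_tr = False
        elif tag == "table":
            self.in_table = False

    def handle_data(self, data):
        text = data.strip()
        if text == "Website":
            # The header cell marks the table we care about.
            self.in_table = True
        elif self.in_tr and text:
            self.cells.append(text)


def parse_rows(html_text):
    """Return the data rows found in html_text as a list of lists."""
    parser = TableRowParser()
    parser.feed(html_text)
    parser.close()
    return parser.rows


if __name__ == "__main__":
    for row in parse_rows(SAMPLE_HTML):
        print(" ".join(row))
```

Note that the link text ends up in the row next to the href, just as it would in the original version; drop it in handle_data if only the URL is wanted.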


    >>> from lxml import html
    >>> table_html = """
    ...
    ...
    ... Website
    ... Last Visited
    ... Last Loaded
    ...
    ...
    ...
    ...
    ...
    ... 01/14/2011
    ...
    ...
    ...
    ...
    ...
    ...
    ...
    ...
    ... 01/10/2011
    ...
    ...
    ...
    ...
    ... """
    >>> table = html.fromstring(table_html)
    >>> for row in table.xpath('//table[@border="2" and @width="100%"]/tbody/tr'):
    ...     for column in row.xpath('./td[position()=1]/a/@href | ./td[position()>1]/text() | self::node()[position()=1]/td/text()'):
    ...         print column.strip(),
    ...     print
    ...
    Website Last Visited Last Loaded
    google.com 01/14/2011
    http://stackoverflow.com 01/10/2011
    >>>

Voila ;) Of course, instead of printing you can add your values to nested lists or dicts ;)

Thanks for the lxml implementation! I have yet to check it, since we don't have lxml installed on our machines. An admin has to do that, so I'm waiting :( Can we convert this code to BeautifulSoup for a quick check? – ThinkCode Jan 21 at 18:14

Small typo: change 'table = lxml.fromstring(table_html)' to 'table = html.fromstring(table_html)' and it will run. – Spaceghost Jan 21 at 18:16

You could use the ElementTree etree implementation (which can be easily installed with easy_install or pip) to test it, but I'm not sure whether it supports the full XPath syntax. – virhilo Jan 21 at 18:17

@Spaceghost thanks, I fixed it :) – virhilo Jan 21 at 18:18

There's an error in the HTML (''; should be ''; ) but oddly enough, lxml is OK with it while ElementTree is not. – Spaceghost Jan 21 at 19:27

Here's a version that uses ElementTree and the limited XPath it provides:

    from xml.etree.ElementTree import ElementTree

    doc = ElementTree().parse('table.html')

    for t in doc.findall('.//table'):
        # there may be multiple tables, check we have the right one
        if t.find('./tbody/tr/td').text == 'Website':
            for tr in t.findall('./tbody/tr')[1:]:  # skip the header row
                tds = tr.findall('./td')
                print tds[0][0].attrib['href'], tds[1].text.strip(), tds[2].text.strip()

Results:

    http://google.com 01/14/2011
    http://stackoverflow.com 01/10/2011
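As a rough Python 3 sketch of the same ElementTree approach, the rows can be gathered into a list of dicts rather than printed. The names extract_visits and SAMPLE_XHTML, the dict keys, and the sample markup are all illustrative assumptions; the markup is deliberately well-formed XML, which is what ElementTree requires.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample markup (not the asker's actual file).
SAMPLE_XHTML = """<html><body>
<table border="2" width="100%"><tbody>
<tr><td>Website</td><td>Last Visited</td><td>Last Loaded</td></tr>
<tr><td><a href="http://google.com">Google</a></td><td> 01/14/2011 </td><td></td></tr>
<tr><td><a href="http://stackoverflow.com">Stack Overflow</a></td><td> 01/10/2011 </td><td></td></tr>
</tbody></table>
</body></html>"""


def extract_visits(markup):
    """Return one dict per data row of the table whose first cell is 'Website'."""
    root = ET.fromstring(markup)
    visits = []
    for table in root.findall(".//table"):
        # There may be several tables; keep only the one with the expected header.
        first_cell = table.find("./tbody/tr/td")
        if first_cell is None or first_cell.text != "Website":
            continue
        for tr in table.findall("./tbody/tr")[1:]:  # skip the header row
            tds = tr.findall("./td")
            link = tds[0].find("a")
            visits.append({
                "website": link.get("href") if link is not None else None,
                # .text is None for an empty cell, so fall back to "".
                "last_visited": (tds[1].text or "").strip(),
                "last_loaded": (tds[2].text or "").strip(),
            })
    return visits


if __name__ == "__main__":
    for visit in extract_visits(SAMPLE_XHTML):
        print(visit)
```

Guarding against a missing link or an empty cell avoids the AttributeError that a bare .text.strip() would raise on a blank "Last Loaded" column.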

I get this error:

    doc = ElementTree().parse('test.htm')
    File "/usr/local/Python2.6/lib/python2.6/xml/etree/ElementTree.py", line 586, in parse
      parser.feed(data)
    File "/usr/local/Python2.6/lib/python2.6/xml/etree/ElementTree.py", line 1245, in feed
      self._parser.Parse(data, 0)
    xml.parsers.expat.ExpatError: syntax error: line 2, column 61

– ThinkCode Jan 21 at 21:14

It is complaining about your XML: line 2, column 61. Do you want to post the first lines of that file? Either here or in a pastebin? – Spaceghost Jan 21 at 21:17

Hmmm, looks like it is looking for strict, well-formed text? pastebin.com/8BZQyB3b – ThinkCode Jan 21 at 21:22

Two questions: Are you on Windows? Is that the complete file? – Spaceghost Jan 21 at 21:25

I am on CentOS. No, it is not the complete file; here it is: pastebin.com/tu7dfeRJ – ThinkCode Jan 217 at 18:56
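The ExpatError discussed in these comments is what ElementTree raises when the input is not well-formed XML. A small Python 3 sketch of that behaviour follows; the helper is_well_formed and both sample strings are hypothetical, and the unclosed meta tag is just one plausible trigger, not necessarily the one at line 2, column 61 of the asker's file.

```python
import xml.etree.ElementTree as ET

# Hypothetical inputs: identical except for how the meta tag is closed.
WELL_FORMED = "<html><head><meta charset='utf-8'/></head><body><p>ok</p></body></html>"
# Typical HTML: the meta tag is never closed, which XML does not allow.
TAG_SOUP = "<html><head><meta charset='utf-8'></head><body><p>ok</p></body></html>"


def is_well_formed(markup):
    """Return True if ElementTree's XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        # Python 3 wraps the underlying expat error in ParseError.
        return False
    return True


if __name__ == "__main__":
    print(is_well_formed(WELL_FORMED))
    print(is_well_formed(TAG_SOUP))
```

Real-world HTML often fails this check, which is why a forgiving parser such as HTMLParser or lxml.html tends to be a better fit for scraped pages.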
