Please help parse this HTML table using BeautifulSoup and lxml the Pythonic way?

Here's a version that uses HTMLParser. I tried it against the contents of pastebin.com/tu7dfeRJ. It copes with the meta tag and doctype declaration, both of which foiled the ElementTree version.

    # Python 2 (in Python 3 this module lives at html.parser)
    from HTMLParser import HTMLParser

    class MyParser(HTMLParser):
        def __init__(self):
            HTMLParser.__init__(self)
            self.line = ""
            self.in_tr = False
            self.in_table = False

        def handle_starttag(self, tag, attrs):
            # start a fresh output line for each row of the target table
            if self.in_table and tag == "tr":
                self.line = ""
                self.in_tr = True
            if tag == 'a':
                for attr in attrs:
                    if attr[0] == 'href':
                        self.line += attr[1] + " "

        def handle_endtag(self, tag):
            if tag == 'tr':
                self.in_tr = False
                if len(self.line):
                    print self.line
            elif tag == "table":
                self.in_table = False

        def handle_data(self, data):
            # the "Website" header cell marks the table we want
            if data == "Website":
                self.in_table = True
            elif self.in_tr:
                data = data.strip()
                if data:
                    self.line += data + " "

    if __name__ == '__main__':
        myp = MyParser()
        myp.feed(open('table.html').read())

Hopefully this addresses everything you need and you can accept this as the answer. Updated as requested.
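For anyone on Python 3, here is a rough port of the same idea, offered as a sketch rather than the answerer's code. It collects each row into a list instead of printing it. The `TableRowParser` name and the `SAMPLE` markup are made up for illustration (the real pastebin content is not reproduced here), and unlike the original it only records `href` values while inside a row.

```python
from html.parser import HTMLParser  # Python 3 home of the old HTMLParser module


class TableRowParser(HTMLParser):
    """Collect link targets and cell text for each data row of the target table."""

    def __init__(self):
        super().__init__()
        self.rows = []         # one list of strings per data row
        self.current = None    # cells of the row being read, or None outside a row
        self.in_table = False  # True once the "Website" header cell has been seen

    def handle_starttag(self, tag, attrs):
        if self.in_table and tag == "tr":
            self.current = []
        elif tag == "a" and self.current is not None:
            # keep only href attributes that actually carry a value
            self.current.extend(v for k, v in attrs if k == "href" and v)

    def handle_endtag(self, tag):
        if tag == "tr":
            if self.current:
                self.rows.append(self.current)
            self.current = None
        elif tag == "table":
            self.in_table = False

    def handle_data(self, data):
        text = data.strip()
        if text == "Website":
            self.in_table = True
        elif text and self.current is not None:
            self.current.append(text)


# Hypothetical stand-in for table.html, including a doctype and an unclosed meta tag.
SAMPLE = """<!DOCTYPE html>
<html><head><meta charset="utf-8"></head><body>
<table border="2" width="100%"><tbody>
<tr><td>Website</td><td>Last Visited</td><td>Last Loaded</td></tr>
<tr><td><a href="http://google.com">google.com</a></td><td>01/14/2011</td><td></td></tr>
<tr><td><a href="http://stackoverflow.com">stackoverflow.com</a></td><td>01/10/2011</td><td></td></tr>
</tbody></table>
</body></html>"""

parser = TableRowParser()
parser.feed(SAMPLE)
for row in parser.rows:
    print(" ".join(row))
```

Each row ends up holding the link target, the link text, and whatever cell text follows, so the header row is skipped the same way as in the Python 2 version.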


    >>> from lxml import html
    >>> table_html = """
    ... [table markup lost in extraction; surviving cell text: Website, Last Visited,
    ... Last Loaded, 01/14/2011, 01/10/2011]
    ... """
    >>> table = html.fromstring(table_html)
    >>> for row in table.xpath('//table[@border="2" and @width="100%"]/tbody/tr'):
    ...     for column in row.xpath('./td[position()=1]/a/@href | ./td[position()>1]/text() | self::node()[position()=1]/td/text()'):
    ...         print column.strip(),
    ...     print
    ...
    Website Last Visited Last Loaded
    google.com 01/14/2011
    http://stackoverflow.com 01/10/2011
    >>>

Voilà ;) Of course, instead of printing you can add your values to nested lists or dicts ;)
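To make the "nested lists or dicts" remark concrete, here is a small sketch in plain Python 3 that needs no lxml. It starts from the rows exactly as they appear in the printed output and keys each data row by the header. The `rows_to_dicts` helper is a made-up name, and padding short rows with `None` assumes the missing cells are the trailing ones, which the flattened output cannot confirm.

```python
from itertools import zip_longest

# Rows as they appear in the printed output; the first row is the header.
rows = [
    ["Website", "Last Visited", "Last Loaded"],
    ["google.com", "01/14/2011"],
    ["http://stackoverflow.com", "01/10/2011"],
]


def rows_to_dicts(rows):
    """Turn a header row plus data rows into a list of dicts keyed by header."""
    header, *data = rows
    # Assumption: a short row is missing its trailing cells, so pad with None.
    return [dict(zip_longest(header, row, fillvalue=None)) for row in data]


records = rows_to_dicts(rows)
for record in records:
    print(record)
```

If you need to know for certain which column is empty, collect the `td` elements themselves rather than their text nodes, since an empty cell yields no text node at all.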

Thanks for the lxml implementation! I have yet to check it, since we don't have lxml installed on our machines yet. An admin has to do that, so I'm waiting :( Can we convert this code to BeautifulSoup for a quick check? – ThinkCode Jan 21 at 18:14

Small typo: change 'table = lxml.fromstring(table_html)' to 'table = html.fromstring(table_html)' and it will run. – Spaceghost Jan 21 at 18:16

You could use the ElementTree etree implementation (which can be easily installed with easy_install or pip) to test it, but I'm not sure if it supports the full XPath syntax. – virhilo Jan 21 at 18:17

@Spaceghost thanks, I fixed it :) – virhilo Jan 21 at 18:18

There's an error in the HTML (''; should be '';) but oddly enough, lxml is OK with it but ElementTree is not. – Spaceghost Jan 21 at 19:27

Here's a version that uses ElementTree and the limited XPath it provides:

    from xml.etree.ElementTree import ElementTree

    doc = ElementTree().parse('table.html')

    for t in doc.findall('.//table'):
        # there may be multiple tables, check we have the right one
        if t.find('./tbody/tr/td').text == 'Website':
            for tr in t.findall('./tbody/tr')[1:]:  # skip the header row
                tds = tr.findall('./td')
                print tds[0][0].attrib['href'], tds[1].text.strip(), tds[2].text.strip()

Results:

    http://google.com 01/14/2011
    http://stackoverflow.com 01/10/2011
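A Python 3 sketch of the same ElementTree approach, run against an inline string so it can be tried without a file. The `XHTML` sample and the `extract_rows` helper are hypothetical, and the sample is deliberately well-formed, because ElementTree is a strict XML parser. It also guards against empty cells, whose `.text` is `None`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for table.html.
XHTML = """<html><body>
<table border="2" width="100%"><tbody>
<tr><td>Website</td><td>Last Visited</td><td>Last Loaded</td></tr>
<tr><td><a href="http://google.com">google.com</a></td><td>01/14/2011</td><td></td></tr>
<tr><td><a href="http://stackoverflow.com">stackoverflow.com</a></td><td>01/10/2011</td><td></td></tr>
</tbody></table>
</body></html>"""


def extract_rows(root):
    """Return (href, last_visited, last_loaded) tuples from the 'Website' table."""
    results = []
    for table in root.findall(".//table"):
        first_cell = table.find("./tbody/tr/td")
        # Several tables may exist; keep the one whose first cell is "Website".
        if first_cell is None or first_cell.text != "Website":
            continue
        for tr in table.findall("./tbody/tr")[1:]:  # skip the header row
            tds = tr.findall("./td")
            link = tds[0].find("a")
            href = link.get("href") if link is not None else None
            # An empty <td> has text None, so fall back to "" before stripping.
            dates = [(td.text or "").strip() for td in tds[1:3]]
            results.append((href, *dates))
    return results


rows = extract_rows(ET.fromstring(XHTML))
for row in rows:
    print(row)
```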

I get this error:

    doc = ElementTree().parse('test.htm')
      File "/usr/local/Python2.6/lib/python2.6/xml/etree/ElementTree.py", line 586, in parse
        parser.feed(data)
      File "/usr/local/Python2.6/lib/python2.6/xml/etree/ElementTree.py", line 1245, in feed
        self._parser.Parse(data, 0)
    xml.parsers.expat.ExpatError: syntax error: line 2, column 61

– ThinkCode Jan 21 at 21:14

It is complaining about your XML: line 2, column 61. Do you want to post the first lines of that file? Either here or in a pastebin? – Spaceghost Jan 21 at 21:17

Hmmm, looks like it is looking for strict, well-formed text? pastebin.com/8BZQyB3b – ThinkCode Jan 21 at 21:22

Two questions: Are you on Windows? Is that the complete file? – Spaceghost Jan 21 at 21:25

I am on CentOS. No, it is not the complete file; here it is: pastebin.com/tu7dfeRJ – ThinkCode Jan 217 at 18:56
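The error in this thread comes down to ElementTree expecting well-formed XML, while real-world HTML often is not. The sketch below (Python 3, where the failure surfaces as `ParseError` rather than `ExpatError`) shows the difference on a made-up one-line page with an unclosed meta tag. `PAGE`, `is_well_formed_xml`, and `TagCollector` are illustrative names, not anything from the original file.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical page: the <meta> tag is never closed, which is fine in HTML
# but not well-formed XML.
PAGE = '<html><head><meta charset="utf-8"></head><body><p>hello</p></body></html>'


def is_well_formed_xml(text):
    """Return True if the strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


class TagCollector(HTMLParser):
    """Record start tags to show that the lenient parser gets through the page."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


collector = TagCollector()
collector.feed(PAGE)

print(is_well_formed_xml(PAGE))  # strict XML parser rejects the page
print(collector.tags)            # lenient HTML parser still sees every tag
```

Self-closing the tag (`<meta charset="utf-8"/>`) is enough to make this particular snippet parse as XML, but for pages you don't control a lenient HTML parser is usually the safer choice.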
