Need help with BeautifulSoup (Python) and parsing an HTML table

    from BeautifulSoup import BeautifulSoup

    pageSource = '''...omitted for brevity...'''

    soup = BeautifulSoup(pageSource)
    alltables = soup.findAll("table", {"border": "2", "width": "100%"})

    results = []
    for table in alltables:
        rows = table.findAll('tr')
        lines = []
        for tr in rows:
            cols = tr.findAll('td')
            for td in cols:
                # Keep the full inner markup of the cell, not just its first child
                text = td.renderContents().strip('\n')
                lines.append(text)
        text_table = '\n'.join(lines)
        # Keep only tables whose contents mention 'Website'
        if 'Website' in text_table:
            results.append(text_table)

    print "Number of tables found : ", len(results)
    for result in results:
        print(result)

yields

    Number of tables found : 1
    Website
    Last Visited
    Last Loaded
    01/14/2011
    stackoverflow.com
    01/10/2011
    01/10/2011

Is this close to what you are looking for?

The problem was that td.contents returns a list of NavigableStrings and soup tags. For instance, running print(td.contents) might yield a three-element list in which only one element is the <a> tag, so picking off the first element of the list makes you miss that tag.
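To make the point about td.contents concrete, here is a minimal sketch. It assumes the maintained `bs4` package (beautifulsoup4) rather than the older BeautifulSoup 3 import used in the answer, and the HTML cell is a made-up stand-in because the real page source was omitted.

```python
from bs4 import BeautifulSoup

# Hypothetical cell; the thread's real page source was omitted.
html = (
    '<table><tr><td>\n'
    '<a href="http://example.com">example.com</a>\n'
    '</td></tr></table>'
)

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# .contents lists the child nodes in document order: strings and tags mixed.
kinds = [type(node).__name__ for node in td.contents]
print(kinds)

# The first child is only the newline before the link, not the link itself.
first = td.contents[0]
print(repr(first))

# decode_contents() is the bs4 counterpart of renderContents(): it returns
# the whole inner markup of the cell as a string.
inner = td.decode_contents().strip("\n")
print(inner)
```

Under these assumptions, indexing contents[0] returns just the whitespace string, while the rendered inner markup still contains the link.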

Yeah, this is pretty close, but if another table has the same "border" and "width" values with unwanted content, it is a problem. How do I limit the results to only those tables that have 'website' in them (table contents)? Thank you so much btw! – ThinkCode Jan 26 at 14:41

I've edited the code to show how you might select only those tables that have the string 'Website' in them. I don't know if 'Website' will always be in the first row of the table, so I wrote the code in a more general way, searching for 'Website' anywhere in the entire table. If you want to search for 'Website' only in the first row, I would process rows[0] separately, test for 'Website', then iterate over the rest of the rows with for tr in rows[1:]:. – unutbu Jan 26 at 17:03

Sir, you the man! It works great, thank you tonnes :) – ThinkCode Jan 26 at 17:36
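The first-row variant described in the comments could look roughly like the sketch below. It assumes the `bs4` package rather than BeautifulSoup 3, the two sample tables are invented for illustration, and `cell_texts` is a hypothetical helper, not part of the library. Note that it tests for a cell whose text is exactly 'Website', which is stricter than the substring test in the answer.

```python
from bs4 import BeautifulSoup

# Hypothetical page: two tables with identical attributes, only one wanted.
html = """
<table border="2" width="100%">
  <tr><td>Name</td><td>Score</td></tr>
  <tr><td>alice</td><td>10</td></tr>
</table>
<table border="2" width="100%">
  <tr><td>Website</td><td>Last Visited</td></tr>
  <tr><td><a href="http://example.com">example.com</a></td><td>01/10/2011</td></tr>
</table>
"""


def cell_texts(tr):
    """Return the visible text of each td in a row, whitespace-trimmed."""
    return [td.get_text(strip=True) for td in tr.find_all("td")]


soup = BeautifulSoup(html, "html.parser")

matching = []
for table in soup.find_all("table", {"border": "2", "width": "100%"}):
    rows = table.find_all("tr")
    if not rows:
        continue
    # Process the first row separately and test it for 'Website'.
    header = cell_texts(rows[0])
    if "Website" not in header:
        continue
    # Then iterate over the remaining rows only.
    body = [cell_texts(tr) for tr in rows[1:]]
    matching.append((header, body))

print(matching)
```

With the sample markup above, only the second table survives the filter, even though both tables share the same border and width attributes.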

I answered a similar question here. Hope it will help you.

A layman's solution:

    alltables = soup.findAll("table", {"border": "2", "width": "100%"})
    t = [x for x in soup.findAll('td')]
    [x.renderContents().strip('\n') for x in t]

Output:

    ['Website', 'Last Visited', 'Last Loaded', '', '01/14/2011\n ', '', ' stackoverflow.com\n ', '01/10/2011\n ', '', '', '01/10/2011\n ', '']

Thanks for the link, but I am having problems parsing tables and not just anchor tags. In this case, one of the td contents is a URL and I want to grab everything contained in the tag. – ThinkCode Jan 25 at 19:04

Oh, sorry, I missed it. – Tauquir Jan 25 at 19:13
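For the case raised in this comment, where a cell holds a URL and everything inside the tag is wanted, one possible approach is to collect the visible text, the link target, and the raw inner markup per cell. This is only a sketch: it assumes the `bs4` package, and the sample row and the dictionary keys are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical row; the thread's real page source was omitted.
html = (
    "<table><tr>"
    "<td>01/14/2011</td>"
    '<td><a href="http://example.com/page">example.com</a></td>'
    "</tr></table>"
)

soup = BeautifulSoup(html, "html.parser")

cells = []
for td in soup.find_all("td"):
    link = td.find("a")  # None when the cell has no anchor
    cells.append({
        "text": td.get_text(strip=True),   # visible text only
        "href": link.get("href") if link is not None else None,
        "html": td.decode_contents(),      # everything inside the td
    })

for cell in cells:
    print(cell)
```

Keeping all three views side by side means a plain date cell and a link cell can be handled by the same loop.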

