Need help with BeautifulSoup(Python) and parsing HTML table?

    from BeautifulSoup import BeautifulSoup

    pageSource = '''...omitted for brevity...'''

    soup = BeautifulSoup(pageSource)
    # Candidate tables are selected by their attributes.
    alltables = soup.findAll("table", {"border": "2", "width": "100%"})

    results = []
    for table in alltables:
        rows = table.findAll('tr')
        lines = []
        for tr in rows:
            cols = tr.findAll('td')
            for td in cols:
                # renderContents() keeps the full inner markup of the cell, tags included.
                text = td.renderContents().strip('\n')
                lines.append(text)
        text_table = '\n'.join(lines)
        # Keep only the tables that mention 'Website' somewhere in their cells.
        if 'Website' in text_table:
            results.append(text_table)
    print "Number of tables found : ", len(results)
    for result in results:
        print(result)

yields

    Number of tables found : 1
    Website
    Last Visited
    Last Loaded
    01/14/2011
    stackoverflow.com
    01/10/2011
    01/10/2011

Is this close to what you are looking for?

The problem was that td.contents returns a list of NavigableStrings and soup tags. For instance, running print(td.contents) might yield a list of three elements, so picking off only the first element of the list makes you miss the <a> tag.
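To make the point about td.contents concrete, here is a minimal sketch. It assumes the newer bs4 package (BeautifulSoup 4 on Python 3) rather than the BeautifulSoup 3 import used in the answer, and the HTML cell is a made-up stand-in for the omitted page source. In bs4, decode_contents() plays roughly the role that renderContents() plays above.

```python
from bs4 import BeautifulSoup

# Hypothetical cell (not the asker's real page): newline, an <a> tag, newline.
html = ('<table><tr><td>\n'
        '<a href="http://stackoverflow.com">stackoverflow.com</a>\n'
        '</td></tr></table>')

soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")

# .contents is a list of child nodes, mixing NavigableStrings and Tags.
children = td.contents
first = children[0]  # only the leading newline, so the link is lost

# decode_contents() returns the whole inner markup of the cell as a string.
inner = td.decode_contents().strip("\n")

print([type(child).__name__ for child in children])
print(repr(first))
print(inner)
```

Taking children[0] gives only the whitespace before the link, while the inner markup still contains the <a> tag and its href.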

Yeah, this is pretty close, but if another table has the same "border" and "width" values with unwanted content, it is a problem. How do I limit it to only those tables that have 'website' in them (in the table contents)? Thank you so much, btw! – ThinkCode Jan 26 at 14:41

I've edited the code to show how you might select only those that have the string 'Website' in it. I don't know if 'Website' will always be the first row of the table, so I wrote the code in a more general way, searching for 'Website' anywhere in the entire table. If you want to search for 'Website' only in the first row, I would process rows[0] separately, test for 'Website', then iterate over the rest of the rows with for tr in rows[1:]:. – unutbu Jan 26 at 17:03

Sir, you the man! It works great, thank you tonnes :) – ThinkCode Jan 26 at 17:36
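The first-row variant described in the comment above could look roughly like the sketch below. It is written against bs4 (BeautifulSoup 4, Python 3) instead of the BeautifulSoup 3 import in the answer. The helper name tables_with_header and the two sample tables are hypothetical and only there for illustration.

```python
from bs4 import BeautifulSoup

# Made-up sample: two tables share the same attributes, only one is wanted.
html = """
<table border="2" width="100%">
  <tr><td>Website</td><td>Last Visited</td></tr>
  <tr><td><a href="http://stackoverflow.com">stackoverflow.com</a></td><td>01/10/2011</td></tr>
</table>
<table border="2" width="100%">
  <tr><td>Unrelated</td><td>Header</td></tr>
  <tr><td>foo</td><td>bar</td></tr>
</table>
"""


def tables_with_header(soup, keyword):
    """Return the data rows of each table whose first row contains keyword."""
    results = []
    for table in soup.find_all("table", {"border": "2", "width": "100%"}):
        rows = table.find_all("tr")
        if not rows:
            continue
        # Process rows[0] separately and test it for the keyword.
        header = [td.get_text(strip=True) for td in rows[0].find_all("td")]
        if keyword not in header:
            continue
        # Then iterate over the rest of the rows, keeping each cell's markup.
        body = [[td.decode_contents().strip() for td in tr.find_all("td")]
                for tr in rows[1:]]
        results.append(body)
    return results


soup = BeautifulSoup(html, "html.parser")
matches = tables_with_header(soup, "Website")
print("Number of tables found :", len(matches))
for table_rows in matches:
    for row in table_rows:
        print(row)
```

Checking only the header row avoids false matches when 'Website' happens to appear in the body of an unrelated table, at the cost of assuming the header is always the first row.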

I answered a similar question here. Hope it will help you. A layman's solution:

    alltables = soup.findAll("table", {"border": "2", "width": "100%"})
    t = [x for x in soup.findAll('td')]
    [x.renderContents().strip('\n') for x in t]

Output:

    ['Website', 'Last Visited', 'Last Loaded', '', '01/14/2011\n ', '', ' stackoverflow.com\n ', '01/10/2011\n ', '', '', '01/10/2011\n ', '']

Thanks for the link, but I am having problems parsing tables and not just anchor tags. In this case, one of the td contents is a URL and I want to grab everything contained in the tag. – ThinkCode Jan 25 at 19:04

Oh, sorry, I missed it. – Tauquir Jan 25 at 19:13
