How can I prevent closing of tags in bad HTML using BeautifulSoup (Python)?

BeautifulSoup excels at parsing and extracting data from badly formatted HTML/XML, but when the broken HTML is ambiguous it falls back on a set of rules to interpret the tags, and the result may not be what you want. See the section on Parsing HTML in the docs, which ends with an example that sounds very similar to your situation.

If you know what's wrong with your tags and understand the rules that BeautifulSoup uses, you may be able to adjust your HTML slightly (perhaps remove or move certain tags) to make BeautifulSoup return the output you want.

If you can post a short example, someone might be able to give you more specific help.

Update (some examples)

For example, consider the example given in the docs (linked above):

    from BeautifulSoup import BeautifulSoup
    html = """
    <html>
    <form>
    <table>
    <td><input name="input1">Row 1 cell 1
    <tr><td>Row 2 cell 1
    </form>
    <td>Row 2 cell 2<br>This</br> sure is a long cell
    </body>
    </html>"""
    print BeautifulSoup(html).prettify()

The <table> tag will be closed before </form> to ensure that the table is properly nested within the form, leaving the last <td> hanging.

If we understand the problem, we can get the correct closing tag (</table>) by removing "<form>" before parsing:

    >>> html = html.replace("<form>", "")
    >>> soup = BeautifulSoup(html)
    >>> print soup.prettify()

This time the last cell ("Row 2 cell 2 This sure is a long cell") stays inside the table.

If the <form> tag IS important, you can still add it after parsing. For example:

    >>> from BeautifulSoup import Tag
    >>> new_form = Tag(soup, "form")  # create form element
    >>> soup.html.insert(0, new_form)  # insert form as child of html
    >>> new_form.insert(0, soup.table.extract())  # move table into form
    >>> print soup.prettify()

The output now has the complete table, including the long cell, nested inside the new <form> element.
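The "remove the troublesome tag before parsing" step can also be written as a small reusable helper. The following is only a rough sketch using the standard library; the function name strip_tag_markup and the sample markup are made up for illustration and are not part of BeautifulSoup. It deletes the opening and closing tags for one element name while leaving the content between them alone, so the cleaned string can then be handed to the parser.

```python
import re

def strip_tag_markup(html, tag_name):
    """Remove <tag ...> and </tag> for one tag name, keeping the inner content.

    Hypothetical helper: a crude regex, good enough for pre-cleaning a known
    problem tag. It will misfire if an attribute value contains '>'.
    """
    pattern = re.compile(r"</?" + re.escape(tag_name) + r"\b[^>]*>", re.IGNORECASE)
    return pattern.sub("", html)

# Illustrative broken markup: the form is closed in the middle of the table.
broken = '<form action="/x"><table><td>Row 1<tr><td>Row 2</FORM ><td>Row 3</table>'
cleaned = strip_tag_markup(broken, "form")
print(cleaned)
```

Because the pattern requires a word boundary after the tag name, stripping "form" leaves an element such as <formula> untouched. The cleaned string still has to be parsed, and the form can be re-created afterwards if it matters.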

Setting formatter=None may help (http://www.crummy.com/software/BeautifulSoup/bs4/doc/#output-formatters), but needing it might be an indication that your HTML is invalid. If that doesn't work, can you provide some sample code and HTML which reproduces the problem?
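As a small illustration of what formatter=None changes, here is a sketch that assumes the bs4 package (Beautiful Soup 4) is installed and uses the built-in html.parser backend; the sample markup is invented. The formatter only affects how text is written back out, not how the tags were parsed.

```python
from bs4 import BeautifulSoup

# Invented sample: the text contains an escaped '<'.
soup = BeautifulSoup("<p>a &lt; b</p>", "html.parser")

# Default ("minimal") formatter re-escapes special characters on output.
default_output = soup.p.decode()

# formatter=None writes the text as-is, which can produce invalid HTML.
raw_output = soup.p.decode(formatter=None)

print(default_output)  # <p>a &lt; b</p>
print(raw_output)      # <p>a < b</p>
```

So this option is mainly useful when the escaping in the output is the problem; it will not change where Beautiful Soup decides to close a tag.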
