Regular Expression in Python - Parsing HTML [closed]

Use BeautifulSoup. Or be sad. HTML and regular expressions don't mix.

Here's the entire program:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    # Grab your html (urlopen needs a full URL, scheme included)
    html = urllib2.urlopen("http://google.com").read()

    # Create a soup object
    soup = BeautifulSoup(html)

    # Find all the spans, even if they're malformed
    spans = soup.findAll("span")

    # Remove all the spans from the soup object
    [span.extract() for span in spans]

    # Dump your new HTML to stdout
    print soup

While I agree, for this particular thing there's no reason to introduce Beautiful Soup. – Nick Stinemates Jan 15 '09 at 4:49

No? How about a span in a comment? Or as a string in JavaScript code? Or one that's malformed? – Triptych Jan 15 '09 at 5:00

This is good; however, I think using a list comprehension solely for a side effect is bad form. Recommend a plain for loop here. – Dustin Jan 15 '09 at 5:15

I should also download a gzipped version of the HTML, wrap it in a try/except block, encode the output, etc. Just trying to keep it simple. – Triptych Jan 15 '09 at 5:29

I second Dustin's opinion. Don't use a list comprehension if you don't need a list of the results. – nosklo Jan 15 '09 at 13:41
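To make Triptych's point about comments and JavaScript strings concrete, here is a small sketch. It is not the answer's BeautifulSoup code: it assumes Python 3 and swaps in the standard library's `html.parser` so it runs without extra installs. The sample markup and the `SpanCounter` class are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: one real span, one inside an HTML comment,
# and one inside a JavaScript string.
HTML = (
    '<p>before</p>'
    '<!-- <span>commented out</span> -->'
    '<script>var s = "<span>in a string</span>";</script>'
    '<span>real</span>'
)


class SpanCounter(HTMLParser):
    """Counts <span> start tags that the parser sees as actual elements."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.count += 1


# A naive regex counts every textual occurrence of "<span ...>".
regex_hits = len(re.findall(r"<span[^>]*>", HTML))

# The parser treats comment and script contents as opaque text.
counter = SpanCounter()
counter.feed(HTML)
counter.close()
parser_hits = counter.count

print(regex_hits, parser_hits)
```

The regex reports three spans while the parser reports one, which is the kind of mismatch the comment above is warning about.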

You have to be careful with regular expressions---they won't work if spans are nested. BeautifulSoup looks like a nice tool.

Will match the tags and anything in between them.

-1: there are a lot of cases where this fails, nested <span>s for instance. – nosklo Jan 15 '09 at 13:43
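A quick sketch of the nesting failure nosklo mentions. The pattern from the answer above did not survive on this page, so `SPAN_RE` below is an assumed, typical "match the tags and anything in between" regex, not necessarily the one originally posted. Python 3 is assumed.

```python
import re

# Assumed pattern (hypothetical): an opening span tag, anything, a closing tag.
SPAN_RE = re.compile(r"<span[^>]*>.*?</span>", re.DOTALL)

flat = "a<span>b</span>c"
nested = "a<span>b<span>c</span>d</span>e"

# With no nesting the span is removed cleanly.
flat_result = SPAN_RE.sub("", flat)

# With nesting, the non-greedy match stops at the FIRST closing tag,
# so the outer span's tail and its closing tag are left behind.
nested_result = SPAN_RE.sub("", nested)

print(flat_result)    # ac
print(nested_result)  # ad</span>e
```

Making the quantifier greedy does not help either: it then swallows everything between the first opening tag and the last closing tag, including any text that sits between two separate spans.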

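Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression. In general, if a string p matches A and another string q matches B, the string pq will match AB. Thus, complex expressions can easily be constructed from simpler primitive expressions like the ones described here.

For details of the theory and implementation of regular expressions, consult almost any textbook about compiler construction. A brief explanation of the format of regular expressions follows. For further information and a gentler presentation, consult the Regular Expression HOWTO.

Regular expressions can contain both special and ordinary characters. Most ordinary characters are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so last matches the string 'last'.

Some characters, like '|' or '(', are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted. Pattern strings can specify the null byte using the \number notation.

'.' (Dot.) In the default mode, this matches any character except a newline.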
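The rules quoted above can be checked with a few lines of Python 3 using the standard `re` module. This is only an illustrative sketch; the sample patterns and strings are made up, and `re.DOTALL` is brought in here just to show what "the default mode" is being contrasted with.

```python
import re

# Concatenation: if p matches A and q matches B, then pq matches AB.
A, B = r"la", r"st"
AB = A + B
concat_ok = (
    re.fullmatch(A, "la") is not None
    and re.fullmatch(B, "st") is not None
    and re.fullmatch(AB, "last") is not None
)

# '|' is special (alternation) unless it is escaped.
alt_matches = re.findall(r"a|b", "a|b")    # 'a' and 'b' found separately
literal_pipe = re.findall(r"a\|b", "a|b")  # the whole literal string 'a|b'

# '.' matches any character except a newline in the default mode.
dot_default = re.fullmatch(r"a.b", "a\nb")         # no match
dot_all = re.fullmatch(r"a.b", "a\nb", re.DOTALL)  # matches with DOTALL

print(concat_ok, alt_matches, literal_pipe, dot_default is None, dot_all is not None)
```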

