Regular Expression in Python - Parsing HTML [closed]

Use BeautifulSoup. Or be sad. HTML and regular expressions don't mix.

Here's the entire program:

    # Python 2 with BeautifulSoup 3
    import urllib2
    from BeautifulSoup import BeautifulSoup

    # Grab your html
    html = urllib2.urlopen("http://google.com").read()

    # Create a soup object
    soup = BeautifulSoup(html)

    # Find all the spans, even if they're malformed
    spans = soup.findAll("span")

    # Remove all the spans from the soup object
    [span.extract() for span in spans]

    # Dump your new HTML to stdout
    print soup

While I agree, for this particular thing there's no reason to introduce Beautiful Soup. – Nick Stinemates Jan 15 '09 at 4:49

No? How about a span in a comment? Or as a string in JavaScript code? Or one that's malformed? – Triptych Jan 15 '09 at 5:00

This is good; however, I think using a list comprehension solely for a side effect is bad form. Recommend a plain for loop here. – Dustin Jan 15 '09 at 5:15

I should also download a gzipped version of the HTML, wrap it in a try/except block, encode the output, etc. Just trying to keep it simple. – Triptych Jan 15 '09 at 5:29

I second Dustin's opinion. Don't use a list comprehension if you don't need a list of the results. – nosklo Jan 15 '09 at 13:41
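To make Triptych's objection concrete (a span inside a comment, or inside a JavaScript string), here is a rough Python 3 sketch. The sample page, the `SpanCounter` class and both helper functions are made up for illustration, and it uses the standard library's `html.parser` instead of BeautifulSoup only to stay dependency-free. A naive regex counts every piece of text that looks like a span tag, while a parser only counts real elements.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page: one real span, one inside an HTML comment,
# and one inside a JavaScript string literal.
SAMPLE = """<html><body>
<span>real</span>
<!-- <span>commented out</span> -->
<script>var s = "<span>in a string</span>";</script>
</body></html>"""


class SpanCounter(HTMLParser):
    """Counts <span> start tags that the parser treats as real elements."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.count += 1


def count_spans_with_regex(html):
    # Counts every textual occurrence of "<span", wherever it appears.
    return len(re.findall(r"<span\b", html))


def count_spans_with_parser(html):
    # Comments and <script> bodies are not reported as start tags.
    counter = SpanCounter()
    counter.feed(html)
    counter.close()
    return counter.count


print(count_spans_with_regex(SAMPLE))   # 3
print(count_spans_with_parser(SAMPLE))  # 1
```

The regex sees three "spans" and would happily mangle the comment and the script; the parser sees the one that is actually an element.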

You have to be careful with regular expressions: they won't work if spans are nested. BeautifulSoup looks like a nice tool.

Will match the tags and anything in between them.

-1: there are a lot of cases where this fails, nested spans for instance. – nosklo Jan 15 '09 at 13:43
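The nested case is easy to reproduce. The pattern below is an assumption on my part (a typical non-greedy "opening tag, anything, closing tag" regex), not necessarily the one from the answer above, and `strip_spans_with_regex` is a hypothetical helper name. It works on a flat span and leaves debris behind on a nested one.

```python
import re

# Hypothetical pattern: an opening span tag, anything in between
# (non-greedy), and the first closing span tag that follows.
SPAN_RE = re.compile(r"<span[^>]*>.*?</span>", re.DOTALL)


def strip_spans_with_regex(html):
    """Remove every match of SPAN_RE from html."""
    return SPAN_RE.sub("", html)


flat = "a<span>b</span>c"
nested = "a<span>b<span>c</span>d</span>e"

print(strip_spans_with_regex(flat))    # ac
print(strip_spans_with_regex(nested))  # ad</span>e
```

The non-greedy match stops at the first closing tag, which belongs to the inner span, so the outer span's text and closing tag are left in the output. Making the match greedy doesn't help either: it would then swallow everything between the first opening tag and the last closing tag on the page.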

If A and B are both regular expressions, then AB is also a regular expression. In general, if a string p matches A and another string q matches B, the string pq will match AB. Complex expressions can thus be built from simpler primitive expressions like the ones described here. For the theory of regular expressions, consult almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further information and a gentler presentation, consult the Regular Expression HOWTO.

Regular expressions can contain both special and ordinary characters. Most ordinary characters are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so last matches the string 'last'.

Some characters, like '|' or '(', are special. Special characters either stand for classes of ordinary characters or affect how the regular expressions around them are interpreted. The null byte can be specified using the \number notation. The special character '.' (Dot.) matches, in the default mode, any character except a newline.
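A few of these rules can be checked directly with Python 3's `re` module. This is only a small sketch; the patterns `A` and `B` and the helper `dot_matches` are made-up names for illustration.

```python
import re

# Ordinary characters match themselves, so "last" matches the string 'last'.
assert re.fullmatch(r"last", "last") is not None

# Concatenation: if p matches A and q matches B, then pq matches AB.
A, B = r"ab+", r"c?d"
assert re.fullmatch(A, "abb") is not None
assert re.fullmatch(B, "d") is not None
assert re.fullmatch(A + B, "abb" + "d") is not None

# '|' is special: it separates alternatives instead of matching itself.
assert re.fullmatch(r"a|b", "b") is not None


def dot_matches(text, flags=0):
    """Return True if the pattern '.' matches the whole of text."""
    return re.fullmatch(r".", text, flags) is not None


print(dot_matches("x"))               # True
print(dot_matches("\n"))              # False
print(dot_matches("\n", re.DOTALL))   # True
```

The last point matters for HTML: without `re.DOTALL`, a pattern built on `.` will not match an element whose content runs across several lines.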



