Regular expression to extract text from HTML?

You can't really parse HTML with regular expressions. It's too complex. RE's won't handle <![CDATA[ sections correctly at all. Further, some kinds of common HTML things like &lt;text> will work in a browser as proper text, but might baffle a naive RE.

You'll be happier and more successful with a proper HTML parser. Python folks often use something like Beautiful Soup to parse HTML and strip out tags and scripts.

Also, browsers, by design, tolerate malformed HTML. So you will often find yourself trying to parse HTML which is clearly improper, but happens to work okay in a browser.

You might be able to parse bad HTML with RE's. All it requires is patience and hard work. But it's often simpler to use someone else's parser.
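
As a rough illustration of the "use a proper parser" advice, here is a minimal sketch that uses Python's standard-library html.parser rather than Beautiful Soup (a deliberate swap, so the example has no third-party dependency). The names TextExtractor and extract_text are made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text content while ignoring anything inside script/style."""

    SKIPPED_TAGS = ("script", "style")

    def __init__(self):
        # convert_charrefs=True turns entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style elements.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the text of an HTML string, joined with single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The parser tolerates the unclosed tag and the markup-like string inside the script element, which are the kinds of input that tend to trip up a hand-written regex.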

Definitely use a specialized HTML parser - don't roll your own! I just wanted to suggest Hpricot if you're using Ruby. – Neall Oct 8 '08 at 2:52

Why should &lt;text> baffle a RE? Most would just be set up to ignore it, which is correct: it's text, not HTML. If it's because they parse HTML entities (a good idea I suppose) you should be doing that on the text AFTER your RE's, not on the HTML anyway... – Matthew Scharley Oct 8 '08 at 10:19

@monoxide: My point is not that it's impossible. My point is that you can save a lot of debugging of RE's by using someone else's parser that handles all the edge cases correctly. – S. Lott Oct 8 '08 at 12:36

+1 but I think the point about malformed HTML is irrelevant here: since we specifically aren't trying to parse the HTML, it's ok to have a regex which just pulls out anything which looks like a tag regardless of structure. – annakata Dec 8 '08 at 11:21

@annakata: "pulling out anything which looks like a tag" more-or-less IS parsing. Because HTML is a language that is more complex than RE's are designed to describe, parsing is about the only way to find anything in HTML. RE's are always defeated except in trivial cases. – S. Lott Dec 8 '08 at 11:25

Contemplating doing this with regular expressions is daunting. Have you considered XSLT? The XPath expression to extract all of the text nodes in an XHTML document, minus script & style content, would be:

//body//text()[not(ancestor::script)][not(ancestor::style)]
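
For readers who want to try the idea without an XSLT processor, here is a minimal Python sketch. It assumes well-formed XHTML, and because the standard-library xml.etree.ElementTree supports only a limited XPath subset (no text() or ancestor axis), it walks the tree by hand to collect the same text nodes the expression describes. The helper names local_name, iter_text and body_text are invented for this example.

```python
import xml.etree.ElementTree as ET

SKIPPED = {"script", "style"}


def local_name(tag):
    """Drop an XML namespace prefix such as '{http://www.w3.org/1999/xhtml}'."""
    return tag.rsplit("}", 1)[-1]


def iter_text(element):
    """Yield text nodes under element in document order, skipping script/style."""
    if local_name(element.tag) in SKIPPED:
        return
    if element.text:
        yield element.text
    for child in element:
        yield from iter_text(child)
        # The tail is text that follows the child but belongs to this element.
        if child.tail:
            yield child.tail


def body_text(xhtml):
    """Return the list of text nodes found inside every body element."""
    root = ET.fromstring(xhtml)
    bodies = [el for el in root.iter() if local_name(el.tag) == "body"]
    return [text for body in bodies for text in iter_text(body)]
```

As the comments on this answer point out, this only works when the input parses as XML; malformed real-world HTML will raise a parse error.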

Simple and Elegant == Beautiful. – Pablo Fernandez Oct 8 '08 at 1:56

That would probably work, except that it would also return text (i.e. code) from within script tags. – Kibbee Oct 8 '08 at 2:00

True enough, see edit. There may be other special cases, but that's the general idea. – Chris Noe Oct 8 '08 at 2:19

Will not work on real-world HTML pages, i.e. when the HTML is malformed non-XHTML. Most XML parsers don't support "real-world HTML". That's why I've used HtmlAgilityPack (Google it) for exactly this type of task in the past. – Ashley Henderson Apr 29 '09 at 8:42

Indeed, that is a consistent pain. Another option is to pre-process the page with tidy. – Chris Noe Apr 29 '09 at 18:17

Using Perl syntax for defining the regexes, a start might be: !(.*)!smi Then apply the following replacements to the result of that group: !smi! +/ \t*>!smi!smi ///smi

This of course won't format things nicely as a text file, but it will strip out all the HTML (mostly; there are a few cases where it might not work quite right).
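
The approach described above (capture the body, then delete scripts, comments and remaining tags) can be sketched in Python roughly as follows. The patterns here are assumptions written for illustration, not the answer's original expressions, and regex_strip is a made-up name.

```python
import re

# Roughly the Perl "s" and "i" modifiers: dot matches newlines, ignore case.
FLAGS = re.DOTALL | re.IGNORECASE


def regex_strip(html):
    """Crude regex-only tag stripping; unreliable on malformed input."""
    match = re.search(r"<body[^>]*>(.*)</body>", html, FLAGS)
    text = match.group(1) if match else html
    # Remove whole script blocks first so their code is not kept as text.
    text = re.sub(r"<script.*?</script>", "", text, flags=FLAGS)
    # Remove comments, then anything else that looks like a tag.
    text = re.sub(r"<!--.*?-->", "", text, flags=FLAGS)
    text = re.sub(r"<[^>]+>", "", text, flags=FLAGS)
    return text
```

The order matters: stripping generic tags before script blocks would leave the script source behind as plain text.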

A better idea though is to use an XML parser in whatever language you are using to parse the HTML properly and extract the text out of that.

If you're using PHP, try Simple HTML DOM, available at SourceForge. Otherwise, Google html2text, and you'll find a variety of implementations for different languages that basically use a series of regular expressions to suck out all the markup. Be careful here, because tags without endings can sometimes be left in, as well as special characters such as & (which is written as &amp; in HTML).

Also, watch out for comments and JavaScript. I've found them particularly annoying to deal with using regular expressions, which is why I generally prefer to let a free parser do all the work for me.
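
To make those pitfalls concrete, here is a small Python sketch (the function name strip_and_decode and the patterns are assumptions for this example). It removes comments and tags with regular expressions, then decodes entities with the standard-library html.unescape, and it shows how an unterminated tag slips through.

```python
import html
import re


def strip_and_decode(fragment):
    """Strip comments and complete tags, then decode entities like &amp;."""
    without_comments = re.sub(r"<!--.*?-->", "", fragment, flags=re.DOTALL)
    # Only matches tags that have a closing '>'; unterminated tags survive.
    without_tags = re.sub(r"<[^>]+>", "", without_comments)
    # Decode last, so that &lt;b&gt; is not mistaken for a real tag.
    return html.unescape(without_tags)
```

For instance, strip_and_decode("ok <b unclosed") returns the input unchanged, which is exactly the "tags without endings" problem mentioned above.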

I believe you can just do document.body.innerText, which will return the content of all text nodes in the document, visible or not. Edit (olliej): sigh, never mind, this only works in Safari and IE, and I can't be bothered downloading a Firefox nightly to see if it exists in trunk :-/

Nope, that is undefined in FF3 – Chris Noe Oct 8 '08 at 12:49

textContent is a standard equivalent – porneL Oct 12 '08 at 19:55

I use iMacros for Firefox for extracting stock quotes. It includes a useful general purpose text extraction feature. https://addons.mozilla.org/en-US/firefox/addon/3863 wiki: Text Extraction Jim2.

The simplest way for simple HTML (example in Python): use re.findall to split the string into tags and the runs of text between them, strip() each text run, and " ".join the results. For the sample string in this answer, that gives: This is my> example HTML, containing tags.
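
A minimal runnable version of that idea might look like the sketch below. The regular expression and the sample markup are assumptions chosen to reproduce the output quoted above, and strip_tags_simple is a made-up name.

```python
import re


def strip_tags_simple(html):
    """Split into tags and text runs, then keep only the text runs."""
    # First alternative matches a tag, second matches text up to the next '<'.
    pieces = re.findall(r"<[^>]+>|[^<]+", html)
    texts = (piece.strip() for piece in pieces if not piece.startswith("<"))
    return " ".join(text for text in texts if text)


sample = "<p>This is my> <strong>example</strong>HTML,<br /> containing tags</p>"
print(strip_tags_simple(sample))
```

Note that a stray ">" in the text survives, because only "<" starts a tag in this pattern.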

Here's a function, strip_html_tags( $text ), to remove even the most complex HTML tags. It passes $text through preg_replace with an array of patterns, each delimited by @ and using the siu modifiers, that first remove invisible content and then add line breaks before and after blocks.
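
The two phases named in that function (remove invisible content, then add line breaks around blocks) can be sketched in Python as follows. This is not a port of the PHP code; the lists of element names are assumptions picked for illustration.

```python
import re

FLAGS = re.DOTALL | re.IGNORECASE
# Assumed examples of elements whose content is never shown to the reader.
INVISIBLE = ("head", "style", "script")
# Assumed examples of block-level elements that should start a new line.
BLOCKS = ("p", "div", "br", "li", "h1", "h2", "h3", "tr")


def strip_html_tags(text):
    """Remove invisible content, break lines at blocks, strip other tags."""
    for name in INVISIBLE:
        text = re.sub(rf"<{name}\b[^>]*>.*?</{name}>", " ", text, flags=FLAGS)
    block_pattern = rf"</?(?:{'|'.join(BLOCKS)})\b[^>]*>"
    text = re.sub(block_pattern, "\n", text, flags=FLAGS)
    text = re.sub(r"<[^>]+>", "", text)
    # Collapse runs of whitespace and drop lines that end up empty.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)
```

Like any regex-based stripper, this is only a sketch and will misbehave on sufficiently malformed markup.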

System.Windows.Forms.WebBrowser wc = new System.Windows.Forms.WebBrowser();
wc.DocumentText = "blah blahfoo";
System.Windows.Forms.HtmlDocument h = wc.Document;
Console.WriteLine(h.Body.InnerText);


