RegEx match open tags except XHTML self-contained tags?


You can't parse XHTML with regex. Because HTML can't be parsed by regex. Regex is not a tool that can be used to correctly parse HTML. As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.

Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML. HTML is not a regular language and hence cannot be parsed by regular expressions. Regex queries are not equipped to break down HTML into its meaningful parts.

So many times but it is not getting to me. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. You will never make me crack.

HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions. Every time you attempt to parse HTML with regular expressions, the unholy child weeps the blood of virgins, and Russian hackers pwn your webapp.

Parsing HTML with regex summons tainted souls into the realm of the living. HTML and regex go together like love, marriage, and ritual infanticide. The <center> cannot hold it is too late.

The force of regex and HTML together in the same conceptual space will destroy your mind like so much watery putty. If you parse HTML with regex you are giving in to Them and their blasphemous ways which doom us all to inhuman toil for the One whose Name cannot be expressed in the Basic Multilingual Plane, he comes. HTML-plus-regexp will liquify the n​erves of the sentient whilst you observe, your psyche withering in the onslaught of horror.

Rege̿̔̉x-based HTML parsers are the cancer that is killing StackOverflow it is too late it is too late we cannot be saved the trangession of a chiÍ¡ld ensures regex will consume all living tissue (except for HTML which it cannot, as previously prophesied) dear lord help us how can anyone survive this scourge using regex to parse HTML has doomed humanity to an eternity of dread torture and security holes using regex as a tool to process HTML establishes a breach between this world and the dread realm of c͒ͪo͛ͫrrupt entities (like SGML entities, but more corrupt) a mere glimpse of the world of reg​ex parsers for HTML will ins​tantly transport a programmer's consciousness into a world of ceaseless screaming, he comes, the pestilent slithy regex-infection wil​l devour your HT​ML parser, application and existence for all time like Visual Basic only worse he comes he comes do not fi​ght he comÌ¡e̶s, Ì•h̵i​s un̨hoÍžly radian͝ce͝ destroÒ‰ying all enlḯ̝̂̈́ghtenment, HTML tags leaÍ ki̧n͘g fr̶ǫm Ì¡yo​͟ur eyeÍ¢s̸ Ì›lÌ•ik͝e liq​uid pain, the song of re̸gular exp​ression parsing will exti​nguish the voices of mor​tal man from the sp​here I can see it can you see ̲͚̖͔̙î̩͝t̝̲͎̩̱͔̋̀ it is beautiful t​he final snuffing of the lie​s of Man ALL IS LOS̝̝͖̩͇̗̪̈́T ALL I​S LOST the ponÌ·y he comes he c̶̮omes he comes the ich​or permeates all MY FACE MY FACE áµ’h god no NO NOO̼O​O NΘ stop the an​*̶͑̾̾​̙̤̅ͫ͝g͇̫͛͆̾ͫ̑͆l̝͖͉̗̩̳̟ͫͥͨeÌ…Ì s ÍŽa̧͈͖r̽̾̈́͒͑e n​ot rè̑ͧ̌aͨl̘̝̙̃ͤ͂̾̆ ZAÍ Ì¡ÍŠÍLGÎŒ ISͮ̂҉̯͈͕̹̘̱ TO͇̹̺ͅƝ̴ȳ̳ TH̘Ë͖́̉ Í P̯̭͝O̚​N̝YÌ¡ H̸̡̪̯ͨ͊̽̅̾̎Ȩ̝̬̩̾͛ͪ̈́̀͘ ̶̧̨̱̹̭̯ͧ̾ͬC̷̝̙̲̝͖ͭͥͮ͟O̮̪̝ͮ͝͝M̲̖͊̒ͪͩͬ̚̚͜Ȇ̴̟̟͙̞ͩ͌͝S̨̥̫͎̭ͯ̿̔̀ͅ Have you tried using an XML parser instead?

– Kobi Nov 13 '09 at 23:07

177 Kobi: I think it's time for me to quit the post of Assistant Don't Parse HTML With Regex Officer. No matter how many times we say it, they won't stop coming every day... every hour even. It is a lost cause, which someone else can fight for a bit. So go on, parse HTML with regex, if you must. It's only broken code, not life and death. – bobince Nov 13 '09 at 23:18

218 ++ for "The <center> cannot hold" – Horace Loeb Nov 13 '09 at 23:27

755 Chuck Norris can parse HTML with regex. – THX-1138 Nov 14 '09 at 0:03

392 A true work of art; I weep at the poetic beauty. – Marc Gravell♦ 8 Nov1 at 0:29.

While it is true that asking regexes to parse arbitrary HTML is like asking Paris Hilton to write an operating system, it's sometimes appropriate to parse a limited, known set of HTML. If you have a small set of HTML pages that you want to scrape data from and then stuff into a database, regexes might work fine. For example, I recently wanted to get the names, parties, and districts of Australian federal Representatives, which I got off of the Parliament's Web site.

This was a limited, one-time job. Regexes worked just fine for me, and were very fast to set up.

361 +1 for incorporating Paris – Andrew Song Nov 14 '09 at 18:49

205 Great, we're now debating the possibility of Chuck Norris parsing HTML with regular expressions... and Paris Hilton writing an operating system. Jon Skeet, however, can do both AND paris hilton. – Tim Post♦ Nov 16 '09 at 5:12

32 Oh good grief. To clarify, I did not say Jon Skeet DID Paris, I just said he COULD, since he is (probably) appropriately anatomically equipped to do so. Enough with the prank e-mails asking for photographic evidence. An interlude between Jon Skeet and Tony The Pony is MUCH funnier than an interlude between Jon Skeet and Paris, thank you. 110 emails so far, please, TAKE MY COMMENT AS HUMOR. – Tim Post♦ Nov 24 '09 at 18:58

51 In Soviet Russia HTML parses you. – parxier Mar 14 '10 at 13:01

42 Er, Paris .. Microsoft bought it and called it "Windows ME" – Robert Fraser May 31 '10 at 10:34.

I think the flaw here is that HTML is a Chomsky Type 2 grammar (context free grammar) and RegEx is a Chomsky Type 3 grammar (regular expression). Since a Type 2 grammar is fundamentally more complex than a Type 3 grammar - you can't possibly hope to make this work. But many will try, some will claim success and others will find the fault and totally mess you up.

Chomsky hierarchy.
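The hierarchy argument can be made concrete with a small Python sketch. The sample string and the helper name `outer_div_content` are assumptions made up for illustration. A non-greedy pattern pairs the outer opening tag with the first closing tag it meets, while a plain depth counter (the extra memory a regular expression does not have) finds the right one.

```python
import re

HTML = "<div>outer <div>inner</div> tail</div>"

# A non-greedy regex stops at the FIRST closing tag, so it pairs the outer
# <div> with the inner </div>.
naive = re.search(r"<div>(.*?)</div>", HTML).group(1)


def outer_div_content(text):
    """Return the content of the first <div>, tracking nesting with a counter."""
    depth = 0
    start = None
    # The regex is only used to tokenize; the counter does the matching.
    for m in re.finditer(r"</?div>", text):
        if m.group() == "<div>":
            if depth == 0:
                start = m.end()
            depth += 1
        else:
            depth -= 1
            if depth == 0 and start is not None:
                return text[start:m.start()]
    return None  # unbalanced input


print(naive)                    # outer <div>inner
print(outer_div_content(HTML))  # outer <div>inner</div> tail
```

The counter version only understands bare `<div>` tags; it illustrates the nesting problem and is not a parser.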

49 this is a very good answer – Paul Nathan Jan 4 '10 at 18:25

39 +1 for science level – mico Jul 5 '10 at 11:07

24 This is not actually the case. RegEx in most programming languages is actually context-free, due to the fact that it has look-backs, etc. – Michael Fairley Jul 5 '10 at 15:27

14 @michaelfairley Look ahead/behind/around features provide a richer syntax for expressing certain classes of regular expression. I do not believe these features provide fundamentally any more expressive power than a Chomsky type 3 grammar is capable of. One might argue that HTML is a visibly pushdown language (VPL) and so may be parsed using techniques less powerful than those required for a full-blown context-free grammar; however, I am unaware of any RegEx engine that supports VPLs either. – NealB Jul 5 '10 at 18:12

18 @Peter Jaric. HTML contains nested structures. These must be dealt with or you will be led astray. For example, using Regex to pull out the anchor tag in something like: 3"> is a challenge because the first > appears within the context of a text string (different nesting level). No matter what you do, there exists some valid HTML sequence that Regex will mess up on. Too many people think they can ignore or work around this fact. They all eventually get themselves into trouble because of it. – NealB Jul 5 '10 at 15:30.
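A minimal Python sketch of the problem described in that comment. The anchor tag below is a made-up input whose `title` attribute contains a `>`, and `AnchorCollector` is a hypothetical helper name. The "everything up to the first `>`" pattern cuts the tag short, while the standard library's `html.parser` reads the quoted attribute value as a unit.

```python
import re
from html.parser import HTMLParser

# Hypothetical input: the title attribute legitimately contains a ">".
HTML = '<a href="foo" title="5>3">Oops</a>'

# Naive pattern: the tag is "everything up to the first >".
naive_tag = re.search(r"<a[^>]*>", HTML).group()


class AnchorCollector(HTMLParser):
    """Collects the attributes of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.anchors.append(dict(attrs))


collector = AnchorCollector()
collector.feed(HTML)
collector.close()

print(naive_tag)          # <a href="foo" title="5>
print(collector.anchors)  # [{'href': 'foo', 'title': '5>3'}]
```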

Disclaimer: use a parser if you have the option. That said... This is the regex I use (!) to match HTML tags: )+> It may not be perfect, but I ran this code through a lot of HTML. Note that it even catches strange things like , which show up on the web.

I guess to make it not match self contained tags, you'd either want to use Kobi's negative look-behind: )+(? or just combine if and if not. To downvoters: This is working code from an actual product. I doubt anyone reading this page will get the impression that it is socially acceptable to use regexes on HTML.

Caveat: I should note that this regex still breaks down in the presence of CDATA blocks, comments, and script and style elements. Good news is, you can get rid of those using a regex...
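For illustration only, here is a Python sketch of the general idea: treat quoted attribute values as units, and use a negative look-behind to reject tags that end in `/>`. This is not the regex from the answer above; the pattern, the name `OPEN_TAG` and the sample input are all assumptions. The same caveats apply: comments, CDATA blocks, and script and style elements will still fool it.

```python
import re

# Assumed pattern, written for this sketch only.
OPEN_TAG = re.compile(
    r"""<([A-Za-z][A-Za-z0-9]*)        # tag name, captured
        (?:"[^"]*"|'[^']*'|[^'">])*    # quoted values, or any other character
        (?<!/)>                        # final > must not follow a slash
    """,
    re.VERBOSE,
)


def open_tags(html):
    """Return the lower-cased names of opening tags that are not self-closed."""
    return [m.group(1).lower() for m in OPEN_TAG.finditer(html)]


sample = '<p class="x"><a href="foo" title="5>3">link</a><br /><hr/></p>'
print(open_tags(sample))  # ['p', 'a']
```

Because `>` is only allowed inside quotes, a match can never run past the end of its own tag, which is what keeps `<br />` from being picked up by backtracking.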

5 Nothing to complain about, just three down votes. I'm at -5, probably for not adding a warning not to use my code :) – Kobi Nov 16 '09 at 14:10

3 Up for karma. And this makes 15 characters. – Jeff Nov 16 '09 at 14:31

1 I got a couple of anonymous down votes that were really about minor differences in opinion. I didn't like it. I mean, you put a disclaimer right at the front, right? One up for karma. – Stephen Harmon Mar 30 '10 at 14:53

1 111 up for karma – zildjohn01 May 28 '10 at 10:11

13 +1 This is definitely a helpful answer given all the caveats. – Christian Hayter Jun 14 '10 at 8:57.

Don't listen to these guys. You actually can parse context-free grammars with regex; all you need to do is solve the halting problem. After that it's pretty trivial - you just need an algorithm to losslessly compress random data, work out the Traveling Salesman Problem in O(log n), and divide the whole thing by zero.

Easy-peasy. Haven't figured out the last part yet, but I'm working on it. My code keeps throwing CthulhuRlyehWgahnaglFhtagnExceptions lately, so I'm setting up a catch block to consume those and resume parsing.

I'll update with the code once I investigate this strange door that just opened in the wall. Hmm. Pierre de Fermat also figured out how to do it, but the margin he was writing in wasn't big enough for the code.

31 Getting past the whole CthulhuRlyehWgahnaglFhtagnException is easy. It's the part where my algorithm stops working, changes its name to Skynet, and starts hunting babies from the past that I cannot seem to figure out. – Moses Mar 8 at 19:02

16 Plz send me teh codez – Christian Hayter Mar 9 at 18:34

19 @Moses - I've seen that one before. Common mistake, you want to set the BecomeSelfAwareAndKillUsAll flag to false. – Justin Morgan Mar 10 at 21:07

22 +1 for the last line - I wonder how many people actually got that. – new123456 Apr 7 at 0:08

7 Excellent. There are now 10 useless answers before the relevant sample code in the most linked-to question of Stack Overflow. – Kobi May 1 at 4:20.

5 Yes, especially given this comment "I'm parsing a block of XHTML, truncating it, then closing any tags that are left open after it's been truncated. The DOM XML stuff doesn't work because it's not properly formed XML. " Use BeautifulSoup to truncate and prettify.

– Mark Nov 15 '09 at 18:51.
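The task quoted in that comment (truncate a fragment, then close whatever is still open) can be sketched with Python's standard library instead of BeautifulSoup. This is a swapped-in technique, not BeautifulSoup's API. `OpenTagTracker`, `close_open_tags` and the `VOID` set are names invented for this sketch, and it assumes the truncation point does not fall in the middle of a tag.

```python
from html.parser import HTMLParser

# Elements that never take a closing tag (assumed list for this sketch).
VOID = {"area", "base", "br", "col", "embed", "hr", "img",
        "input", "link", "meta", "source", "track", "wbr"}


class OpenTagTracker(HTMLParser):
    """Keeps a stack of elements that have been opened but not yet closed."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # <tag/> syntax is never left open

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag, discarding unclosed children.
            while self.stack.pop() != tag:
                pass


def close_open_tags(fragment):
    """Append closing tags for every element the fragment leaves open."""
    tracker = OpenTagTracker()
    tracker.feed(fragment)
    return fragment + "".join(f"</{name}>" for name in reversed(tracker.stack))


print(close_open_tags("<div><p>Hello <b>wor"))
# <div><p>Hello <b>wor</b></p></div>
```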

Someone wrote a full html parser for PHP: htmlpurifier.org.

2 This is a great parser +1 – alex Feb 12 '10 at 3:26.

I suggest using QueryPath for parsing XML and HTML in PHP. It's basically much the same syntax as jQuery, only it's on the server side.

1 +1 I really like how easy html parsing is with jQuery, didn't know there was something similar for server side. – Kyle Sep 14 '10 at 20:33.

I find this small PHP library incredibly useful for parsing HTML tags: simplehtmldom.sourceforge.net/.

1 Yep, this is the usual thing for HTML, when it's not well-formed XHTML anyway. – bobince Nov 13 '09 at 23:34.

You can parse HTML in sed, though:

1. Turing.sed
2. Write HTML parser (homework)
3. ???
4. Profit!

6 Great handwaving going on here. – Mike Feb 8 at 17:51.

Foo name';
$dom = new DOMDocument();
$dom->loadHTML($html);
$els = $dom->getElementsByTagName('*');
foreach ( $els as $el ) {
    $nodeName = strtolower($el->nodeName);
    // Report only elements that are not in the self-closing list
    if ( !in_array( $nodeName, $selfClosing ) ) {
        var_dump( $nodeName );
    }
}

Output:

string(4) "html"
string(4) "body"
string(1) "p"
string(1) "a"
string(3) "div"

Basically just define the element node names that are self closing, load the whole HTML string into a DOM library, grab all elements, loop through them, keep the ones which aren't self closing, and operate on those. I'm sure you already know by now that you shouldn't use regex for this purpose.
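The same approach can be sketched in Python with the standard library's `html.parser`. The sample markup, the `SELF_CLOSING` set and the `ElementLister` class are assumptions for illustration, since the start of the PHP snippet above is cut off. Unlike `DOMDocument::loadHTML`, this parser does not add `html` and `body` wrappers, so only the tags actually present in the input are reported.

```python
from html.parser import HTMLParser

# Assumed list of self-closing element names.
SELF_CLOSING = {"area", "base", "br", "col", "embed", "hr", "img",
                "input", "link", "meta", "source", "track", "wbr"}


class ElementLister(HTMLParser):
    """Records the name of every element that is not self-closing."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        # Also reached for <tag/> syntax via the default handle_startendtag.
        if tag not in SELF_CLOSING:
            self.names.append(tag)


html = '<p><a href="foo">Foo</a></p><hr/><br><div>name</div>'
lister = ElementLister()
lister.feed(html)
lister.close()

print(lister.names)  # ['p', 'a', 'div']
```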

There are persons that will tell you that the Earth is round (or perhaps that the Earth is an oblate spheroid, if they want to use strange words). They are lying. There are persons that will tell you that Regular Expressions shouldn't be recursive.

They are limiting you. They need to subjugate you, and they do it by keeping you in ignorance. You can live in their reality or take the red pill.

Like the Lord Marshal (is he a relative of the Marshal .NET class?), I have seen the Underverse Stack Based Regex-Verse and returned with powers knowledge you can't imagine.

Yes, I think there were an Old One or two protecting them, but they were watching football on the TV, so it wasn't difficult. I think the XML case is quite simple. The RegEx (in the .

NET syntax), deflated and coded in base64 to make it easier to comprehend by your feeble mind, should be something like this: 7L0HYBxJliUmL23Ke39K9UrX4HShCIBgEyTYkEAQ7MGIzeaS7B1pRyMpqyqBymVWZV1mFkDM7Z28 995777333nvvvfe6O51OJ/ff/z9cZmQBbPbOStrJniGAqsgfP358Hz8itn6Po9/3eIue3+Px7/3F 86enJ8+/fHn64ujx7/t7vFuUd/Dx65fHJ6dHW9/7fd/t7fy+73Ye0v+f0v+Pv//JnTvureM3b169 OP7i9Ogyr5uiWt746u+BBqc/8dXx86PP7tzU9mfQ9tWrL18d3UGnW/z7nZ9htH/y9NXrsy9fvPjq i5/46ss3p4z+x3e8b452f9/x93a2HxIkH44PpgeFyPD6lMAEHUdbcn8ffTP9fdTrz/8rBPCe05Iv p9WsWF788Obl9MXJl0/PXnwONLozY747+t7x9k9l2z/4vv4kqo1//993+/vf2kC5HtwNcxXH4aOf LRw2z9/v8WEz2LTZcpaV1TL/4c3h66ex2Xv95vjF0+PnX744PbrOm59ZVhso5UHYME/dfj768H7e Yy5uQUydDAH9+/4eR11wHbqdfPnFF6cv3ogq/V23t++4z4620A13cSzd7O1s/77rpw+ePft916c7 O/jj2bNnT7e/t/397//M9+ibA/7s6ZNnz76PP0/kT2rz/Ts/s/0NArvziYxVEZWxbm93xsrUfnlm rASN7Hf93u/97vvf+2Lx/e89L7+/FSXiz4Bkd/hF5mVq9Yik7fcncft9350QCu+efkr/P6BfntEv z+iX9c4eBrFz7wEwpB9P+d9n9MfuM3yzt7Nzss0/nuJfbra3e4BvZFR7z07pj3s7O7uWJM8eCkme nuCPp88MfW6kDeH7+26PSTX8vu+ePAAiO4LVp4zIPWC1t7O/8/+pMX3rzo2KhL7+8s23T1/RhP0e vyvm8HbsdmPXYDVhtpdnAzJ1k1jeufOtUAM8ffP06Zcnb36fl6dPXh2f/F6nRvruyHfMd9rgJp0Y gvsRx/6/ZUzfCtX4e5hTndGzp5jQo9e/z+s3p1/czAUMlts+P3tz+uo4tISd745uJxvb3/v4ZlWs mrjfd9SG/swGPD/6+nh+9MF4brTBRmh1Tl5+9eT52ckt5oR0xldPzp7GR8pfuXf5PWJv4nJIwvbH W3c+GY3vPvrs9zj8Xb/147/n7/b7/+52DD2gsSH8zGDvH9+i9/fu/PftTfTXYf5hB+9H7P1BeG52 MTtu4S2cTAjDizevv3ry+vSNb8N+3+/1po2anj4/hZsGt3TY4GmjYbEKDJ62/pHB+3/LmL62wdsU 1J18+eINzTJr3dMvXr75fX7m+MXvY9XxF2e/9+nTgPu2bgwh5U0f7u/74y9Pnh6/OX4PlA2UlwTn xenJG8L996VhbP3++PCrV68QkrjveITxr2TIt+lL+f3k22fPn/6I6f/fMqZvqXN/K4Xps6sazUGZ GeQlar49xEvajzI35VRevDl78/sc/b7f6jkG8Va/x52N4L9lBe/kZSh1hr9fPj19+ebbR4AifyuY 12efv5CgGh9TroR6Pj2l748iYxYgN8Z7pr0HzRLg66FnRvcjUft/45i+pRP08vTV6TOe2N/9jv37 R9P0/5YxbXQDeK5E9R12XdDA/4zop+/9Ht/65PtsDVlBBUqko986WsDoWqvbPD2gH/T01DAC1NVn 3/uZ0feZ+T77fd/GVMkA4KjeMcg6RcvQLRl8HyPaWVStdv17PwHV0bOB9xUh7rfMp5Zu3icBJp25 D6f0NhayHyfI3HXHY6YYCw7Pz17fEFhQKzS6ZWChrX+kUf7fMqavHViEPPKjCf1/y5hukcyPTvjP 
mHQCppRDN4nbVFPaT8+ekpV5/TP8g/79mVPo77PT1/LL7/MzL7548+XvdfritflFY00fxIsvSQPS mvctdYZpbt7vxKRfj3018OvC/hEf/79lTBvM3debWj+b8KO0wP+3OeM2aYHumuCAGonmCrxw9cVX X1C2d4P+uSU7eoBUMzI3/f9udjbYl/el04dI7s8fan8dWRjm6gFx+NrKeFP+WX0CxBdPT58df/X8 DaWLX53+xFdnr06f/szv++NnX7x8fnb6NAhIwsbPkPS7iSUQAFETvP2Tx8+/Og0Xt/yBvDn9vd/c etno8S+81QKXptq/ffzKZFZ+4e/743e8zxino+8RX37/k595h5/H28+y7fPv490hQdJ349E+txB3 zPZ5J/jsR8bs/y1j2hh/2fkayOqEmYcej0cXUWMN7QrqBwjDrVZRfyQM3xjj/EgYvo4wfLTZrnVS ebdKq0XSZJvzajKQDUv1/P3NwbEP7cN5+Odivv9/ysPfhHfkOP6b9Fl+91v7LD9aCvp/+Zi+7lLQ j0zwNzYFP+/Y6r1NcFeDbfBIo8rug3zS3/3WPumPlN3/y8f0I2X3cz4FP+/Y6htSdr2I42fEuSPX /ewpL4e9/n1evzn94hb+Plpw2+dnbyh79zx0CsPvbq0lb+UQ/h7xvqPq/Gc24PnR18fzVrp8I57d mehj7ebk5VdPnp+d3GJOSP189eTsaXyk/JV7l98j4SAZgRxtf7x155PR+O6jz36Pw9/1Wz/+e/5u v//vbsfQAxobws8M9v7xLXp/785/395ED4nO1wx5fsTeH4LnRva+eYY8rpZUBFb/j/jfm8XAvfEj 4/b/ljF1F9B/jx5PhAkp1nu/+y3n+kdZp/93jWmjJ/M11TG++VEG6puZn593PPejoOyHMQU/79jq GwrKfpSB+tmcwZ93XPkjZffDmIKfd2z1DSm7bmCoPPmjBNT74XkrVf71I/Sf6wTU7XJA4RB+lIC6 mW1+xN5GWw1/683C5rnj/m364cmr45Pf6/SN9H4Us4LISn355vjN2ZcvtDGT6fHvapJcMISmxc0K MAD4IyP6/5Yx/SwkP360FvD1VTH191mURr/HUY+2P3I9boPnz7Ju/pHrcWPnP3I9/r/L3sN0v52z 0fEgNrgbL8/Evfh9fw/q5Xf93u/97vvf+2Lx/e89L7+/Fe3iZ37f34P5h178kTfx/5YxfUs8vY26 7/d4/OWbb5++ogn7PX5XzOHtOP3GrsHmqobOVO/8Hh1Gk/TPl198QS6w+rLb23fcZ0fMaTfjsv29 7Zul7me2v0FgRoYVURnf9nZEkDD+H2VDf8hjeq8xff1s6GbButNLacEtefHm9VdPXp++CRTw7/v9 r6vW8b9eJ0+/PIHzs1HHdyKE/x9L4Y+s2f+PJPX/1dbsJn3wrY6wiqv85vjVm9Pnp+DgN8efM5va j794+eb36Xz3mAf5+58+f3r68s230dRvJcxKn/l//oh3f+7H9K2O0r05PXf85s2rH83f/1vGdAvd w+qBFqsoWvzspozD77EpXYeZ7yzdfxy0ec+l+8e/8FbR84+Wd78xbvn/qQQMz/J7L++GPB7N0MQa 2vTMBwjDrVI0PxKGb4xxfiQMX0cYPuq/Fbx2C1sU8yEF+F34iNsx1xOGa9t6l/yX70uqmxu+qBGm AxlxWwVS11O97ULqlsFIUvUnT4/fHIuL//3f9/t9J39Y9m8W/Tuc296yUeX/b0PiHwUeP1801Y8C j/9vz9+PAo8f+Vq35Jb/n0rAz7Kv9aPA40fC8P+RMf3sC8PP08DjR1L3DXHoj6SuIz/CCghZNZb8 fb/Hf/2+37tjvuBY9vu3jmRvxNeGgQAuaAF6Pwj8/+e66M8/7rwpRNj6uVwXZRl52k0n3FVl95Q+ 
+fz0KSu73/dtkGDYdvZgSP5uskadrtViRKyal2IKAiQfiW+FI+tET/9/Txj9SFf8SFf8rOuKzagx +r/vD34mUADO1P4/AQAA//8= The options to set is RegexOptions.ExplicitCapture. The capture group you are looking for is ELEMENTNAME. If the capture group ERROR is not empty then there was a parsing error and the Regex stopped.

If you have problems reconverting it to a human readable regex, this should help:

// Requires: using System; using System.IO; using System.IO.Compression; using System.Text;
static string FromBase64(string str)
{
    byte[] byteArray = Convert.FromBase64String(str);

    using (var msIn = new MemoryStream(byteArray))
    using (var msOut = new MemoryStream())
    {
        // Inflate the raw DEFLATE stream into msOut
        using (var ds = new DeflateStream(msIn, CompressionMode.Decompress))
        {
            ds.CopyTo(msOut);
        }

        return Encoding.UTF8.GetString(msOut.ToArray());
    }
}

If you are unsure, no, I'm NOT kidding (but perhaps I'm lying).
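For readers without a .NET toolchain, the same decoding step can be sketched in Python. It assumes the blob is Base64 over a raw DEFLATE stream (the format `DeflateStream` writes), which is why `wbits=-15` is passed. The helper `to_base64` is a hypothetical inverse added only so the round trip can be checked; nothing here claims to have decoded the blob above.

```python
import base64
import zlib


def from_base64(text):
    """Base64-decode, then inflate a raw DEFLATE stream into a UTF-8 string."""
    # b64decode discards characters outside the Base64 alphabet by default,
    # so the spaces inside a pasted blob do no harm.
    raw = base64.b64decode(text)
    # wbits=-15 selects raw DEFLATE with no zlib or gzip header.
    return zlib.decompress(raw, wbits=-15).decode("utf-8")


def to_base64(text):
    """Hypothetical inverse: deflate (raw) and Base64-encode a string."""
    compressor = zlib.compressobj(wbits=-15)
    raw = compressor.compress(text.encode("utf-8")) + compressor.flush()
    return base64.b64encode(raw).decode("ascii")


sample = "(?<ELEMENTNAME>[A-Za-z_][\\w.-]*)"
print(from_base64(to_base64(sample)) == sample)  # True
```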

It WILL work. I've built tons of unit tests to test it, and I have even used (part of) the conformance tests ( http://www.w3.org/XML/Test/ ). It's a tokenizer, not a full blown parser, so it will only split the XML into its component tokens. It won't parse/integrate DTDs.

Oh... if you want the source code of the regex, with some auxiliary methods: http://snipt.net/xanatos/regex-to-tokenize-an-xml.

3 not-sure-if-serious.jpg -- hopefully this is brilliant satire – bemace Mar 8 at 14:53

4 Good Lord, it's massive. My biggest question is why? You realize that all modern languages have XML parsers, right? You can do all that in like 3 lines and be sure it'll work. Furthermore, do you also realize that pure regex is provably unable to do certain things? Unless you've created a hybrid regex/imperative code parser, but it doesn't look like you have. Can you compress random data as well? – Justin Morgan Mar 8 at 15:23

5 @Justin I don't need a reason. It could be done (and it wasn't illegal/immoral), so I have done it. There are no limitations to the mind except those we acknowledge (Napoleon .. Modern languages can parse XML? Really? And I thought that THAT was illegal! :-) – xanatos Mar 8 at 15:31

2 Well, we're talking about the theoretical limits of the language; it's not like we just haven't figured out how to do it yet. If you use pure regex, there's always going to be some (X)HTML valid code that breaks it. Maybe it's quux"> or maybe it's a certain nesting depth. Not that I didn't find your post funny (Old Ones - hah!) – Justin Morgan Mar 8 at 15:42

5 @Justin So an Xml Parser is by definition bug free, while a Regex isn't? Because if an Xml Parser isn't bug free by definition there could be an xml that makes it crash and we are back to step 0. Let's say this: both the Xml Parser and this Regex try to be able to parse all the "legal" XML. They CAN parse some "illegal" XML. Bugs could crash both of them. C# XmlReader is surely more tested than this Regex. – xanatos Mar 87 at 15:08.

I don't know your exact need for this, but if you are also using .NET, couldn't you use Html Agility Pack? Excerpt: It is a .NET code library that allows you to parse "out of the web" HTML files. The parser is very tolerant with "real world" malformed HTML.

?(? It is similar to yours, but the last > must not be after a slash, and also accepts h1.

24 3"> Oops – Gareth Nov 13 '09 at 23:11

2 That is very true, and I did think about it, but I assumed the > symbol is properly escaped to &gt;. – Kobi Nov 13 '09 at 23:16

26 > is valid in an attribute value. Indeed, in the ‘canonical XML’ serialisation you must not use &gt;. (Which isn't entirely relevant, except to emphasise that > in an attribute value is not at all an unusual thing.) – bobince Nov 14 '09 at 0:15.

I know Java isn't cool anymore, but if you want to use a really good library in Java, you might check into Tag soup which is built on top of Xerces. home.ccil.org/~cowan/XML/tagsoup.

92 Java was never cool ;) – notandy Dec 8 '09 at 19:54

4 Java is cool as a platform for better languages, but I'm off topic. – rplevy Mar 24 '10 at 23:53

3 PHP is cool - because 50 million newbies just can't be wrong. – Ondra Žižka Aug 5 at 13:26.

While the answers that you can't parse HTML with regexes are correct, they don't apply here. The OP just wants to parse one HTML tag with regexes, and that is something that can be done with a regular expression. The suggested regex is wrong, though: If you add something to the regex, by backtracking it can be forced to match silly things like >, ^/ is too permissive.

Also note that *^/* is redundant, because the ^/* can also match spaces. My suggestion would be *(? Where (?, the last of which may not be a /, followed by >". Note that this allows things like (just like the original regex), so if you want something more restrictive, you need to build a regex to match attribute pairs separated by spaces.

You should check PHP DOM Functions. Very handy once you study this tutorial : php.net/manual/en/book.dom.php.

1 This is actually a very good answer! – AntonioCS Dec 28 '09 at 22:11.

The W3C explains parsing in a pseudo regexp form: w3.org/TR/REC-xml-names/#ns-using Follow the var links for QName, S, and Attribute to get a clearer picture. Based on that you can create a pretty good regexp to handle things like stripping tags.

I have used an open source tool called HTMLParser before. It's designed to parse HTML in various ways and serves the purpose quite well. It can parse HTML into different tree nodes, and you can easily use its API to get attributes out of a node.

Check it out and see if this can help you.

Whenever I need to quickly extract something from an HTML document, I use tidy to convert it to XML and then use XPath or XSLT to get what I need. In your case, something like this: //p/a[@href='foo'].
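A minimal sketch of the second half of that workflow in Python. It assumes the document has already been turned into well-formed XML without a namespace (real tidy output usually declares the XHTML namespace, which the query would then have to include). The sample document is made up. `xml.etree.ElementTree` supports only a subset of XPath, but that subset covers the attribute predicate used here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already well-formed input (what a tidy step would produce).
XHTML = """<html><body>
<p><a href="foo">first</a> <a href="bar">second</a></p>
<div><a href="foo">not inside a p</a></div>
</body></html>"""

root = ET.fromstring(XHTML)

# ElementTree's XPath subset: every <a href="foo"> that is a child of a <p>.
links = root.findall(".//p/a[@href='foo']")

print([a.text for a in links])  # ['first']
```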

Here is a PHP based parser that parses HTML using some ungodly regex. As the author of this project, I can tell you it is possible to parse HTML with regex, but not efficient. If you need a server-side solution (as I did for my wp-Typography WordPress plugin), this works.

If you need this for PHP: The PHP DOM functions won't work properly unless the input is properly formatted XML, no matter how much better their use is for the rest of mankind. Simplehtmldom is good, but I found it a bit buggy, and it is quite memory heavy; it will crash on large pages.

I have never used querypath, so can't comment on its usefulness. Another one to try is my DOMParser which is very light on resources and I've been using happily for a while. Simple to learn & powerful.

For Python and Java, similar links were posted. For the downvoters - I only wrote my class when the XML parsers proved unable to withstand real use. Religious downvoting just prevents useful answers from being posted - keep things within perspective of the question, please.

7 One of the more malformed and misguided Star Wars references I've seen. – Matthew Read Aug 8 at 17:59.

There are some nice regexes for replacing HTML with BBCode here garyshood.com/htmltobb/source.txt. For all you nay-sayers, note that he's not trying to fully parse HTML, just to sanitize it. He can probably afford to kill off tags that his simple "parser" can't understand.

You want the first > not preceded by a /. Look here for details on how to do that. It's referred to as negative lookbehind.

However, a naive implementation of that will end up matching in this example document. Can you provide a little more information on the problem you're trying to solve? Are you iterating through tags programmatically?

It seems to me you're trying to match tags without a "/" at the end. Try this: *(?

As many people have already pointed out, HTML is not a regular language, which can make it very difficult to parse. My solution to this is to turn it into a regular language using a tidy program and then to use an XML parser to consume the results. There are a lot of good options for this. My program is written in Java, using the jtidy library to turn the HTML into XML and then Jaxen to run XPath queries against the result.

Although regular expressions are neither suitable nor effective for that purpose, they sometimes provide quick solutions for simple matching problems, and in my view it's not that horrible to use them for trivial work. There is a definitive blog post about matching innermost HTML elements written by Steven Levithan.

I recently wrote an HTML sanitizer in Java. It is based on a mixed approach of regular expressions and Java code. Personally I hate regular expressions and their folly (readability, maintainability, etc.), but if you reduce the scope of their application they may fit your needs.

Anyway, my sanitizer uses a white list for HTML tags and a black list for some style attributes. For your convenience I have set up a playground so you can test if the code matches your requirements: playground and Java code. Your feedback will be appreciated.

There is a small article describing this work on my blog: roberto.open-lab.com.

I agree that the right tool to parse XML and especially HTML is a parser and not a regular expression engine. However, like others have pointed out, sometimes using a regex is quicker, easier, and gets the job done if you know the data format. Microsoft actually has a section of Best Practices for Regular Expressions in the .NET Framework and specifically talks about Considering the Input Source. Regular Expressions do have limitations, but have you considered the following? C# is unique when it comes to regular expressions in that it supports Balancing Group Definitions.

See Matching Balanced Constructs with .NET Regular Expressions

See .NET Regular Expressions: Regex and Balanced Matching

See Microsoft's docs on Balancing Group Definitions

For this reason, I believe you CAN parse XML using regular expressions.

Note however, that it must be valid XML (browsers are very forgiving of HTML and allow bad XML syntax inside HTML). This is possible since the "Balancing Group Definition" will allow the regular expression engine to act as a PDA. Quote from article 1 cited above:

.NET Regular Expression Engine

As described above properly balanced constructs cannot be described by a regular expression. However, the .NET regular expression engine provides a few constructs that allow balanced constructs to be recognized.

(?<group>) - pushes the captured result on the capture stack with the name group.

(?<-group>) - pops the top most capture with the name group off the capture stack.

(?(group)yes|no) - matches the yes part if there exists a group with the name group otherwise matches no part.

These constructs allow for a .NET regular expression to emulate a restricted PDA by essentially allowing simple versions of the stack operations: push, pop and empty. The simple operations are pretty much equivalent to increment, decrement and compare to zero respectively. This allows for the .NET regular expression engine to recognize a subset of the context-free languages, in particular the ones that only require a simple counter. This in turn allows for the non-traditional .NET regular expressions to recognize individual properly balanced constructs.

Consider the following regular expression: (?=) (?> | (?/*>) | (?*>) | /*/> | ^* )* (?(opentag)(?! ))

Use the flags: Singleline, IgnorePatternWhitespace (not necessary if you collapse regex and remove all whitespace), IgnoreCase (not necessary)

Regular Expression Explained (inline)

(?=) # match start with
# atomic group / don't backtrack (for efficiency)
| # match xml / html comment
(?/*>) | # push opening xml tag
(?*>) | # pop closing xml tag
/*/> | # self closing tag
^* # something between tags
)* # match as many xml tags as possible
(?(opentag)(?!)) # ensure no 'opentag' groups are on stack

You can try this at A Better .NET Regular Expression Tester. I used the sample source of: stuff... more stuff still more Another >ulul

Funny enough, it cites the answer to this question that currently has over 4k votes.
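The push, pop and empty operations described in the quoted article can be sketched outside a regex engine. Python's `re` module has no balancing groups, so the sketch below is an assumption-laden stand-in, not the .NET pattern: a regex named `TOKEN` (made up here) only splits the input into tags, and an explicit list plays the role of the capture stack. It assumes no `>` appears inside attribute values.

```python
import re

# Tokenizer only: comments, or <name ...>, </name>, <name .../>.
TOKEN = re.compile(r"<!--.*?-->|<(/?)([A-Za-z][\w:.-]*)[^>]*?(/?)>", re.DOTALL)


def is_balanced(xml):
    """Check that every opening tag is closed in the right order."""
    stack = []
    for m in TOKEN.finditer(xml):
        closing, name, self_closing = m.group(1), m.group(2), m.group(3)
        if name is None or self_closing:
            continue  # a comment or <tag/>: nothing to push or pop
        if closing:
            # "pop": the closing tag must match the most recent open tag
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)  # "push"
    return not stack  # "empty": no open tags may remain


print(is_balanced("<a><b>text</b><br/></a>"))  # True
print(is_balanced("<a><b></a></b>"))           # False
print(is_balanced("<a><b></b>"))               # False
```

Unlike a bare counter, the list also checks that tag names match, which is closer to what the named `opentag` group stack is used for in the pattern above.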
