Python regular expression for HTML parsing (BeautifulSoup)?

For this particular case, BeautifulSoup is harder to write than a regex, but it is much more robust. I'm just contributing the BeautifulSoup example, given that you already know which regexp to use :-)

    from BeautifulSoup import BeautifulSoup

    # Or retrieve it from the web, etc.
    html_data = open('/yourwebsite/page.html', 'r').read()

    # Create the soup object from the HTML data
    soup = BeautifulSoup(html_data)

    # Find the proper tag; the filters go in attrs because find() already
    # uses "name" for the tag name
    fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})

    # The value of the third attribute of the desired tag
    value = fooId.attrs[2][1]
    # or index it directly via fooId['value']

I think that the "new" keyword is a mismatch. – Andrea Francia May 12 at 13:27.

I agree with Vinko: BeautifulSoup is the way to go. However, I suggest using fooId['value'] to get the attribute rather than relying on value being the third attribute.

    from BeautifulSoup import BeautifulSoup

    # Or retrieve it from the web, etc.
    html_data = open('/yourwebsite/page.html', 'r').read()

    # Create the soup object from the HTML data
    soup = BeautifulSoup(html_data)

    # Find the proper tag; the filters go in attrs because find() already
    # uses "name" for the tag name
    fooId = soup.find('input', attrs={'name': 'fooId', 'type': 'hidden'})

    # The value attribute
    value = fooId['value']

That's not python! – Aaron Gallagher Jan 27 '09 at 2:58.

    import re

    reg = re.compile('" />')
    value = reg.search(inputHTML).group(1)
    print 'Value is', value
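To make the trade-off concrete, here is a minimal Python 3 sketch of the regex approach. The pattern and the sample tags are hypothetical stand-ins, not the ones from the answer above; the attribute name fooId and the sample value are borrowed from other answers in this thread. It assumes double-quoted attributes in the order type, name, value, and shows what happens when that assumption fails.

```python
import re

# Hypothetical sample input; the real page markup is not shown in this thread.
SAMPLE = '<input type="hidden" name="fooId" value="12-3456789-1111111111" />'

# Hypothetical pattern: assumes double quotes and the order type, name, value.
HIDDEN_INPUT = re.compile(
    r'<input\s+type="hidden"\s+name="fooId"\s+value="([^"]*)"\s*/?>'
)


def extract_value(html):
    """Return the value attribute of the first matching tag, or None."""
    match = HIDDEN_INPUT.search(html)
    return match.group(1) if match else None


print("Value is", extract_value(SAMPLE))

# The same tag with its attributes reordered: the pattern no longer matches.
REORDERED = '<input name="fooId" value="12-3456789-1111111111" type="hidden" />'
print("Reordered:", extract_value(REORDERED))
```

Returning None instead of calling group(1) on a failed search avoids an AttributeError when the markup changes, which is probably the most common way a one-off regex script like this breaks.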

Parsing is one of those areas where you really don't want to roll your own if you can avoid it, as you'll be chasing down the edge cases and bugs for years to come. I'd recommend using BeautifulSoup. It has a very good reputation, and from the docs it looks pretty easy to use.

I agree for the general case, but if you're doing a one-off script to parse one or two very specific things out, a regex can just make life easier. Obviously more fragile, but if maintainability is a non-issue then it's not a concern. That said, BeautifulSoup is fantastic. – Cody Brocious Sep 10 '08 at 22:13.

I love regex, but have to agree with Orion on this one. This is one of the times when the famous quote from Jamie Zawinski comes to mind: "Now you have two problems" – Justin Standard Sep 10 '08 at 22:29.

Pyparsing is a good interim step between BeautifulSoup and regex. It is more robust than plain regexes, since its HTML tag parsing comprehends variations in case, whitespace, and attribute presence/absence/order, yet it makes this kind of basic tag extraction simpler than it is with BS. Your example is especially simple, since everything you are looking for is in the attributes of the opening "input" tag.

Here is a pyparsing example showing several variations on your input tag that would give regexes fits. It also shows how NOT to match a tag if it is within a comment:

    html = """ """

    from pyparsing import makeHTMLTags, withAttribute, htmlComment

    # use makeHTMLTags to create tag expression - makeHTMLTags returns expressions for
    # opening and closing tags, we're only interested in the opening tag
    inputTag = makeHTMLTags("input")[0]

    # only want input tags with special attributes
    inputTag.setParseAction(withAttribute(type="hidden", name="fooId"))

    # don't report tags that are commented out
    inputTag.ignore(htmlComment)

    # use searchString to skip through the input
    foundTags = inputTag.searchString(html)

    # dump out first result to show all returned tags and attributes
    print foundTags[0].dump()
    print

    # print out the value attribute for all matched tags
    for inpTag in foundTags:
        print inpTag.value

Prints:

    ['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**id is here**'], True]
    - empty: True
    - name: fooId
    - startInput: ['input', ['type', 'hidden'], ['name', 'fooId'], ['value', '**id is here**'], True]
      - empty: True
      - name: fooId
      - type: hidden
      - value: **id is here**
    - type: hidden
    - value: **id is here**

    **id is here**
    **id is here too**
    **id is HERE too**
    **and id is even here TOO**

You can see that not only does pyparsing match these unpredictable variations, it returns the data in an object that makes it easy to read out the individual tag attributes and their values.
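For comparison, the same kinds of variation (case, whitespace, quoting, attribute order, commented-out tags) can also be handled with nothing but the standard library. This is a rough Python 3 sketch that swaps in html.parser for pyparsing so it runs without third-party packages. The class name HiddenInputFinder and the sample markup are made up for illustration and are not the input used in the answer above.

```python
from html.parser import HTMLParser


class HiddenInputFinder(HTMLParser):
    """Collect the value of every <input type="hidden" name="fooId"> tag."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; attribute values
        # keep their case, so the type is lowercased here before comparing.
        attributes = dict(attrs)
        if (
            tag == "input"
            and (attributes.get("type") or "").lower() == "hidden"
            and attributes.get("name") == "fooId"
        ):
            self.values.append(attributes.get("value"))


# Hypothetical sample markup with variations in case, quoting, and order.
SAMPLE = """
<input type="hidden" name="fooId" value="first" />
<INPUT value='second' NAME="fooId"   TYPE=hidden>
<!-- <input type="hidden" name="fooId" value="commented out" /> -->
<input type="text" name="fooId" value="wrong type" />
"""

finder = HiddenInputFinder()
finder.feed(SAMPLE)
finder.close()
print(finder.values)
```

Comments are routed to handle_comment rather than handle_starttag, so the commented-out tag is skipped without any extra code. Self-closing tags such as the first one reach handle_starttag through the default handle_startendtag.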

    >>> import re
    >>> s = ''
    >>> re.match('', s).groups()
    ('fooId', '12-3456789-1111111111')
