Python string operation, extract text between html tags?

While it may be possible to parse arbitrary HTML with regular expressions, it's often a death trap. There are great tools out there for parsing HTML, including BeautifulSoup, a Python library that handles broken as well as well-formed HTML fairly well.

>>> from BeautifulSoup import BeautifulSoup as BSHTML
>>> BS = BSHTML("""
... <font>
... JUL 28 </font>"""
... )
>>> BS.font.contents[0].strip()
u'JUL 28'

Then you just need to parse the date:

>>> from datetime import datetime
>>> datetime.strptime(BS.font.contents[0].strip(), '%b %d')
datetime.datetime(1900, 7, 28, 0, 0)

Nice! This seems much less complicated than the regex way. – Flux Capacitor Oct 27 at 4:12

@FluxCapacitor A word of warning: My second argument to strptime above is actually a locale-specific example. Please refer to the documentation for more details if you need a locale-agnostic or different-locale solution. – kojiro Oct 27 at 4:14
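To illustrate the date-parsing step discussed above: a format like '%b %d' carries no year, so strptime falls back to 1900. The following is only a minimal sketch, assuming English month abbreviations (the default C locale) and a hypothetical helper name, parse_month_day, that is not part of the original answers.

```python
from datetime import datetime

def parse_month_day(text, year):
    """Parse text like 'JUL 28' and attach an explicit year.

    Hypothetical helper for illustration. Assumes %b month
    abbreviations are English, as in the default C locale.
    """
    # Parse the year together with the month and day so that
    # dates such as Feb 29 are validated against the right year.
    return datetime.strptime(f"{text.strip()} {year}", "%b %d %Y")

parsed = parse_month_day("  JUL 28 ", 2011)
print(parsed)
```

Supplying the year up front is one possible design choice; replacing the year on the parsed result afterwards would also work for most dates.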

You have a bunch of options here. You could go for an all-out XML parser like lxml, though you seem to want a domain-specific solution. I'd go with a multiline regex:

    import re
    rex = re.compile(r'<font.*?>(.*?)</font>', re.S|re.M)
    ...
    data = """<font> JUL 28 </font>"""
    match = rex.match(data)
    if match:
        text = match.groups()[0].strip()

Now that you have text, you can turn it into a date pretty easily:

    from datetime import datetime
    date = datetime.strptime(text, "%b %d")
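For readers who only sort of follow the regex approach, here is a small self-contained sketch of it. The sample markup, its attributes, and the helper name font_text are assumptions made for illustration; the real page may differ. It uses search() rather than match(), so the font tag does not have to sit at the very start of the string.

```python
import re

# Assumed sample markup; the real page's tags and attributes may differ.
data = '<td><font face="Arial" size="2">\nJUL 28 </font></td>'

# re.S lets '.' span newlines; the non-greedy group stops at the first </font>.
FONT_TEXT = re.compile(r"<font[^>]*>(.*?)</font>", re.S | re.I)

def font_text(html):
    """Return the stripped text of the first <font> element, or None."""
    found = FONT_TEXT.search(html)  # search() scans the whole string
    return found.group(1).strip() if found else None

print(font_text(data))
```

This stays fragile on nested or malformed markup, which is why the other answers lean toward a real HTML parser.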

You commented on AnthonyHurst's answer that this is from a website. I've used lxml's HTML parsing with a lot of success recently; I highly recommend it. – fahhem Oct 27 at 4:01

Thanks! I had seen something similar with regular expressions in another question, but wasn't able to make it work. Your solution worked for me perfectly. The downside is that I only sort of understand what's going on with it. – Flux Capacitor Oct 27 at 4:09

Or, you could simply use Beautiful Soup: Beautiful Soup is a Python HTML/XML parser designed for quick turnaround projects like screen-scraping.

Probably overkill but a good choice if there's more HTML parsing to be done. – Brendan Long Oct 27 at 4:04.

grep "*>(.*)*>" file

The (.*) should match your content.

I'm doing all this in Python... I used scrapy to scrape a webpage and drill down to arrive at the string above. – Flux Capacitor Oct 27 at 3:57

Sorry, then I couldn't assist you better. You could always use the re (regular expression) library to grab the same thing. – AnthonyHurst Oct 27 at 4:01

Use Scrapy's XPath selectors as documented at doc.scrapy.org/en/0.10.3/topics/selector...

Alternatively, you can use an HTML parser such as BeautifulSoup, especially if you want to operate on the document in an object-oriented manner: pypi.python.org/pypi/BeautifulSoup/3.2.0

Python has a library called HTMLParser. Also see the following question posted in SO which is very similar to what you are looking for: How can I use the python HTMLParser library to extract data from a specific div tag?
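As a rough sketch of the HTMLParser route: in Python 3 the class lives in the standard-library module html.parser. The class name FontTextParser and the sample markup below are assumptions for illustration, not taken from the linked question.

```python
from html.parser import HTMLParser

class FontTextParser(HTMLParser):
    """Collect the text found inside <font> elements (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._depth = 0    # number of <font> tags currently open
        self.chunks = []   # stripped text pieces seen inside <font>

    def handle_starttag(self, tag, attrs):
        if tag == "font":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "font" and self._depth:
            self._depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that appears while a <font> is open.
        if self._depth and data.strip():
            self.chunks.append(data.strip())

# Assumed sample markup; the real page may differ.
parser = FontTextParser()
parser.feed('<td><font face="Arial" size="2">\nJUL 28 </font>&nbsp;</td>')
parser.close()
print(parser.chunks)
```

The parser lowercases tag names before calling the handlers, so the same class also handles markup written as <FONT>.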

I wish to extract the string that's in between the tags. In this case, it's JUL 28, but it might be another date or some other number.

1) The best way to extract the value from between the font tags? I was thinking I could extract everything in between "> and </font>.
