Extract URLs from specific tags in Python?

    from BeautifulSoup import BeautifulSoup

    html = """ ... text ... Don't find me! Don't find me! Don't error on missing href!... """
    soup = BeautifulSoup(html)

    # Only look at <a class="l"> links that carry an href, inside <h3 class="r">
    for h3 in soup.findAll("h3", {"class": "r"}):
        for a in h3.findAll("a", {"class": "l", "href": True}):
            print(a["href"])
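
If you can't install BeautifulSoup, roughly the same filtering can be sketched with the standard library's html.parser on Python 3. This is not the code from the answer, just a stand-in under assumptions: the sample markup, the class name ResultLinkParser and the helper extract_result_links are made up for illustration, and class attributes are matched per token rather than as a whole string.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the HTML in the answer above is elided.
SAMPLE_HTML = """
<h3 class="r"><a class="l" href="http://example.com/a">find me</a></h3>
<h3 class="r"><a class="x" href="http://example.com/b">wrong link class</a></h3>
<h3><a class="l" href="http://example.com/c">wrong h3 class</a></h3>
<h3 class="r"><a class="l">no href at all</a></h3>
"""


class ResultLinkParser(HTMLParser):
    """Collect href values of <a class="l"> found inside <h3 class="r">."""

    def __init__(self):
        super().__init__()
        self.h3_stack = []  # one bool per open <h3>: True if it has class "r"
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "h3":
            self.h3_stack.append("r" in classes)
        elif tag == "a" and any(self.h3_stack) and "l" in classes:
            href = attrs.get("href")
            if href:  # skip anchors without an href instead of raising
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "h3" and self.h3_stack:
            self.h3_stack.pop()


def extract_result_links(html):
    parser = ResultLinkParser()
    parser.feed(html)
    parser.close()
    return parser.links


print(extract_result_links(SAMPLE_HTML))
```

The parser is lenient about sloppy markup, but it knows nothing about the document tree beyond the small stack kept here, so BeautifulSoup is still the more comfortable tool for anything bigger.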

This is working! Thanks :)) – Mehmet Helvaci Jun 5 at 16:25.

I'd use XPath; see here for a question about which package would be appropriate in Python.

I've never used XPath, but BeautifulSoup seems to be a standard answer for HTML/XML parsing as per the OP's question. – KyleWpppd Jun 5 at 15:05

I tend to work with more rigid formats, and prefer plainer errors. BeautifulSoup could indeed be a better answer if the HTML isn't to be trusted though, but I'd still be tempted to use XPath because of its portability. – Wrikken Jun 5 at 15:10

Thanks, I'll have to take a look into XPath. – KyleWpppd Jun 5 at 15:15.
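
To give an idea of what the XPath route could look like, here is a minimal sketch using xml.etree.ElementTree from the standard library, which supports only a limited XPath subset. The sample markup and the function name xpath_links are assumptions for illustration, and the input has to be well-formed: anything else raises a ParseError, which is the "plainer errors" trade-off mentioned above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample; real search-result pages rarely are.
SAMPLE_XML = """
<html><body>
  <h3 class="r"><a class="l" href="http://example.com/a">find me</a></h3>
  <h3 class="r"><a class="x" href="http://example.com/b">wrong link class</a></h3>
  <h3><a class="l" href="http://example.com/c">wrong h3 class</a></h3>
  <h3 class="r"><a class="l">no href at all</a></h3>
</body></html>
"""


def xpath_links(markup):
    # fromstring() raises ET.ParseError on malformed input
    root = ET.fromstring(markup.strip())
    links = []
    # ElementTree's XPath subset: descendant search plus attribute predicates
    for h3 in root.iterfind(".//h3[@class='r']"):
        for a in h3.iterfind(".//a[@class='l']"):
            href = a.get("href")
            if href is not None:  # skip anchors without an href
                links.append(href)
    return links


print(xpath_links(SAMPLE_XML))
```

A full XPath engine would let you fold both loops into a single expression; the two-step loop here only uses constructs that ElementTree documents.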

You can use a regular expression (regex) for that. This regex will catch all URLs beginning with http and surrounded by quotes ("):

    http[^\"]+

And this is how it's done in Python:

    import re
    myRegEx = re.compile("http[^\"]+")
    myResults = myRegEx.search('')

Replace the empty string with the variable storing the source code you want to search for URLs. If there is a match, myResults.start() and myResults.end() now contain the starting and ending position of the first matching URL (search() returns None when nothing matches). Use the myResults.group() function to get the string that matched the regex.
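
Note that search() only ever reports the first match. If you want every URL, finditer() and findall() from the same re module do that. A small sketch, assuming the pattern is meant to read "http, then everything up to the next double quote"; the source string is made up for illustration:

```python
import re

# "http", then every character up to (not including) the next double quote.
URL_PATTERN = re.compile(r'http[^"]+')

# Hypothetical source text standing in for the page you downloaded.
source = '<a href="http://example.com/a">a</a> <a href="https://example.org/b">b</a>'

# search() stops at the first match and returns None if there is none.
first = URL_PATTERN.search(source)
if first is not None:
    print(first.group(), first.start(), first.end())

# finditer() yields one match object per URL, with its positions.
for match in URL_PATTERN.finditer(source):
    print(match.start(), match.end(), match.group())

# findall() returns just the matched strings.
all_urls = URL_PATTERN.findall(source)
print(all_urls)
```

Keep in mind this picks up any quoted http string in the page, not only links inside particular tags.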

If anything isn't clear yet, just ask.

Well, that page belongs to a Google query. What I need to achieve is to extract all the websites' links from a Google search. Take a look at this code: xrayoptics.by.ru/database/misc/goog2text.py Maybe you can rewrite this? That'd be easier for you :) – Mehmet Helvaci Jun 5 at 15:12

@RobinJ this does not actually solve the OP's problem. The OP requested links within tags. – KyleWpppd Jun 5 at 15:14

Have you tried what I suggested yet? I know a bit of Python, but I'm not that good myself. The most complicated thing I've achieved so far is to make an IRC bot in it, which crashes after about 5 minutes. Regular expressions seem to be the easiest solution to me. @KyleWpppd: My bad, I forgot. I'm taking a look to see if I can find something for that. – RobinJ Jun 5 at 15:17

Regexes are about the most brittle solutions for this. At best, they usually work on a limited subset of HTML; at worst, your regexes become pages long and extremely unwieldy. Do a quick SO search on HTML/XML & regexes, and you'll see it's generally frowned upon save for very specific cases. – Wrikken Jun 5 at 15:22

The best solution I found is just finding whatever is between the tags, and then applying the regex I suggested above on it. – RobinJ Jun 5 at 15:27.
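
For what it's worth, that two-step idea could be sketched like this. It is only an illustration under assumptions: the thread doesn't say which tags are meant, so h3 is a guess based on the BeautifulSoup answer, the sample markup is invented, and the brittleness warning from the comments applies in full (nested tags, odd quoting or attributes in a different order will break it).

```python
import re

# Hypothetical sample markup; tag and class names are assumptions.
source = (
    '<h3 class="r"><a class="l" href="http://example.com/a">hit</a></h3>'
    '<p><a href="http://example.com/skip">outside any h3</a></p>'
    '<h3 class="r"><a class="l" href="http://example.com/b">hit</a></h3>'
)

# Step 1: grab whatever sits between <h3 ...> and </h3> (non-greedy).
H3_BLOCK = re.compile(r"<h3\b[^>]*>(.*?)</h3>", re.DOTALL | re.IGNORECASE)
# Step 2: the quoted-URL pattern, applied to each captured block only.
URL_PATTERN = re.compile(r'http[^"]+')

urls = []
for block in H3_BLOCK.findall(source):
    urls.extend(URL_PATTERN.findall(block))

print(urls)
```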

