Web mining or scraping or crawling? What tool/library should I use?

There really is no good single solution here. You are right to suspect that Python is probably the best way to start, because of its strong support for regular expressions. To implement something like this, a solid knowledge of SEO (Search Engine Optimization) helps, since effectively optimizing a web page for search engines teaches you how search engines behave.

I would start with a site like SEOMoz. As far as identifying the "about us" page goes, you only have two options: a) for each site, find the link to the about-us page yourself and feed it to your crawler, or b) parse all the links on the page for certain keywords like "about us", "about", "learn more", or whatever. If you go with option b, be careful not to get stuck in an infinite loop, since a website will link to the same page many times; if the link is in the header or footer, a page may even link back to itself.

To avoid this you'll need to keep a list of visited links and make sure not to revisit them. Finally, I would recommend having your crawler respect the instructions in the robots.txt file, and it's probably a good idea not to follow links marked rel="nofollow", as these are mostly used on external links.
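A minimal sketch of option b, using urllib2, robotparser, and BeautifulSoup as mentioned elsewhere in this thread (the start URL and keyword list are placeholders, not recommendations):

import urllib2
import robotparser
from urlparse import urljoin
from BeautifulSoup import BeautifulSoup

start_url = "http://example.com/"   # placeholder
keywords = ["about us", "about", "learn more"]

# Respect robots.txt of the start domain
rp = robotparser.RobotFileParser()
rp.set_url(urljoin(start_url, "/robots.txt"))
rp.read()

visited = set()

def crawl(url):
    # Skip already-seen or disallowed URLs to avoid infinite loops
    if url in visited or not rp.can_fetch("*", url):
        return
    visited.add(url)
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    for a in soup.findAll("a", href=True):
        if "nofollow" in (a.get("rel") or ""):   # skip rel="nofollow" links
            continue
        text = a.renderContents().lower()
        if any(k in text for k in keywords):     # looks like an "about" link
            crawl(urljoin(url, a["href"]))

crawl(start_url)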

Again, learn this and more by reading up on SEO. Regards.

When going with Python, you might be interested in mechanize and BeautifulSoup. Mechanize sort of simulates a browser (including options for proxying, faking browser identification, page redirection, etc.) and allows easy fetching of forms, links, ... The documentation is a bit rough/sparse though. Some example code (from the mechanize website) to give you an idea:

import mechanize

br = mechanize.Browser()
br.open("http://example.com/")
# follow second link with element text matching regular expression
html_response = br.follow_link(text_regex=r"cheese\s*shop", nr=1)
print br.title()
print html_response

BeautifulSoup allows you to parse HTML content (which you could have fetched with mechanize) pretty easily, and supports regexes. Some example code:

from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(html_response)
rows = soup.findAll('tr')
for r in rows[2:]:   # ignore first two rows
    cols = r.findAll('td')
    print cols[0].renderContents().strip()   # print content of first column

So, the ten or so lines above are pretty much copy-paste ready to print the content of the first column of every table row on a website.

Try out Scrapy. It is a web scraping framework for Python. If all you need is a simple Python script, try urllib2.
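To give a rough idea of what a Scrapy spider looks like, here is a minimal sketch based on the current Scrapy API (the spider name, start URL, output file names, and the "about" keyword test are placeholder assumptions):

import scrapy

class HomepageSpider(scrapy.Spider):
    name = "homepage"
    start_urls = ["http://example.com/"]   # placeholder

    def parse(self, response):
        # Save the front page as-is
        with open("homepage.html", "wb") as f:
            f.write(response.body)
        # Follow links whose anchor text mentions "about"
        for link in response.css("a"):
            text = " ".join(link.css("::text").getall()).lower()
            if "about" in text:
                yield response.follow(link, callback=self.parse_about)

    def parse_about(self, response):
        with open("about.html", "wb") as f:
            f.write(response.body)

You would save this in a file and run it with scrapy runspider; Scrapy takes care of request scheduling and duplicate filtering, and download delays can be configured in its settings.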

Heritrix has a bit of a steep learning curve, but it can be configured so that only the homepage, and any page that "looks like" an about page (using a regex filter), gets crawled. More open source Java (web) crawlers: java-source.net/open-source/crawlers.

If you are going to build a crawler you need to (Java specific):

- Learn how to use the java.net.URL and java.net.URLConnection classes, or use the HttpClient library
- Understand HTTP request/response headers
- Understand redirects (HTTP, HTML and JavaScript)
- Understand content encodings (charsets)
- Use a good library for parsing badly formed HTML (e.g. CyberNeko, Jericho, JSoup)
- Make concurrent HTTP requests to different hosts, but ensure you issue no more than one request to the same host every ~5 seconds
- Persist the pages you have fetched, so you don't need to refetch them every day if they don't change that often (HBase can be useful)
- Have a way of extracting links from the current page to crawl next
- Obey robots.txt
- A bunch of other stuff too

It's not that difficult, but there are lots of fiddly edge cases (e.g. redirects, detecting encodings; check out Tika).

For more basic requirements you could use wget. Heritrix is another option, but that is yet another framework to learn. Identifying "about us" pages can be done with various heuristics: inbound link text, the page title, the content on the page, and the URL. If you want to be more quantitative about it, you could use machine learning and a classifier (maybe Bayesian); a rough scoring sketch follows.
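A sketch of those heuristics in plain Python (the keyword list and weights are arbitrary assumptions, not tuned values; a real classifier would learn them from labelled examples):

def about_page_score(url, title, body_text, inbound_link_texts):
    """Score how likely a page is to be an 'about us' page.
    Keywords and weights are arbitrary placeholders."""
    keywords = ("about us", "about", "who we are", "company")
    score = 0.0
    if any(k in url.lower() for k in ("about", "company")):
        score += 2.0                        # URL hint
    if any(k in title.lower() for k in keywords):
        score += 2.0                        # page title hint
    if any(k in body_text.lower() for k in keywords):
        score += 1.0                        # on-page content hint
    for text in inbound_link_texts:         # anchor text of links pointing here
        if any(k in text.lower() for k in keywords):
            score += 1.0
    return score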

Saving the front page is obviously easier, but front-page redirects (sometimes to different domains, and often implemented with an HTML meta refresh tag or even JavaScript) are very common, so you need to handle them.
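As an illustration of the HTML case, here is a minimal sketch of detecting a meta refresh redirect with a regular expression (the pattern is a simplifying assumption: it expects http-equiv to appear before content and ignores other attribute orderings; JavaScript redirects would still need a real browser or a headless engine):

import re

# Matches e.g. <meta http-equiv="refresh" content="0; url=http://example.com/about">
META_REFRESH = re.compile(
    r'<meta[^>]+http-equiv=["\']?refresh["\']?[^>]+content=["\']?\d+\s*;\s*url=([^"\'>]+)',
    re.IGNORECASE)

def meta_refresh_target(html):
    """Return the redirect target of an HTML meta refresh, or None."""
    m = META_REFRESH.search(html)
    return m.group(1).strip() if m else None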

Python ==> Curl: fetch a list of URLs concurrently using pycurl's CurlMulti interface:

import sys
import pycurl

# We should ignore SIGPIPE when using pycurl.NOSIGNAL - see
# the libcurl tutorial for more info.
try:
    import signal
    from signal import SIGPIPE, SIG_IGN
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)
except ImportError:
    pass

# Get args
num_conn = 10
try:
    if sys.argv[1] == "-":
        urls = sys.stdin.readlines()
    else:
        urls = open(sys.argv[1]).readlines()
    if len(sys.argv) >= 3:
        num_conn = int(sys.argv[2])
except:
    print "Usage: %s <file with URLs> [<# of concurrent connections>]" % sys.argv[0]
    raise SystemExit

# Make a queue with (url, filename) tuples
queue = []
for url in urls:
    url = url.strip()
    if not url or url[0] == "#":
        continue
    filename = "doc_%03d.dat" % (len(queue) + 1)
    queue.append((url, filename))

# Check args
assert queue, "no URLs given"
num_urls = len(queue)
num_conn = min(num_conn, num_urls)
assert 1 <= num_conn <= 10000, "invalid number of concurrent connections"

# Pre-allocate a list of curl objects
m = pycurl.CurlMulti()
m.handles = []
for i in range(num_conn):
    c = pycurl.Curl()
    c.fp = None
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.MAXREDIRS, 5)
    c.setopt(pycurl.CONNECTTIMEOUT, 30)
    c.setopt(pycurl.TIMEOUT, 300)
    c.setopt(pycurl.NOSIGNAL, 1)
    m.handles.append(c)

# Main loop
freelist = m.handles[:]
num_processed = 0
while num_processed < num_urls:
    # If there is a url to process and a free curl object, add it to the multi stack
    while queue and freelist:
        url, filename = queue.pop(0)
        c = freelist.pop()
        c.fp = open(filename, "wb")
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, c.fp)
        m.add_handle(c)
        # store some info
        c.filename = filename
        c.url = url
    # Run the internal curl state machine for the multi stack
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # Check for curl objects which have terminated, and add them to the freelist
    while 1:
        num_q, ok_list, err_list = m.info_read()
        for c in ok_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            print "Success:", c.filename, c.url, c.getinfo(pycurl.EFFECTIVE_URL)
            freelist.append(c)
        for c, errno, errmsg in err_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            print "Failed: ", c.filename, c.url, errno, errmsg
            freelist.append(c)
        num_processed = num_processed + len(ok_list) + len(err_list)
        if num_q == 0:
            break
    # Currently no more I/O is pending; call select() to sleep until more data is available.
    m.select(1.0)

# Cleanup
for c in m.handles:
    if c.fp is not None:
        c.fp.close()
        c.fp = None
    c.close()
m.close()
