How can I make Heritrix continue crawling domains it discovers that are not in the seed list?

It's been a while since I last worked with Heritrix, but if memory serves me well, you'll need to change the max-link-hops in your settings/profile. The larger you make max-link-hops, the more steps ("hops") Heritrix makes from the seed(s) you have defined.

Thanks, but there is no max-link-hops! There is max-hop; is that the one? Is there an infinite value for it? – Snigger Oct 2 at 18:47

@Snigger, yeah, my guess is that max-hop is the one. I don't know if there's an infinite value for it; perhaps 0 means infinite? Try it out.

But I don't think setting it to infinite (or even a large value) is wise: chances are your crawl will simply be running for years (no kidding)! Unless you have a distributed setup consisting of thousands of machines with massive amounts of storage and bandwidth... – Bart Kiers Oct 2 at 18:54

By default, Heritrix is configured to crawl only URLs on the domains that are in your seed list. Some additional material is usually crawled as well, because embedded material hosted elsewhere is also fetched. If you would like Heritrix to crawl anything it comes across, you'll need to modify the scope.

The scope is composed of a series of decide rules. Each rule can ACCEPT, REJECT or pass on a URL. The last rule to either ACCEPT or REJECT wins.
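To make the "last rule to decide wins" behaviour concrete, here is a minimal Java sketch. It does not use the Heritrix API: the `Decision` enum, the `Rule` interface and the `decide` method are hypothetical names made up for illustration, and the three rules are only examples.

```java
import java.util.List;
import java.util.function.Function;

public class DecideRuleSketch {
    // NONE stands for "pass": the rule has no opinion on this URL.
    enum Decision { ACCEPT, REJECT, NONE }

    // Hypothetical rule type (not a Heritrix class): maps a URL to a decision.
    interface Rule extends Function<String, Decision> {}

    // Every rule is consulted in order; the last one that does not pass wins.
    static Decision decide(List<Rule> rules, String url) {
        Decision result = Decision.NONE;
        for (Rule rule : rules) {
            Decision d = rule.apply(url);
            if (d != Decision.NONE) {
                result = d;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Rule> rules = List.of(
            url -> Decision.REJECT,                       // blanket reject
            url -> url.startsWith("http://example.com/")  // rule the seed site back in
                    ? Decision.ACCEPT : Decision.NONE,
            url -> url.endsWith(".iso")                   // later rule overrides the accept
                    ? Decision.REJECT : Decision.NONE);

        System.out.println(decide(rules, "http://example.com/page.html")); // ACCEPT
        System.out.println(decide(rules, "http://example.com/big.iso"));   // REJECT
        System.out.println(decide(rules, "http://other.org/"));            // REJECT
    }
}
```

Because later rules override earlier ones, the order of the rules matters as much as the rules themselves.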

Typically, the first rule in the list is a blanket reject-all, followed by a SurtPrefixDecideRule that rules in all URLs that match the SURT list. The SURT list is typically built from the seed list. You can, however, configure the SURT list manually by specifying your own, or (if you really want everything) you can simply remove it and the reject-all rule and add an accept-all decide rule to the top.
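As a rough illustration of how a SURT prefix list scopes a crawl, here is a simplified Java sketch. It is not Heritrix's own implementation: `toSurt` and `inScope` are hypothetical helpers, and the conversion ignores ports, user info and query strings, which a real SURT conversion handles. The idea is that the host labels are reversed, so a single prefix can cover a domain together with all of its subdomains.

```java
import java.net.URI;
import java.util.List;

public class SurtScopeSketch {
    // Simplified SURT form: http://www.archive.org/foo -> http://(org,archive,www,)/foo
    // Assumption: the URL is absolute and has a host; port and query are dropped.
    static String toSurt(String url) {
        URI uri = URI.create(url);
        String[] labels = uri.getHost().toLowerCase().split("\\.");
        StringBuilder sb = new StringBuilder(uri.getScheme()).append("://(");
        for (int i = labels.length - 1; i >= 0; i--) {
            sb.append(labels[i]).append(',');
        }
        sb.append(')');
        String path = uri.getRawPath();
        sb.append(path == null || path.isEmpty() ? "/" : path);
        return sb.toString();
    }

    // A URL is in scope if its SURT form starts with any configured prefix.
    static boolean inScope(List<String> surtPrefixes, String url) {
        String surt = toSurt(url);
        for (String prefix : surtPrefixes) {
            if (surt.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Prefix as it might be derived from a seed on archive.org.
        List<String> seedScope = List.of("http://(org,archive,");
        System.out.println(toSurt("http://www.archive.org/foo"));
        System.out.println(inScope(seedScope, "http://www.archive.org/foo"));
        System.out.println(inScope(seedScope, "http://example.com/"));

        // A manually specified, much broader prefix: every http URL matches.
        List<String> wideScope = List.of("http://(");
        System.out.println(inScope(wideScope, "http://example.com/"));
    }
}
```

In this sketch, widening the scope is just a matter of using shorter prefixes, which mirrors the manual SURT list option mentioned above.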

More on configuring Heritrix 3 scoping.
