I am trying to crawl a specific website in its entirety (ignoring external links) using Nutch 2.3 with HBase 0.94.14.
I have followed a step-by-step tutorial (you can find it here) on how to set up and use these tools. However, I haven't been able to achieve my goal. Instead of crawling the entire website whose URL I've written in the seed.txt file, Nutch only retrieves that base URL in the first round. I need to run further crawl rounds for Nutch to retrieve more URLs.
The problem is I don't know how many rounds I need in order to crawl the entire website, so I need a way to tell Nutch to "keep crawling until the entire website has been crawled" (in other words, "crawl the entire website in a single round").
Here are the key steps and settings I have followed so far:
nutch-site.xml:

<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
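The property elements themselves are not shown above. As one point of reference, Nutch refuses to fetch at all unless `http.agent.name` is set, so a minimal override file looks something like this (the agent name is a placeholder):

```xml
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- Any identifying string works; "mycrawler" is a placeholder. -->
    <value>mycrawler</value>
  </property>
</configuration>
```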
regex-urlfilter.txt:

# Already tried these two filters (one at a time,
# and each one combined with the 'anything else' one)
# accept anything else
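For comparison, a regex-urlfilter.txt that keeps the crawl on a single site usually pairs a domain rule with a final catch-all; the domain below is a placeholder, not the one from my seed.txt:

```
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# accept anything under the target site (example.com is a placeholder)
+^https?://([a-z0-9.-]*\.)?example\.com/
# reject everything else (the default filter ends with '+.' instead,
# which accepts external links too)
-.
```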
bin/nutch inject urls
bin/nutch generate -topN 50000
bin/nutch fetch -all
bin/nutch parse -all
bin/nutch updatedb -all
bin/crawl urls whads 1
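The five individual commands make up one crawl round, and the last argument to `bin/crawl` is simply the number of rounds to run, so `bin/crawl urls whads 50` would run up to 50 of them. Since the bundled `bin/crawl` script stops early when `generate` exits non-zero (nothing new to fetch), the manual commands can be looped the same way. A sketch, assuming your Nutch build follows that exit-status convention; `MAX_ROUNDS` is just a safety cap I made up:

```shell
# Keep running rounds until generate selects nothing new.
# Assumes 'bin/nutch generate' exits non-zero when there are no new URLs,
# as the bundled bin/crawl script expects. MAX_ROUNDS caps the loop.
MAX_ROUNDS=${MAX_ROUNDS:-50}
round=1
while [ "$round" -le "$MAX_ROUNDS" ]; do
  echo "--- round $round ---"
  if ! bin/nutch generate -topN 50000; then
    echo "generate selected no new URLs; stopping after round $round"
    break
  fi
  bin/nutch fetch -all
  bin/nutch parse -all
  bin/nutch updatedb -all
  round=$((round + 1))
done
```

This still crawls round by round; it just stops by itself instead of requiring a guess at the right round count up front.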
After playing around with Nutch for a few more days, trying everything I found on the Internet, I ended up giving up. Some people said it is no longer possible to crawl an entire website in one go with Nutch. So, in case anyone with the same problem stumbles upon this question, do what I did: drop Nutch and use something like Scrapy (Python). You need to set up the spiders manually, but it works like a charm, is far more extensible and faster, and the results are better.