Gabriel Rodríguez - 1 month ago
Apache Configuration Question

Why does Nutch (v2.3) crawl only the seed URL, instead of crawling an entire website?

I am trying to crawl an entire, specific website (ignoring external links) using Nutch 2.3 with HBase 0.94.14.

I have followed a step-by-step tutorial (which you can find here) on how to set up and use these tools. However, I haven't been able to achieve my goal: instead of crawling the entire website whose URL I've written in the seed.txt file, Nutch only retrieves that base URL in the first round, and I need to run further crawls for Nutch to retrieve more URLs.

The problem is that I don't know how many rounds I need in order to crawl the entire website, so I need a way to tell Nutch to "keep crawling until the entire website has been crawled" (in other words, to "crawl the entire website in a single round").

Here are the key steps and settings I have followed so far:


  1. Put the base URL in the seed.txt file:

    http://www.whads.com/

  2. Set up Nutch's nutch-site.xml configuration file. After finishing the tutorial, I added a few more properties following suggestions from other StackOverflow questions (none of them, however, seem to have solved the problem for me).

    <?xml version="1.0"?>
    <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

    <!-- Put site-specific property overrides in this file. -->
    <configuration>
      <property>
        <name>http.agent.name</name>
        <value>test-crawler</value>
      </property>
      <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.hbase.store.HBaseStore</value>
      </property>
      <property>
        <name>plugin.includes</name>
        <value>protocol-httpclient|urlfilter-regex|index-(basic|more)|query-(basic|site|url|lang)|indexer-solr|nutch-extensionpoints|protocol-httpclient|urlfilter-regex|parse-(text|html|msexcel|msword|mspowerpoint|pdf)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metadata)</value>
      </property>
      <property>
        <name>db.ignore.external.links</name>
        <value>true</value>
      </property>
      <property>
        <name>db.ignore.internal.links</name>
        <value>false</value>
      </property>
      <property>
        <name>fetcher.max.crawl.delay</name>
        <value>-1</value>
      </property>
      <property>
        <name>fetcher.threads.per.queue</name>
        <value>50</value>
      </property>
      <property>
        <name>generate.count.mode</name>
        <value>host</value>
      </property>
      <property>
        <name>generate.max.count</name>
        <value>-1</value>
      </property>
    </configuration>

  3. Added the "accept anything else" rule to Nutch's regex-urlfilter.txt configuration file, following suggestions from StackOverflow and Nutch's mailing list.

    # Already tried these two filters (one at a time,
    # and each one combined with the 'anything else' one)
    #+^http://www.whads.com
    #+^http://([a-z0-9]*.)*whads.com/

    # accept anything else
    +.

  4. Crawling: I have tried two different approaches, both yielding the same result, with only one URL generated and fetched in the first round (a sketch of what repeating these rounds looks like is shown after this list):


    • Using bin/nutch (following the tutorial):

      bin/nutch inject urls
      bin/nutch generate -topN 50000
      bin/nutch fetch -all
      bin/nutch parse -all
      bin/nutch updatedb -all

    • Using bin/crawl:

      bin/crawl urls whads 1

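For clarity, by "running further crawls" I mean repeating the generate/fetch/parse/updatedb cycle from the bin/nutch approach above. A rough sketch of what that looks like (the round count of 10 is an arbitrary guess, not a number I know to be enough; if I read the crawl script correctly, bin/crawl urls whads 10 would be the equivalent shortcut):

    # Sketch only: repeat the crawl round a fixed number of times.
    # 10 is an arbitrary guess, not a value known to cover the whole site.
    for i in $(seq 1 10); do
        bin/nutch generate -topN 50000
        bin/nutch fetch -all
        bin/nutch parse -all
        bin/nutch updatedb -all
    done
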
Am I still missing something? Am I doing something wrong? Or is it that Nutch can't crawl an entire website in one go?

Thank you so much in advance!

Answer

After playing around with Nutch for a few more days and trying everything I could find on the Internet, I ended up giving up. Some people said it is no longer possible to crawl an entire website in one go with Nutch. So, in case anyone having the same problem stumbles upon this question, do what I did: drop Nutch and use something like Scrapy (Python). You need to set up the spiders manually, but it works like a charm, is far more extensible and faster, and the results are better.
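
For anyone taking that route, here is a minimal sketch of such a spider (only the whads.com domain and the "ignore external links" requirement come from the question; the class name, callback, and yielded fields are just illustrative):

    # Minimal Scrapy CrawlSpider sketch: crawl one site and ignore external links.
    # WhadsSpider, parse_item and the yielded fields are illustrative names, not from the post.
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule


    class WhadsSpider(CrawlSpider):
        name = "whads"
        allowed_domains = ["whads.com"]        # Scrapy's counterpart to db.ignore.external.links=true
        start_urls = ["http://www.whads.com/"]

        # Follow every internal link and hand each fetched page to parse_item.
        rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

        def parse_item(self, response):
            yield {
                "url": response.url,
                "title": response.css("title::text").extract_first(),
            }

Something like scrapy runspider whads_spider.py -o pages.json then crawls the whole site and writes one record per page (the file names are, again, just examples).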