Lokesh Kakran - 1 year ago
Apache Configuration Question

How to crawl a particular website using Apache Nutch?

I followed the URL below and completed every step successfully up to "Step-by-Step: Invertlinks".


But I did not get any data out of those steps.

I am new to this technology.

Please give steps, a demo, a site, or an example if someone has done this successfully before.
Please do not give rough steps.

Answer Source

First, install Nutch.

In nutch-site.xml, under the <configuration> element, paste:

    <value>My Nutch Spider</value>
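The <value> line above looks like one line of a larger property element. A minimal sketch of a complete file, assuming the property is http.agent.name (the agent name Nutch requires before it will fetch) and writing to a scratch file called nutch-site.example.xml (a hypothetical name, chosen so that nothing in conf/ is overwritten):

```shell
# Write a sample nutch-site.xml to a scratch file (hypothetical file name).
# Assumption: the <value> line above belongs to the http.agent.name property.
cat > nutch-site.example.xml <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
EOF

# Print the result so it can be compared with the real conf/nutch-site.xml.
cat nutch-site.example.xml
```

If the sketch matches your setup, copy the property element into the real conf/nutch-site.xml by hand.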

In your nutch-default.xml, add the property that carries this description (http.robot.rules.whitelist in the stock file):

  <description>Comma separated list of hostnames or IP addresses to ignore
  robot rules parsing for. Use with care and only if you are explicitly
  allowed by the site owner to ignore the site's robots.txt!
  </description>

In regex-urlfilter.txt:

# accept anything else

Also comment out the rule that follows this comment (-[?*!@=] in the stock file):

# skip URLs containing certain characters as probable queries, etc.
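To keep the crawl on one particular site, the approach described in the Nutch tutorial is to replace the catch-all accept rule with a host-specific one. A rough sketch, assuming a hypothetical target of example.com. Note that grep -E only approximates the Java regular expressions Nutch uses, so treat this as a sanity check of the pattern rather than a test of Nutch itself:

```shell
# Hypothetical accept rule for regex-urlfilter.txt (example.com is a placeholder).
rule='+^https?://([a-z0-9-]+\.)*example\.com/'

# Strip the leading '+' (the "accept" marker in the filter file) to get the bare pattern.
pattern="${rule#+}"

# Check a few sample URLs against the pattern.
for url in "http://www.example.com/page" "https://example.com/" "http://other.org/"; do
  if printf '%s\n' "$url" | grep -Eq "$pattern"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

With this pattern the two example.com URLs are accepted and the other.org URL is rejected.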

Then run the commands below:

bin/nutch inject crawl/crawldb dmoz
bin/nutch inject crawl/crawldb urls
bin/nutch generate crawl/crawldb crawl/segments
# pick the newest segment directory created by generate
s1=$(ls -d crawl/segments/2* | tail -1)
echo "$s1"
bin/nutch fetch "$s1"
bin/nutch parse "$s1"
bin/nutch updatedb crawl/crawldb "$s1"

bin/nutch invertlinks crawl/linkdb -dir crawl/segments
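The inject commands above read their seed URLs from a directory (dmoz and urls here), so that directory has to exist before inject runs. A minimal sketch of creating urls, assuming a hypothetical seed of http://example.com/ and a list file named seed.txt:

```shell
# Create the seed directory that "bin/nutch inject crawl/crawldb urls" reads.
mkdir -p urls

# One URL per line; example.com is a placeholder for the site you want to crawl.
printf '%s\n' "http://example.com/" > urls/seed.txt

# Confirm what will be injected.
cat urls/seed.txt
```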

Now check your data in the crawl/crawldb folder and the other crawl folders (crawl/segments and crawl/linkdb).
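The files in those folders are Hadoop binary files, not readable text, which may be why the crawl seems to have produced no data. A hedged sketch of inspecting them with Nutch's own reader commands, assuming it is run from the same directory as the commands above. The output directory dump_segment is a hypothetical name:

```shell
# Assumes the current directory is the Nutch directory used for the crawl.
NUTCH=bin/nutch

if [ -x "$NUTCH" ] && [ -d crawl/crawldb ]; then
  # Summary of the crawl database: URL counts by fetch status.
  "$NUTCH" readdb crawl/crawldb -stats
  # Plain-text dump of the newest segment into dump_segment (hypothetical name).
  s1=$(ls -d crawl/segments/2* | tail -1)
  "$NUTCH" readseg -dump "$s1" dump_segment
  inspected=yes
else
  echo "bin/nutch or crawl/crawldb not found; run this from the Nutch directory after crawling"
  inspected=no
fi
```

If the stats show no fetched URLs, the seed list and the URL filter rules are the first things to recheck.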