btaek btaek - 1 month ago 20

Apache Nutch 1.12 with Apache Solr 6.2.1 gives an error

I am using Apache Nutch 1.12 and Apache Solr 6.2.1 to crawl data on the internet and index it, and the combination gives an error: java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down

I have done the following, as described in the Nutch tutorial: https://wiki.apache.org/nutch/NutchTutorial


  • Copied Nutch's schema.xml into Solr's config folder

  • Placed a seed url (of a newspaper company) in urls/seed.txt of Nutch

  • Changed the http.content.limit value to "-1" in nutch-site.xml. Since the seed URL belongs to a newspaper company, I had to eliminate the HTTP content download size limit
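
For reference, the http.content.limit change in the last bullet would look roughly like the fragment below. This is a minimal sketch, assuming the usual property layout of Nutch configuration files; the description text is my own wording and is not copied from the Nutch distribution.

```xml
<configuration>
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <!-- Description written for this example (an assumption, not Nutch's own text) -->
    <description>A negative value disables truncation of downloaded content.</description>
  </property>
</configuration>
```

Any other properties already present in nutch-site.xml stay inside the same configuration element.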



When I run the following command, I get an error:

bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TSolr urls/ TestCrawl/ 2


Above, TSolr is just the name of the Solr Core as you can probably guess already.

I am pasting the error log in hadoop.log below:

2016-10-28 16:21:20,982 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: TestCrawl/crawldb
2016-10-28 16:21:20,982 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: TestCrawl/linkdb
2016-10-28 16:21:20,982 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: TestCrawl/segments/20161028161642
2016-10-28 16:21:46,353 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/staging/btaek1281422650/.staging/job_local1281422650_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-10-28 16:21:46,355 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/staging/btaek1281422650/.staging/job_local1281422650_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-10-28 16:21:46,415 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/local/localRunner/btaek/job_local1281422650_0001/job_local1281422650_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-10-28 16:21:46,416 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/local/localRunner/btaek/job_local1281422650_0001/job_local1281422650_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-10-28 16:21:46,565 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
2016-10-28 16:21:52,308 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: content dest: content
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: title dest: title
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: host dest: host
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: segment dest: segment
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: boost dest: boost
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: digest dest: digest
2016-10-28 16:21:52,383 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2016-10-28 16:21:52,424 INFO solr.SolrIndexWriter - Indexing 42/42 documents
2016-10-28 16:21:52,424 INFO solr.SolrIndexWriter - Deleting 0 documents
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: content dest: content
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: title dest: title
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: host dest: host
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: segment dest: segment
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: boost dest: boost
2016-10-28 16:21:53,468 INFO solr.SolrMappingReader - source: digest dest: digest
2016-10-28 16:21:53,469 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2016-10-28 16:21:53,472 INFO indexer.IndexingJob - Indexer: number of documents indexed, deleted, or skipped:
2016-10-28 16:21:53,476 INFO indexer.IndexingJob - Indexer: 42 indexed (add/update)
2016-10-28 16:21:53,477 INFO indexer.IndexingJob - Indexer: finished at 2016-10-28 16:21:53, elapsed: 00:00:32
2016-10-28 16:21:54,199 INFO indexer.CleaningJob - CleaningJob: starting at 2016-10-28 16:21:54
2016-10-28 16:21:54,344 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2016-10-28 16:22:19,739 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/staging/btaek1653313730/.staging/job_local1653313730_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-10-28 16:22:19,741 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/staging/btaek1653313730/.staging/job_local1653313730_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-10-28 16:22:19,797 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/local/localRunner/btaek/job_local1653313730_0001/job_local1653313730_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2016-10-28 16:22:19,799 WARN conf.Configuration - file:/tmp/hadoop-btaek/mapred/local/localRunner/btaek/job_local1653313730_0001/job_local1653313730_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2016-10-28 16:22:19,807 WARN output.FileOutputCommitter - Output Path is null in setupJob()
2016-10-28 16:22:25,113 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: content dest: content
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: title dest: title
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: host dest: host
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: segment dest: segment
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: boost dest: boost
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: digest dest: digest
2016-10-28 16:22:25,188 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2016-10-28 16:22:25,191 INFO solr.SolrIndexWriter - SolrIndexer: deleting 6/6 documents
2016-10-28 16:22:25,300 WARN output.FileOutputCommitter - Output Path is null in cleanupJob()
2016-10-28 16:22:25,301 WARN mapred.LocalJobRunner - job_local1653313730_0001
java.lang.Exception: java.lang.IllegalStateException: Connection pool shut down
at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: java.lang.IllegalStateException: Connection pool shut down
at org.apache.http.util.Asserts.check(Asserts.java:34)
at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169)
at org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202)
at org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184)
at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
at org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:480)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:150)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:483)
at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:464)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:190)
at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:178)
at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
at org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:120)
at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
2016-10-28 16:22:25,841 ERROR indexer.CleaningJob - CleaningJob: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:836)
at org.apache.nutch.indexer.CleaningJob.delete(CleaningJob.java:172)
at org.apache.nutch.indexer.CleaningJob.run(CleaningJob.java:195)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.CleaningJob.main(CleaningJob.java:206)


As you can see in the bin/crawl command above, I had Nutch run the crawl for 2 rounds. The error above occurs only on the second round (one level deeper than the seed site). Indexing succeeds on the first round, but after the crawl and parse for the second round, it throws the error and stops.

To try something different from the first run above, I did the following on a second run:


  • Deleted the TestCrawl folder to start crawling and indexing from scratch

  • Ran:
    bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TSolr urls/ TestCrawl/ 1
    ==> Note that I changed the number of rounds for Nutch to "1". This run crawls and indexes successfully

  • Then ran the same command again for the second round, to crawl one level deeper:
    bin/crawl -i -D solr.server.url=http://localhost:8983/solr/TSolr urls/ TestCrawl/ 1
    ==> This gives me the same error as in the hadoop.log pasted above



Therefore, my Solr is NOT able to successfully index what Nutch crawled in the second round, one level deeper than the seed site.

Could the error be due to the size of the parsed content from the seed site? The seed site is a newspaper company's website, so I am sure that the second round (one level deeper) contains a huge amount of parsed data to index. If the issue is the parsed content size, how can I configure my Solr to fix the problem?

If the error is from something else, can someone please help me identify what it is and how to fix it?

Answer

For anyone who runs into the same problem, I thought I would post the solution that worked for me.

First of all, Apache Nutch 1.12 does not seem to support Apache Solr 6.X. According to the Apache Nutch 1.12 release notes, support for Apache Solr 5.X was recently added in Nutch 1.12, and support for Solr 6.X is NOT included. So, instead of Solr 6.2.1, I installed Apache Solr 5.5.3 to work with Apache Nutch 1.12.

As Jorge Luis pointed out, Apache Nutch 1.12 has a bug that causes an error when it works with Apache Solr. They will fix the bug and release Nutch 1.13 at some point, but I don't know when that will be, so I decided to fix the bug myself.

I got the error because, in CleaningJob.java (in Nutch), the close method is invoked first and the commit method after it. By the time commit runs, the connection pool has already been shut down, so the following exception is thrown: java.lang.IllegalStateException: Connection pool shut down.

The fix is actually quite simple. It is shown in this commit: https://github.com/apache/nutch/pull/156/commits/327e256bb72f0385563021995a9d0e96bb83c4f8

As you can see in the link above, you simply need to relocate the "writers.close();" call so that it runs after the commit.
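
To illustrate why the order of the two calls matters, here is a minimal, self-contained sketch. It is not the Nutch code: FakeIndexWriter and CloseOrderSketch are hypothetical names made up for this example, and the writer only imitates a client whose connection pool rejects requests after it has been closed.

```java
import java.util.ArrayList;
import java.util.List;

class CloseOrderSketch {
    /** Problematic order: the writer is closed first, so the later commit fails. */
    static void closeThenCommit(FakeIndexWriter writer) {
        writer.close();
        writer.commit(); // throws IllegalStateException
    }

    /** Corrected order: commit while the pool is still open, then close. */
    static void commitThenClose(FakeIndexWriter writer) {
        writer.commit();
        writer.close();
    }

    public static void main(String[] args) {
        try {
            closeThenCommit(new FakeIndexWriter());
        } catch (IllegalStateException e) {
            System.out.println("close-then-commit failed: " + e.getMessage());
        }

        FakeIndexWriter writer = new FakeIndexWriter();
        commitThenClose(writer);
        System.out.println("commit-then-close calls: " + writer.calls);
    }
}

/**
 * Hypothetical stand-in for an index writer (NOT Nutch's SolrIndexWriter).
 * Assumption: commit() needs a live connection pool and close() shuts it down.
 */
class FakeIndexWriter {
    private boolean poolOpen = true;
    final List<String> calls = new ArrayList<>();

    void commit() {
        if (!poolOpen) {
            // Same message as the exception in the hadoop.log above.
            throw new IllegalStateException("Connection pool shut down");
        }
        calls.add("commit");
    }

    void close() {
        poolOpen = false;
        calls.add("close");
    }
}
```

In the real CleaningJob.java the change is just the relocation of one line, as the linked commit shows; the sketch only demonstrates the failure mode in isolation.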

By the way, to apply the fix you need the Nutch source (src) package, NOT the binary package, because you won't be able to edit the CleaningJob.java file in the binary package. After making the change, run ant to rebuild, and you are all set.

After the fix, I no longer get the error!

Hope this helps anyone who is facing the problem that I was facing.