Santiago Gil - 10 days ago
Java Question

How to read from Nutch segments without readseg command

I'm using Nutch to crawl some websites; specifically, I am crawling this site.

I have got these five segments with all the documents found (around 10,000 documents). Now I want to process the content of the documents without using the readseg command, that is, without dumping the segments into plain text.

For this, only the content subdirectory of each segment is useful to me (the tags and the content of the document).

I have realised that inside the content directory there are two more containers: data and index. However, I haven't found any explanation of them, or of how I can read them to process the content inside. I have also found some pointers in this question, but I have not yet understood the idea behind the algorithm.

How is the content stored in a Nutch segment, and how can it be read? I have linked the crawled website and the segments in case anyone wants to give a short example (not necessary).

Answer

What do you need to do with the content? You could, for instance, write a custom IndexWriter. It would be invoked during the indexing step and would give you access to the content. Alternatively, look at the 'dump' command (org.apache.nutch.tools.FileDumper) and modify its code.
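
If you go the route of adapting FileDumper, the first step is locating the files to read. Here is a minimal sketch, assuming the usual Nutch 1.x layout in which each segment holds content/part-*/data and content/part-*/index. Together these form a Hadoop MapFile: data is a SequenceFile whose keys are URLs and whose values are org.apache.nutch.protocol.Content records, and index is its lookup index. The code lists the data files using only the JDK. The class name SegmentContentLocator and the default path crawl/segments are made up for illustration, and the exact part-* directory names may differ between Nutch versions.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Nutch: finds the raw content files of a crawl.
final class SegmentContentLocator {

    /** Returns every <segment>/content/part-*/data file under segmentsRoot, sorted by path. */
    static List<Path> findContentDataFiles(Path segmentsRoot) throws IOException {
        List<Path> dataFiles = new ArrayList<>();
        try (DirectoryStream<Path> segments = Files.newDirectoryStream(segmentsRoot)) {
            for (Path segment : segments) {
                Path contentDir = segment.resolve("content");
                if (!Files.isDirectory(contentDir)) {
                    continue; // not a segment directory, or the segment has no stored content
                }
                // Assumption: partitions are named part-00000, part-r-00000, and so on.
                try (DirectoryStream<Path> parts = Files.newDirectoryStream(contentDir, "part-*")) {
                    for (Path part : parts) {
                        Path data = part.resolve("data");
                        if (Files.isRegularFile(data)) {
                            dataFiles.add(data);
                        }
                    }
                }
            }
        }
        Collections.sort(dataFiles);
        return dataFiles;
    }

    public static void main(String[] args) throws IOException {
        // Assumed default location; pass your own segments directory as the first argument.
        Path root = Paths.get(args.length > 0 ? args[0] : "crawl/segments");
        if (!Files.isDirectory(root)) {
            System.err.println("Not a directory: " + root);
            return;
        }
        for (Path data : findContentDataFiles(root)) {
            System.out.println(data);
        }
    }
}
```

Each path returned could then be opened with Hadoop's SequenceFile.Reader, reading one Content record per URL, which as far as I recall is what FileDumper itself does. That part needs the Hadoop and Nutch jars on the classpath, so it is left out of this sketch.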

BTW, 'Hadoop: The Definitive Guide' by Tom White has a nice chapter on the Nutch data structures.

If you want to do further processing of the pages, such as NLP or classification, Behemoth can be used to convert Nutch segments into a 'neutral' data structure on HDFS, which can then be processed with various tools.
