I'm using Nutch to crawl some websites; specifically, I am crawling this site.
I now have these five segments with all the documents found (around 10,000 documents). Now I want to process the content of the documents without using the
What do you need to do with the content? You could, for instance, write a custom IndexWriter. It would be invoked during the indexing step and would give you access to the content. Alternatively, look at the 'dump' command (org.apache.nutch.tools.FileDumper) and modify its code.
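If the 'dump' route is enough for you, the post-processing can live entirely outside Nutch. Below is a minimal sketch, assuming the dump step has already written the fetched documents as plain files into a local directory. The directory name `dump_out`, the class name `DumpedContentProcessor`, and the token count used as the "processing" step are all hypothetical placeholders, not part of Nutch.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: walks a directory of dumped documents and
// runs a placeholder processing step (a token count) on each file.
public class DumpedContentProcessor {

    // Returns, per file (path relative to dumpDir), the number of
    // whitespace-separated tokens found in its content.
    static Map<String, Integer> tokenCounts(Path dumpDir) throws IOException {
        List<Path> files;
        try (Stream<Path> paths = Files.walk(dumpDir)) {
            files = paths.filter(Files::isRegularFile).collect(Collectors.toList());
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Path file : files) {
            // Assumption: content is text. Binary documents (PDF, images)
            // would need a real parser instead of this naive decoding.
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8).trim();
            int tokens = text.isEmpty() ? 0 : text.split("\\s+").length;
            counts.put(dumpDir.relativize(file).toString(), tokens);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // "dump_out" is an assumed default; pass the real dump directory as the first argument.
        Path dumpDir = Paths.get(args.length > 0 ? args[0] : "dump_out");
        tokenCounts(dumpDir).forEach((file, n) -> System.out.println(file + "\t" + n));
    }
}
```

Replace the token count with whatever you actually need to do per document. The point is only that, once the content is on disk, no Nutch or Hadoop classes are required to read it.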
BTW, 'Hadoop: The Definitive Guide' by Tom White has a nice chapter on the Nutch data structures.
If you want to do further processing of the pages, like NLP or classification, Behemoth can be used to convert Nutch segments into a 'neutral' data structure on HDFS, which can then be processed with various tools.