zoozoofreak - 1 year ago
Java Question

How to train the Chunker in OpenNLP?

I need to train the Chunker in OpenNLP so that it identifies noun phrases in my training data. How do I proceed? The online documentation does not explain how to do this from within a program, without the command line. It says to use en-chunker.train, but how do you create that file?

EDIT: @Alaye
After running the code you gave in your answer, I get the following error that I cannot fix:

Indexing events using cutoff of 5

Computing event counts... done. 3 events
Dropped event B-NP:[w_2=bos, w_1=bos, w0=He, w1=reckons, w2=., w_1=bosw0=He, w0=Hew1=reckons, t_2=bos, t_1=bos, t0=PRP, t1=VBZ, t2=., t_2=bost_1=bos, t_1=bost0=PRP, t0=PRPt1=VBZ, t1=VBZt2=., t_2=bost_1=bost0=PRP, t_1=bost0=PRPt1=VBZ, t0=PRPt1=VBZt2=., p_2=bos, p_1=bos, p_2=bosp_1=bos, p_1=bost_2=bos, p_1=bost_1=bos, p_1=bost0=PRP, p_1=bost1=VBZ, p_1=bost2=., p_1=bost_2=bost_1=bos, p_1=bost_1=bost0=PRP, p_1=bost0=PRPt1=VBZ, p_1=bost1=VBZt2=., p_1=bost_2=bost_1=bost0=PRP, p_1=bost_1=bost0=PRPt1=VBZ, p_1=bost0=PRPt1=VBZt2=., p_1=bosw_2=bos, p_1=bosw_1=bos, p_1=bosw0=He, p_1=bosw1=reckons, p_1=bosw2=., p_1=bosw_1=bosw0=He, p_1=bosw0=Hew1=reckons]
Dropped event B-VP:[w_2=bos, w_1=He, w0=reckons, w1=., w2=eos, w_1=Hew0=reckons, w0=reckonsw1=., t_2=bos, t_1=PRP, t0=VBZ, t1=., t2=eos, t_2=bost_1=PRP, t_1=PRPt0=VBZ, t0=VBZt1=., t1=.t2=eos, t_2=bost_1=PRPt0=VBZ, t_1=PRPt0=VBZt1=., t0=VBZt1=.t2=eos, p_2=bos, p_1=B-NP, p_2=bosp_1=B-NP, p_1=B-NPt_2=bos, p_1=B-NPt_1=PRP, p_1=B-NPt0=VBZ, p_1=B-NPt1=., p_1=B-NPt2=eos, p_1=B-NPt_2=bost_1=PRP, p_1=B-NPt_1=PRPt0=VBZ, p_1=B-NPt0=VBZt1=., p_1=B-NPt1=.t2=eos, p_1=B-NPt_2=bost_1=PRPt0=VBZ, p_1=B-NPt_1=PRPt0=VBZt1=., p_1=B-NPt0=VBZt1=.t2=eos, p_1=B-NPw_2=bos, p_1=B-NPw_1=He, p_1=B-NPw0=reckons, p_1=B-NPw1=., p_1=B-NPw2=eos, p_1=B-NPw_1=Hew0=reckons, p_1=B-NPw0=reckonsw1=.]
Dropped event O:[w_2=He, w_1=reckons, w0=., w1=eos, w2=eos, w_1=reckonsw0=., w0=.w1=eos, t_2=PRP, t_1=VBZ, t0=., t1=eos, t2=eos, t_2=PRPt_1=VBZ, t_1=VBZt0=., t0=.t1=eos, t1=eost2=eos, t_2=PRPt_1=VBZt0=., t_1=VBZt0=.t1=eos, t0=.t1=eost2=eos, p_2B-NP, p_1=B-VP, p_2B-NPp_1=B-VP, p_1=B-VPt_2=PRP, p_1=B-VPt_1=VBZ, p_1=B-VPt0=., p_1=B-VPt1=eos, p_1=B-VPt2=eos, p_1=B-VPt_2=PRPt_1=VBZ, p_1=B-VPt_1=VBZt0=., p_1=B-VPt0=.t1=eos, p_1=B-VPt1=eost2=eos, p_1=B-VPt_2=PRPt_1=VBZt0=., p_1=B-VPt_1=VBZt0=.t1=eos, p_1=B-VPt0=.t1=eost2=eos, p_1=B-VPw_2=He, p_1=B-VPw_1=reckons, p_1=B-VPw0=., p_1=B-VPw1=eos, p_1=B-VPw2=eos, p_1=B-VPw_1=reckonsw0=., p_1=B-VPw0=.w1=eos]
Indexing... done.
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:653)
at java.util.ArrayList.get(ArrayList.java:429)
at opennlp.tools.ml.model.AbstractDataIndexer.sortAndMerge(AbstractDataIndexer.java:89)
at opennlp.tools.ml.model.TwoPassDataIndexer.<init>(TwoPassDataIndexer.java:105)
at opennlp.tools.ml.AbstractEventTrainer.getDataIndexer(AbstractEventTrainer.java:74)
at opennlp.tools.ml.AbstractEventTrainer.train(AbstractEventTrainer.java:91)
at opennlp.tools.ml.model.TrainUtil.train(TrainUtil.java:53)
at opennlp.tools.chunker.ChunkerME.train(ChunkerME.java:253)
at com.oracle.crm.nlp.CustomChunker2.main(CustomChunker2.java:91)
Sorting and merging events... Process exited with exit code 1.

(My en-chunker.train contained only the first two lines and the last line of your sample data set.)
Could you please tell me why this is happening and how to fix it?

EDIT2: I got the Chunker to work. However, it throws an error when I change the sentence in the training set to any sentence other than the one you gave in your answer. Can you tell me why that could be happening?

Answer Source

According to the OpenNLP documentation, a sample sentence of the training data looks like this:

He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
current   JJ   I-NP
account   NN   I-NP
deficit   NN   I-NP
will      MD   B-VP
narrow    VB   I-VP
to        TO   B-PP
only      RB   B-NP
#         #    I-NP
1.8       CD   I-NP
billion   CD   I-NP
in        IN   B-PP
September NNP  B-NP
.         .    O
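
The OpenNLP manual describes this format as three columns (word, part-of-speech tag, chunk tag) separated by a single space, with an empty line after each sentence; the sample above is padded for readability. If training fails after you change the sentences, it may be worth checking the file for malformed lines first. Here is a rough sketch of such a check using only the JDK. The class name ChunkTrainFormatCheck and its rules are my own assumptions, not OpenNLP's parser.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ChunkTrainFormatCheck {

    /** Returns a description of every line that does not look like "word POS chunkTag". */
    public static List<String> findProblems(List<String> lines) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            if (line.isEmpty()) {
                continue; // an empty line separates sentences
            }
            // Split on a single space, so padding shows up as extra (empty) columns.
            String[] parts = line.split(" ");
            if (parts.length != 3) {
                problems.add("line " + (i + 1) + ": expected 3 columns, found " + parts.length);
            } else if (!parts[2].matches("O|[BI]-\\S+")) {
                problems.add("line " + (i + 1) + ": unexpected chunk tag '" + parts[2] + "'");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "He PRP B-NP",
                "reckons   VBZ  B-VP", // padded with extra spaces
                ". . O",
                "");
        for (String problem : findProblems(lines)) {
            System.out.println(problem); // prints "line 2: expected 3 columns, found 6"
        }
    }
}
```

To check a real file, the lines could be read with Files.readAllLines and passed to findProblems in the same way.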

Write your training sentences in this format to create your en-chunker.train file. You can then build the corresponding .bin model file using the CLI:

$ opennlp ChunkerTrainerME -model en-chunker.bin -lang en -data en-chunker.train -encoding UTF-8

or using the API:

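import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.Charset;

import opennlp.tools.chunker.ChunkSample;
import opennlp.tools.chunker.ChunkSampleStream;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.chunker.DefaultChunkerContextGenerator;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class SentenceTrainer {
  public static void trainModel(String inputFile, String modelFile)
      throws IOException {

    // Read the training data from inputFile (not a hard-coded name), line by line.
    MarkableFileInputStreamFactory factory = new MarkableFileInputStreamFactory(
        new File(inputFile));
    Charset charset = Charset.forName("UTF-8");
    ObjectStream<String> lineStream = new PlainTextByLineStream(factory, charset);
    ObjectStream<ChunkSample> sampleStream = new ChunkSampleStream(lineStream);

    ChunkerModel model;
    try {
      // This overload is deprecated; check the ChunkerME Javadoc of your OpenNLP version.
      model = ChunkerME.train("en", sampleStream,
          new DefaultChunkerContextGenerator(), TrainingParameters.defaultParams());
    } finally {
      sampleStream.close();
    }

    // Write the trained model to modelFile.
    OutputStream modelOut = null;
    try {
      modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
      model.serialize(modelOut);
    } finally {
      if (modelOut != null)
        modelOut.close();
    }
  }
}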
and the main method will be:

import java.io.IOException;

public class Main {

  public static void main(String[] args) throws IOException {
    String inputFile = "//path//to//data.train";
    String modelFile = "//path//to//.bin";

    SentenceTrainer.trainModel(inputFile, modelFile);
  }
}
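
About the IndexOutOfBoundsException in your edit: the log says "Indexing events using cutoff of 5" and then drops all 3 events, which fits TrainingParameters.defaultParams() using a cutoff of 5. With a training file of only a few lines, nothing is left to train on. The sketch below imitates that kind of frequency cutoff with plain JDK collections. It is not OpenNLP's indexer, and the class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CutoffDemo {

    /** Keeps only events that still have at least one feature seen `cutoff` times or more. */
    public static List<List<String>> applyCutoff(List<List<String>> events, int cutoff) {
        // First pass: count how often each feature occurs across all events.
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> event : events) {
            for (String feature : event) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        // Second pass: strip rare features and drop events that end up empty.
        List<List<String>> kept = new ArrayList<>();
        for (List<String> event : events) {
            List<String> frequent = new ArrayList<>();
            for (String feature : event) {
                if (counts.get(feature) >= cutoff) {
                    frequent.add(feature);
                }
            }
            if (!frequent.isEmpty()) {
                kept.add(frequent);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Three events, as in the log; the feature lists are abbreviated.
        List<List<String>> events = Arrays.asList(
                Arrays.asList("w0=He", "t0=PRP", "p_1=bos"),
                Arrays.asList("w0=reckons", "t0=VBZ", "p_1=B-NP"),
                Arrays.asList("w0=.", "t0=.", "p_1=B-VP"));
        System.out.println(applyCutoff(events, 5).size()); // 0: nothing reaches the cutoff
        System.out.println(applyCutoff(events, 1).size()); // 3: every event survives
    }
}
```

In OpenNLP itself the threshold is a training parameter, so either train on many more sentences or lower it, for example by calling params.put(TrainingParameters.CUTOFF_PARAM, "1") on the TrainingParameters object before passing it to ChunkerME.train. Please check the exact method against the Javadoc of your OpenNLP version.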

reference: this blog

hope this helps!

PS: Write the data in the format shown above to a .txt file and rename it with a .train extension; a plain name such as trainingdata.txt will work too. That is how you make a .train file.
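
If you would rather generate the file from code than type it by hand, something like this sketch works with the JDK alone. The class name TrainFileWriter, the array-based input and the output file name are placeholders of mine. It writes one "word POS chunk-tag" line per token, separated by single spaces, and an empty line after each sentence.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class TrainFileWriter {

    /** Turns parallel arrays of words, POS tags and chunk tags into training-file lines. */
    public static List<String> toLines(String[] words, String[] posTags, String[] chunkTags) {
        if (words.length != posTags.length || words.length != chunkTags.length) {
            throw new IllegalArgumentException("all three arrays must have the same length");
        }
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < words.length; i++) {
            lines.add(words[i] + " " + posTags[i] + " " + chunkTags[i]);
        }
        lines.add(""); // an empty line marks the end of the sentence
        return lines;
    }

    public static void main(String[] args) throws IOException {
        List<String> lines = new ArrayList<>();
        lines.addAll(toLines(
                new String[] {"He", "reckons", "."},
                new String[] {"PRP", "VBZ", "."},
                new String[] {"B-NP", "B-VP", "O"}));
        // Placeholder file name; add more sentences with further toLines(...) calls.
        Files.write(Paths.get("en-chunker.train"), lines, StandardCharsets.UTF_8);
    }
}
```

A single short sentence like this is only for showing the layout; a usable model needs many annotated sentences.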