
Stanford NLP pipeline – sequential processing (in Java)

How to correctly use Stanford NLP pipeline for two-phase annotation?




In the first phase I need only tokenization and sentence splitting, so I use this code:

private Annotation annotatedDocument = null;
private StanfordCoreNLP pipeline = null;

...

public void firstPhase() {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit");

    pipeline = new StanfordCoreNLP(props);
    annotatedDocument = new Annotation(textDocument);
    // Run tokenization and sentence splitting on the document
    pipeline.annotate(annotatedDocument);
}


The second phase is optional, so I don't run all the annotators in the first phase. The second phase code:

public void secondPhase() {
    // POS tagging
    POSTaggerAnnotator posTaggerAnot = new POSTaggerAnnotator();
    posTaggerAnot.annotate(annotatedDocument);

    // Lemmatization
    MorphaAnnotator morphaAnot = new MorphaAnnotator();
    morphaAnot.annotate(annotatedDocument);
}





First question: Is this approach of using "stand-alone" annotators in the second phase correct? Or is there a way to reuse the existing pipeline?

Second question: I have a problem with the coreference annotator. I would like to use it in the second phase as follows:

CorefAnnotator coref = new CorefAnnotator(new Properties());


But this constructor never seems to finish. A constructor without properties doesn't exist, right? Are some property settings necessary?

Answer

There are [at least] 3 ways you can do this:

  1. The way you described. It's perfectly valid to just call individual annotators, and chain them together. The coref annotator should work with empty properties -- perhaps you need more memory? It's a bit slow to load, and the models are not small.

  2. If you want to keep using a pipeline, you can create a partial pipeline and set the property enforceRequirements=false. This will do the chaining of annotators for you, but won't insist that each annotator's prerequisites be satisfied -- i.e., if you know some annotations are already there, you don't have to re-run the annotators that produce them (see the first sketch below).

  3. This is a bigger change, but the simple API actually does this sort of lazy evaluation automatically. So, you can just create a Document object, and when you request various annotations, it'll lazily fault them in (see the second sketch below).
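
For option 2, here's a minimal sketch of a second-phase pipeline with enforceRequirements=false. It assumes the document already carries the tokenize/ssplit annotations from your first phase:

Properties props = new Properties();
props.setProperty("annotators", "pos, lemma");
// Don't check that tokenize/ssplit are part of this pipeline --
// their annotations are already on the document from the first phase.
props.setProperty("enforceRequirements", "false");

StanfordCoreNLP secondPipeline = new StanfordCoreNLP(props);
secondPipeline.annotate(annotatedDocument);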
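
For option 3, a sketch using the simple API (edu.stanford.nlp.simple); each annotation is computed lazily the first time you ask for it:

import edu.stanford.nlp.simple.Document;
import edu.stanford.nlp.simple.Sentence;

Document doc = new Document("Add your text here. It can contain multiple sentences.");
for (Sentence sent : doc.sentences()) {  // tokenization + sentence splitting run here
    System.out.println(sent.posTags());  // POS tagging is faulted in on first request
    System.out.println(sent.lemmas());   // lemmatization is faulted in on first request
}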