user3621749 - 5 months ago
Java Question

Stanford NLP pipeline – sequential processing (in Java)

How do I correctly use the Stanford NLP pipeline for two-phase annotation?

In the first phase I need only tokenization and sentence splitting, so I use this code:

private Annotation annotatedDocument = null;
private StanfordCoreNLP pipeline = null;

public void firstPhase() {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit");

    pipeline = new StanfordCoreNLP(props);
    annotatedDocument = new Annotation(textDocument);
    // Run tokenization and sentence splitting over the document
    pipeline.annotate(annotatedDocument);
}

The second phase is optional, so I don't run all the annotators in the first phase. The second-phase code:

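public void secondPhase() {
    // Part-of-speech tagging
    POSTaggerAnnotator posTaggerAnot = new POSTaggerAnnotator();
    posTaggerAnot.annotate(annotatedDocument);

    // Lemmatization
    MorphaAnnotator morphaAnot = new MorphaAnnotator();
    morphaAnot.annotate(annotatedDocument);
}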

First question: Is this approach of using "stand-alone" annotators in the second phase correct? Or is there a way to reuse the existing pipeline?

Second question: I have a problem with the coreference annotator. I would like to use it in the second phase as follows:

CorefAnnotator coref = new CorefAnnotator(new Properties());

But this constructor never seems to finish. A constructor without properties doesn't exist, right? Is some properties setting necessary?


There are [at least] 3 ways you can do this:

  1. The way you described. It's perfectly valid to just call individual annotators, and chain them together. The coref annotator should work with empty properties -- perhaps you need more memory? It's a bit slow to load, and the models are not small.

  2. If you want to keep using a pipeline, you can create a partial pipeline and set the property enforceRequirements=false. This will do the chaining of annotators for you, but doesn't require their requirements to be satisfied -- i.e., if you know some annotations are already there, you don't have to re-run their corresponding annotators.

  3. This is a bigger change, but the Simple API (the edu.stanford.nlp.simple package) does this sort of lazy evaluation automatically. You can just create a Document object, and when you request various annotations, it will compute them lazily.
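
For option 2, a minimal sketch of the second-phase configuration might look like the following. The class name SecondPhaseConfig and its build() method are hypothetical helpers, not part of CoreNLP, and the annotator list "pos, lemma" is an assumption taken from the tagging and lemmatization steps in the question. Only the enforceRequirements property comes from the answer itself. The sketch builds just the Properties object so that it runs without CoreNLP on the classpath; the pipeline calls are shown as comments.

```java
import java.util.Properties;

// Hypothetical helper (not part of CoreNLP): builds the settings for a
// second-phase pipeline that assumes tokenize and ssplit have already run.
class SecondPhaseConfig {

    static Properties build() {
        Properties props = new Properties();
        // Only the annotators that the first phase did not run (assumed list)
        props.setProperty("annotators", "pos, lemma");
        // Property named in option 2: do not check annotator prerequisites
        props.setProperty("enforceRequirements", "false");
        return props;
    }

    public static void main(String[] args) {
        Properties props = build();
        // With CoreNLP on the classpath, the next step would be roughly:
        //   StanfordCoreNLP secondPipeline = new StanfordCoreNLP(props);
        //   secondPipeline.annotate(annotatedDocument);
        System.out.println(props.getProperty("annotators"));
        System.out.println(props.getProperty("enforceRequirements"));
    }
}
```

The idea is that the second pipeline is handed the Annotation object already populated by the first phase, so the tokens and sentences it relies on are present even though this pipeline never ran tokenize or ssplit itself.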