Mike Thomsen Mike Thomsen - 9 months ago 107
Java Question

OpenNLP vs Stanford CoreNLP

I've been doing a little comparison of these two packages and am not sure which direction to go in. What I am looking for briefly is:

  1. Named Entity Recognition (people, places, organizations and such).

  2. Gender identification.

  3. A decent training API.

From what I can tell, OpenNLP and Stanford CoreNLP expose pretty similar capabilities. However, Stanford CoreNLP looks like it has a lot more activity whereas OpenNLP has only had a few commits in the last six months.

Based on what I saw, OpenNLP appears to be easier to train new models and might be more attractive for that reason alone. However, my question is what would others start with as the basis for adding NLP features to a Java app? I'm mostly worried as to whether OpenNLP is "just mature" versus semi-abandoned.

Answer Source

In full disclosure, I'm a contributor to CoreNLP, so this is a biased answer. But, in my view on your three criteria:

  1. Named Entity Recognition: I think CoreNLP clearly wins here, both on accuracy and ease-of-use. For one, OpenNLP has a model per NER tag, whereas CoreNLP detects all tags with a single Annotator. Furthermore, temporal resolution with SUTime is a nice perk in CoreNLP. Accuracy-wise, my anecdotal experience is that CoreNLP does better on general-purpose text.

  2. Gender identification. I think both tools are kind of poorly documented on this front. OpenNLP seems to have a GenderModel class; CoreNLP has a gender Annotator.

  3. Training API. I suspect the OpenNLP training API is easier-to-use for not off-the-shelf training. But, if all you want to do is, e.g., train a model from a CoNLL file, both should be straightforward. Training speed tends to be faster with CoreNLP than other tools I've tried, but I haven't benchmarked it formally, so take that with a grain of salt.