eyberg - 1 year ago
Java Question

How do I split sentences?

So, I found and am currently using the Stanford Parser, and it works GREAT for splitting sentences. Most of our sentences are from AP, so it works very well for that task.

Here are the problems:

  • it eats a LOT of memory (around 600 MB)

  • it really screws up the formatting of a body of text, so I have to handle a lot of edge cases later on. (The document pre-processor API calls don't let you specify ASCII/UTF-8 quotes, they immediately go to LaTeX style; contractions get split into different words (obviously); and spurious spaces are put in different places.)

To this end, I've already written multiple patches to compensate for what I really shouldn't be having to do.

Basically it's at the point where it is just as much of a hindrance to use as the problem of splitting sentences to begin with.

What are my other options? Any other NLP type of frameworks out there that might help out?

My original problem is just being able to detect sentence boundaries with a high degree of probability.

Answer

If you want to try sticking with the Stanford Tokenizer/Parser, look at the documentation page for the tokenizer.

If you just want to split sentences, you don't need to invoke the parser proper, and so you should be able to get away with a tiny amount of memory - a megabyte or two - by directly using DocumentPreprocessor.

While there is only limited customization of the tokenizer available, you can change the processing of quotes. You might want to try one of:


The first means no quote mapping of any kind; the second changes single or doubled ASCII quotes (if any) into left and right quotes as best it can.
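As a rough sketch of what splitting sentences directly with DocumentPreprocessor and a custom tokenizer option string could look like: the class name `SentenceSplitSketch`, the sample text, and the particular option string are my assumptions, not something taken from the answer. The quote option names are the ones I know from the 3.x PTBTokenizer documentation; I believe newer releases document a single `quotes=` option instead, so check the Javadoc for the version you have.

```java
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.HasWord;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.DocumentPreprocessor;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.TokenizerFactory;

import java.io.StringReader;
import java.util.List;

public class SentenceSplitSketch {
    public static void main(String[] args) {
        // Sample input, made up for illustration.
        String text = "She said \"stop.\" He didn't listen. The AP reported it.";

        // ASSUMPTION: option names as documented for 3.x PTBTokenizer.
        // This string asks for no quote mapping of any kind.
        String options = "unicodeQuotes=false,latexQuotes=false,asciiQuotes=false";

        // Tokenizer only; the parser model is never loaded here.
        TokenizerFactory<CoreLabel> factory =
                PTBTokenizer.factory(new CoreLabelTokenFactory(), options);

        DocumentPreprocessor splitter =
                new DocumentPreprocessor(new StringReader(text));
        splitter.setTokenizerFactory(factory);

        // Each element the iterator yields is one sentence, as a token list.
        for (List<HasWord> sentence : splitter) {
            StringBuilder line = new StringBuilder();
            for (HasWord token : sentence) {
                if (line.length() > 0) {
                    line.append(' ');
                }
                line.append(token.word());
            }
            System.out.println(line);
        }
    }
}
```

This needs the Stanford parser or CoreNLP jar on the classpath. It prints the tokens joined by single spaces, so contractions still show up split; that is the tokenizer's normal behavior, not something the quote options change.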

And while the tokenizer splits words in various ways to match Penn Treebank conventions, you should be able to construct precisely the original text from the tokens returned (see the various other fields in the CoreLabel). Otherwise it's a bug.
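To illustrate the round-trip idea without pulling in the library, here is a minimal, self-contained sketch. The `Token` class is a hypothetical stand-in, not the real CoreLabel; its fields only mimic the kind of information CoreLabel exposes (a normalized word, the original characters, and the whitespace that followed). The point is that normalization and splitting affect `word`, while the original text can still be rebuilt from the other fields.

```java
import java.util.Arrays;
import java.util.List;

public class InvertibleTokens {

    /** Hypothetical stand-in for a token that remembers its source text. */
    static final class Token {
        final String word;          // normalized form, e.g. after quote mapping
        final String originalText;  // exact characters taken from the input
        final String after;         // whitespace that followed this token

        Token(String word, String originalText, String after) {
            this.word = word;
            this.originalText = originalText;
            this.after = after;
        }
    }

    /** Rebuilds the input by ignoring word and joining originalText + after. */
    static String reconstruct(String leadingWhitespace, List<Token> tokens) {
        StringBuilder sb = new StringBuilder(leadingWhitespace);
        for (Token t : tokens) {
            sb.append(t.originalText).append(t.after);
        }
        return sb.toString();
    }

    /** Hand-built tokens for the input: "It didn't work." (with the quotes). */
    static List<Token> sample() {
        return Arrays.asList(
                new Token("``", "\"", ""),   // opening quote mapped to LaTeX style
                new Token("It", "It", " "),
                new Token("did", "did", ""), // contraction split into two tokens
                new Token("n't", "n't", " "),
                new Token("work", "work", ""),
                new Token(".", ".", ""),
                new Token("''", "\"", ""));  // closing quote mapped to LaTeX style
    }

    public static void main(String[] args) {
        // prints "It didn't work." including the surrounding ASCII quotes
        System.out.println(reconstruct("", sample()));
    }
}
```

With the real library you would read the equivalent values from each CoreLabel rather than building tokens by hand.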