Belphegor - 1 month ago
Java Question

Start/end of regex in TokensRegex

Suppose I have the following code:

TokenSequencePattern p = TokenSequencePattern.compile("[{tag:/JJ.*/}] [{tag:/NN.*/}]");
TokenSequenceMatcher m = p.getMatcher(coreLabelList);
while (m.find()) {
    // Tokens covered by the current match
    List<CoreMap> matches = m.groupNodes();
}


What I would like to capture here is an adjective followed by a noun, i.e. the sequence must start with one adjective and end with one noun. For example, "beautiful scarf" should be a match, but "beautiful scarf with white dots" should not. Right now, the token regex above matches both phrases. How do I specify the exact start of a sequence and its exact end?

Answer

You may use

TokenSequencePattern p = TokenSequencePattern.compile("[tag:/JJ.*/] [tag:/NN.*/]");

Testing with "A round ball is bouncing very high in to the blue sky." yielded the "round ball" and "blue sky" substrings.

To only get an entire string match, you need to use anchors if you want to use Matcher#find() (with Matcher#matches(), the anchors are implied).
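The difference between find() and matches() can be illustrated with a small string-level analogy. This sketch uses java.util.regex on a space-separated string of POS tags, not TokensRegex itself; the class name FindVsMatches, its helper methods, and the tag strings are hypothetical and chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// String-level analogy only: each "token" is a POS tag separated by a space.
class FindVsMatches {
    // Unanchored pattern: an adjective tag followed by a noun tag.
    static final Pattern ADJ_NOUN = Pattern.compile("JJ\\S* NN\\S*");

    // Collects every substring that find() locates anywhere in the input.
    static List<String> findAll(String tags) {
        List<String> hits = new ArrayList<>();
        Matcher m = ADJ_NOUN.matcher(tags);
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    // matches() succeeds only if the whole input fits the pattern.
    static boolean matchesWhole(String tags) {
        return ADJ_NOUN.matcher(tags).matches();
    }

    public static void main(String[] args) {
        String shortPhrase = "JJ NN";          // assumed tags for "beautiful scarf"
        String longPhrase = "JJ NN IN JJ NNS"; // assumed tags for "beautiful scarf with white dots"
        System.out.println(findAll(longPhrase));
        System.out.println(matchesWhole(shortPhrase));
        System.out.println(matchesWhole(longPhrase));
    }
}
```

In this analogy, find() still locates adjective-noun pairs inside the longer phrase, while matches() accepts only the input that consists of exactly one such pair.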

So, to match only a string like "round ball", that is, exactly one adjective followed by one noun and nothing else, you may use

TokenSequencePattern p = TokenSequencePattern.compile("^[tag:/JJ.*/] [tag:/NN.*/]$");

or

TokenSequencePattern p = TokenSequencePattern.compile("\\A[tag:/JJ.*/] [tag:/NN.*/]\\z");

The ^ and \A anchors match the beginning of a string (\A always matches only at the very beginning), and $ and \z match the end of a string (\z always matches only at the very end, while $, even if you are not using a multiline modifier, also allows a trailing newline after it).
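The anchor semantics described here are those of ordinary Java regular expressions, so they can be checked on plain strings. The sketch below uses java.util.regex rather than TokensRegex, and whether TokensRegex treats a trailing newline token the same way is not something it demonstrates; the class name AnchorDemo and the helper found are hypothetical.

```java
import java.util.regex.Pattern;

// Plain-string illustration of ^ / $ versus \A / \z (no multiline flag set).
class AnchorDemo {
    // Returns true if the regex is found somewhere in the input via find().
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Anchored patterns reject input with extra material after the noun tag.
        System.out.println(found("^JJ NN$", "JJ NN IN"));
        // $ tolerates a single trailing line terminator...
        System.out.println(found("^JJ NN$", "JJ NN\n"));
        // ...while \z requires the very end of the input.
        System.out.println(found("\\AJJ NN\\z", "JJ NN\n"));
        System.out.println(found("\\AJJ NN\\z", "JJ NN"));
    }
}
```

If trailing newlines can never occur in the input, the two anchor pairs behave the same; \A and \z are the stricter choice when they might.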

Note: the anchors were tested on CoreNLP 3.7.0. They don't work on some versions; for example, CoreNLP 3.5.1 throws an error: Lexical error at line 1, column 1. Encountered: "^" (94), after : ""