randombee - 4 months ago
Java Question

Inconsistent regex character classes in java

How does Java handle receiving an inconsistent regex Pattern? I am trying this:

Pattern p = Pattern.compile("[a-d[m-p][^d][m]]");
Matcher m = p.matcher("d");
System.out.println(m.matches());


for which I receive true. However, my character class contains [^d], so I expected d not to match. But since d is also covered by the range a-d, the match succeeds. So, how is the pattern parsed? Wouldn't it be better if it threw an exception?

Answer

The behavior is correct and documented:

Character classes may appear within other character classes, and may be composed by the union operator (implicit) and the intersection operator (&&).

Also see Java Character Classes reference:

[a-d[m-p]] a through d, or m through p: [a-dm-p] (union)

So, the pattern matches:

  • [ - start of character class
  • a-d - a through d OR
  • [m-p] - m through p OR
  • [^d] - any character except d OR
  • [m] - m
  • ] - end of the character class.

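Because d falls within a-d, the class matches it and matches() returns true. The nested [^d] only adds more characters to the union; it does not remove d from it. The pattern is valid, so no exception is thrown.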
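A minimal sketch of the union behavior described above, using only java.util.regex. The class name UnionClassDemo and the helper matchesUnion are hypothetical names introduced for illustration. It shows that a-d together with [^d] ends up accepting any single character:

```java
import java.util.regex.Pattern;

public class UnionClassDemo {
    // The pattern from the question: the nested classes are joined by implicit union.
    static final Pattern UNION = Pattern.compile("[a-d[m-p][^d][m]]");

    // Hypothetical helper: true if the whole input matches the class.
    static boolean matchesUnion(String s) {
        return UNION.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesUnion("d"));  // true: covered by a-d
        System.out.println(matchesUnion("z"));  // true: covered by [^d]
        System.out.println(matchesUnion("dd")); // false: the class consumes exactly one character
    }
}
```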

If you want to match a range of symbols except some of them, you need subtraction, which Java expresses as an intersection (&&) with a negated class:

[a-d[m-p][m]&&[^d]]

This regex won't match d: the union of a-d, [m-p] and [m] is now intersected with [^d], so d is excluded while a through c and m through p still match.
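The subtraction pattern can be checked with a short sketch along the same lines. The class name SubtractionClassDemo and the helper matchesMinusD are hypothetical names chosen for this example:

```java
import java.util.regex.Pattern;

public class SubtractionClassDemo {
    // Union of a-d, m-p and m, intersected with "anything but d".
    static final Pattern MINUS_D = Pattern.compile("[a-d[m-p][m]&&[^d]]");

    // Hypothetical helper: true if the whole input matches the class.
    static boolean matchesMinusD(String s) {
        return MINUS_D.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesMinusD("d")); // false: removed by &&[^d]
        System.out.println(matchesMinusD("a")); // true: still inside a-d
        System.out.println(matchesMinusD("m")); // true: inside m-p
        System.out.println(matchesMinusD("z")); // false: never part of the union
    }
}
```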
