matz3 - 3 months ago
Java Question

Java/QuickRex regex: missing character in group when using negative lookahead

In Java (using Eclipse Quickrex plugin to test) I'm using the following expression:

(^[\.\(&\)]*)(.*)(?!([\.\(&/\)]*$))


to match the text:

.(&&()..ABC----....D25..../)(&


The expected goal is to match three groups:

(1)
.(&&()..


(2)
ABC----....D25


(3)
..../)(&


The goal is to continue working with group no. 2 only, cutting off the preceding group no. 1 and the trailing group no. 3. A requirement is that the user defines all three regular expressions himself, in three separate GUI fields.

What is happening: QuickRex reports the groups, but group no. 2 comes out as
ABC----....D2
The "5" at the end is missing, and it does not appear in group no. 3 either:

[.(&&()..][ABC----....D2]
5
[..../)(&]


Environment: Eclipse Mars 4.5.2, Java 1.8.0_66, QuickRex 4.3.0

Two Questions:

Is this the proper way to match these groups?

Is there a logical reason why the "5" is not included, or is this a bug in the regex engine?

Answer

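The 5 is not included because of the negative lookahead (?![.(&/)]*$). The greedy .* first consumes the rest of the string, but at every position from the end of the string back to just after the 5, the remaining text consists only of ., (, &, / and ) characters up to the end, so the lookahead rejects the match and the engine backtracks. The first position where the lookahead succeeds is right after the 2: the next character, 5, is not in the character class, so [.(&/)]*$ cannot match there. This is how the pattern is supposed to behave, not a bug in the regex engine.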
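
The backtracking described above can be reproduced outside QuickRex. The following is a minimal sketch using plain java.util.regex with the pattern and input from the question; the class name LookaheadBacktrackDemo and the helper middleGroup are made up for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows where the greedy .* ends up after backtracking.
class LookaheadBacktrackDemo {
    // The pattern from the question, unchanged (backslashes doubled for Java).
    static final Pattern ORIGINAL =
            Pattern.compile("(^[\\.\\(&\\)]*)(.*)(?!([\\.\\(&/\\)]*$))");

    // Returns the text captured by group 2, or null if the pattern does not match.
    static String middleGroup(String input) {
        Matcher m = ORIGINAL.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    public static void main(String[] args) {
        String input = ".(&&()..ABC----....D25..../)(&";
        // .* backs off until the lookahead succeeds, which is right after the "2".
        System.out.println(middleGroup(input)); // prints: ABC----....D2
    }
}
```

Group 3 sits inside the negative lookahead, so it never holds a value when the overall match succeeds.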

To capture the groups (groups 1 and 3 will be discarded anyway, as you mention), turn the greedy * quantifier in the second group into a lazy one, *?, so that it matches as few characters (other than a newline) as possible before the first occurrence of the subsequent subpattern. Also turn the negative lookahead into a regular capturing group, so that .*? stops right before the trailing pattern:

^([.(&)]*)(.*?)([.(&/)]*$)


Details:

  • ^ - start of string/line
  • ([.(&)]*) - Group 1, capturing zero or more ., (, & or ) characters
  • (.*?) - Group 2, capturing any 0+ characters other than a newline, as few as possible, up to the first position where the following subpattern matches
  • ([.(&/)]*$) - Group 3, capturing zero or more ., (, &, / or ) characters up to the end of the string.
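
As a quick check, the corrected pattern can be run against the sample text from the question in plain Java. This is only a sketch; the class name ThreeGroupSplit and the helper split are hypothetical names chosen for the example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: splits the input into prefix, middle and suffix.
class ThreeGroupSplit {
    // The corrected pattern: lazy middle group, trailing class as a real group.
    static final Pattern FIXED = Pattern.compile("^([.(&)]*)(.*?)([.(&/)]*$)");

    // Returns groups 1..3 as an array, or null if the pattern does not match.
    static String[] split(String input) {
        Matcher m = FIXED.matcher(input);
        if (!m.find()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2), m.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = split(".(&&()..ABC----....D25..../)(&");
        System.out.println(parts[0]); // prints: .(&&()..
        System.out.println(parts[1]); // prints: ABC----....D25
        System.out.println(parts[2]); // prints: ..../)(&
    }
}
```

Only parts[1] is needed for further processing; the other two groups are kept here just to show that no character is lost.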