nablex - 3 years ago
Java Question

String.replaceAll() anomaly with greedy quantifiers in regex

Can anyone tell me why

System.out.println("test".replaceAll(".*", "a"));

Results in

aa

Note that the following has the same result:

System.out.println("test".replaceAll(".*$", "a"));

I have tested this on Java 6 and 7, and both behave the same way.
Am I missing something or is this a bug in the java regex engine?

fge fge
Answer Source

This is not an anomaly: .* can match anything.

You ask to replace all occurrences:

  • the first match consumes the whole string, so the regex engine starts looking for the next match at the end of the input;
  • but .* also matches an empty string! It therefore matches an empty string at the end of the input and replaces it with a, which gives aa.
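To see the two matches described above, one option is to drive the Matcher by hand and print where each match starts and ends. This is only a small sketch; the class name MatchTrace and the helper matches() are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchTrace {
    // Hypothetical helper: collects every match of regex in input,
    // formatted as [start,end) "text".
    static List<String> matches(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add("[" + m.start() + "," + m.end() + ") \"" + m.group() + "\"");
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints two lines: the whole-string match, then the empty match at the end.
        for (String match : matches(".*", "test")) {
            System.out.println(match);
        }
    }
}
```

For "test" this lists [0,4) "test" followed by [4,4) "", which are the two occurrences that replaceAll() replaces.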

Using .+ instead will not exhibit this problem since this regex cannot match an empty string (it requires at least one character to match).

Or, use .replaceFirst() to only replace the first occurrence:

"test".replaceFirst(".*", "a")
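A quick side-by-side of the three variants discussed so far, as a sketch (the class name ReplaceComparison and the helper results() are hypothetical):

```java
public class ReplaceComparison {
    // Hypothetical helper: returns the results of the three variants, in order:
    // replaceAll(".*"), replaceAll(".+"), replaceFirst(".*").
    static String[] results(String input) {
        return new String[] {
            input.replaceAll(".*", "a"),   // also replaces the trailing empty match
            input.replaceAll(".+", "a"),   // needs at least one character per match
            input.replaceFirst(".*", "a")  // stops after the first match
        };
    }

    public static void main(String[] args) {
        for (String result : results("test")) {
            System.out.println(result);
        }
    }
}
```

For "test" this prints aa, a and a. Note that the two fixes are not interchangeable on an empty input: "".replaceAll(".+", "a") leaves the string empty, while "".replaceFirst(".*", "a") yields a.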

Now, why .* behaves like it does and does not match more than twice (it theoretically could) is an interesting thing to consider. See below:

# Before first run
regex: |.*
input: |whatever
# After first run
regex: .*|
input: whatever|
# Before second run
regex: |.*
input: whatever|
# After second run: since .* can match an empty string, it is satisfied...
regex: .*|
input: whatever|
# However, this means the regex engine matched an empty input.
# Java's regex engine (like many others), in this situation, will shift
# one character further in the input.
# So, before third run, the situation is:
regex: |.*
input: whatever<|ExhaustionOfInput>
# Nothing can ever match here: out
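The trace above can be imitated with a hand-written replace loop. This is a simplified sketch of the "shift one character after an empty match" rule, not the actual JDK implementation: the names BumpAlongDemo and replaceAllByHand are invented, the replacement is treated as literal text (no $1 group references), and Matcher.find(int) resets the matcher on each call, which is fine for a simple pattern like .* but would matter for patterns using \G.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BumpAlongDemo {
    // Hypothetical re-implementation of replaceAll for illustration only.
    static String replaceAllByHand(String input, String regex, String replacement) {
        Matcher m = Pattern.compile(regex).matcher(input);
        StringBuilder out = new StringBuilder();
        int copiedUpTo = 0;  // input before this index is already in out
        int searchFrom = 0;  // where the next match attempt starts
        // Once searchFrom moves past the end of the input, nothing can match.
        while (searchFrom <= input.length() && m.find(searchFrom)) {
            out.append(input, copiedUpTo, m.start()).append(replacement);
            copiedUpTo = m.end();
            // After an empty match, shift one character further so the
            // same empty match is not found again.
            searchFrom = (m.start() == m.end()) ? m.end() + 1 : m.end();
        }
        out.append(input, copiedUpTo, input.length());
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(replaceAllByHand("whatever", ".*", "a"));
    }
}
```

With "whatever" and .* the loop runs twice (the full match, then the empty match at the end) and the third attempt is rejected because the search position is past the end of the input, so the result is aa.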

Note that, as @A.H. notes in the comments, not all regex engines behave this way. GNU sed for instance will consider that it has exhausted the input after the first match.
