Bharath Bharath - 5 months ago 43x
Java Question

Why does String.replaceAll() in Java require 4 backslashes "\\\\" in the regex to actually replace "\"?

I recently noticed that String.replaceAll(regex, replacement) behaves very weirdly when it comes to the escape character "\" (backslash).

For example, consider a string holding a file path:

String text = "E:\\dummypath";

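and we want to replace the "\" with "/".

text.replace("\\","/") gives the output E:/dummypath, whereas text.replaceAll("\\","/") raises the exception java.util.regex.PatternSyntaxException.

If we want to implement the same functionality with replaceAll(), we need to write it as text.replaceAll("\\\\","/").

One notable difference is that replaceAll() takes its first argument as a regex, whereas replace() takes its arguments as character sequences!

Yet text.replaceAll("\n","/") works exactly the same as its char-sequence equivalent text.replace("\n","/").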

Digging Deeper:
Even more weird behaviors can be observed when we try some other inputs.

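Let's assign text a string that contains newline characters.

Then text.replaceAll("\n","/"), text.replaceAll("\\n","/"), and text.replaceAll("\\\n","/") all three give the same output.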

I feel Java has really messed up regex in the best possible way! No other language seems to have these playful behaviors in regex. Is there any specific reason why Java behaves like this?


@Peter Lawrey's answer describes the mechanics. Basically, the "problem" is that backslash is an escape character both in Java string literals and in the mini-language of regexes. So when you use a string literal to represent a regex, there are two sets of escaping to consider, depending on what you want the regex to mean.
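The two layers of escaping can be seen in a minimal sketch like the one below. The class and method names (BackslashLayers, toForwardSlashes, isInvalidRegex) are made up for illustration; only the standard String and java.util.regex calls are real API. The sample path is the one from the question.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical demo class; the names are illustrative only.
class BackslashLayers {

    // Replaces every backslash with a forward slash via a regex.
    static String toForwardSlashes(String path) {
        // Layer 1 (compiler): the literal "\\\\" becomes the 2-char string \\
        // Layer 2 (regex):    the pattern \\ matches one literal backslash
        return path.replaceAll("\\\\", "/");
    }

    // Returns true if the regex compiler rejects the given string.
    static boolean isInvalidRegex(String regex) {
        try {
            Pattern.compile(regex);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        String text = "E:\\dummypath"; // actual characters: E:\dummypath

        // The four-backslash literal is only two characters long at runtime.
        System.out.println("\\\\".length());          // 2

        // A lone backslash is an incomplete escape in the regex language.
        System.out.println(isInvalidRegex("\\"));      // true

        System.out.println(toForwardSlashes(text));    // E:/dummypath

        // replace() takes plain character sequences, so one layer suffices.
        System.out.println(text.replace("\\", "/"));   // E:/dummypath

        // Pattern.quote() removes the regex layer of escaping.
        System.out.println(text.replaceAll(Pattern.quote("\\"), "/")); // E:/dummypath
    }
}
```

If the regex layer is not needed at all, replace() or Pattern.quote() is probably the simpler choice, since only the string-literal escaping remains.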

But why is it like that?

Basically, it is a historical thing. Java originally didn't have regexes at all. The syntax rules for Java String literals were borrowed from C / C++, which also didn't have built-in regex support. The awkwardness of double escaping didn't become apparent in Java until regex support was added in the form of the Pattern class in Java 1.4.

So how do other languages manage to avoid this?

Basically, they do it by providing direct or indirect syntactic support for regexes. For instance, in Perl, Ruby, JavaScript and many other languages, there is a syntax for patterns / regexes (e.g. '/pattern/') where string literal escaping rules do not apply. C# and Python provide an alternative "raw" string literal syntax with similar properties. (But note that if you use the normal C# / Python string syntax, you have the Java problem of double escaping.)

Why do text.replaceAll("\n","/"), text.replaceAll("\\n","/"), and text.replaceAll("\\\n","/") all give the same output?

The first case is a newline character at the String level. The Java regex language treats all non-special characters as matching themselves.

The second case is a backslash followed by an "n" at the String level. The Java regex language interprets a backslash followed by an "n" as a newline.

The final case is a backslash followed by a newline character at the String level. The Java regex language doesn't recognize this as a specific (regex) escape sequence. However, a backslash followed by any non-alphabetic character stands for that character itself. So a backslash followed by a newline character means the same thing as a newline.
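The three cases above can be checked with a short sketch. The class name NewlinePatterns, the helper slashify, and the sample value of text are assumptions chosen for illustration; any string containing newlines should behave the same way.

```java
// Hypothetical demo class; names and sample text are illustrative only.
class NewlinePatterns {

    // Replaces every match of the given regex in the input with "/".
    static String slashify(String input, String regex) {
        return input.replaceAll(regex, "/");
    }

    public static void main(String[] args) {
        String text = "Hello\nWorld\n"; // assumed sample input

        String rawNewline = "\n";         // 1 char: a newline
        String escapedN = "\\n";          // 2 chars: backslash, 'n'
        String backslashNewline = "\\\n"; // 2 chars: backslash, newline

        // Case 1: the regex is a literal newline, which matches itself.
        System.out.println(slashify(text, rawNewline));       // Hello/World/

        // Case 2: the regex engine reads \n as the newline escape.
        System.out.println(slashify(text, escapedN));         // Hello/World/

        // Case 3: a backslash before a non-alphabetic character
        // simply denotes that character, here the newline.
        System.out.println(slashify(text, backslashNewline)); // Hello/World/
    }
}
```

The strings differ at the String level (one character versus two), yet all three compile to a pattern that matches a single newline.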