PatrickD - 4 months ago
Java Question

Java is ignoring regex to remove duplicate lines using BlueJ

Really green here. I am trying to get a regex that works in Notepad++ to run in Java using BlueJ, but Java seems to be ignoring it. I have other replaceAll calls with regular expressions, and all of those are working.

I have this, but it is telling me the \s is an illegal escape character:

itemList[i] = itemList[i].replaceAll("^(\s*\r\n){2,}", "\r\n");


I read about the Java engine and changed the \s to \\s so it wasn't illegal:

itemList[i] = itemList[i].replaceAll("^(\\s*\r\n){2,}", "\r\n");


I also tried using [[:space:]] instead, but it still doesn't do the replacement:

itemList[i] = itemList[i].replaceAll("^([[:space:]]*\r\n){2,}", "\r\n");


This Java tool is processing hundreds of lines, and people are having issues using Notepad++ to remove the duplicate lines. I thought maybe doing it in the formatting tool would eliminate the issues. Here is an example of the text:

1. Modification: No Error Message When SQL Server Down

S9# 395


Summary

No error message when the SQL Server is
down.

Workaround

There is currently no
workaround for this issue. The system will become
unusable if SQL server is down.

Answer

You need to use multiline mode so that ^ can match at the beginning of any line; otherwise it only matches at the beginning of the whole string. Multiline mode is the default in most text editors, but when you use regexes anywhere else, you have to specify it yourself. Just add (?m) to the beginning of the regex:

(?m)^(\\s*\r\n){2,}
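To see the difference the flag makes, here is a minimal sketch. The class name BlankLineDemo, the helper collapseBlankLines, and the sample string are made up for illustration; only the two patterns come from the question and the line above.

```java
public class BlankLineDemo {
    // Hypothetical helper: collapses runs of two or more blank CRLF lines
    // into a single blank line, using the (?m) version of the pattern.
    static String collapseBlankLines(String text) {
        return text.replaceAll("(?m)^(\\s*\r\n){2,}", "\r\n");
    }

    public static void main(String[] args) {
        // Two blank lines between "S9# 395" and "Summary", CRLF line endings.
        String input = "S9# 395\r\n\r\n\r\nSummary\r\n";

        // Without (?m), ^ can only match at index 0, so nothing is replaced.
        String withoutFlag = input.replaceAll("^(\\s*\r\n){2,}", "\r\n");
        System.out.println(withoutFlag.equals(input)); // prints true

        // With (?m), ^ also matches after each line break, so the run collapses.
        String withFlag = collapseBlankLines(input);
        System.out.println(withFlag.equals("S9# 395\r\n\r\nSummary\r\n")); // prints true
    }
}
```

A single blank line between two paragraphs is left alone, because {2,} needs at least two blank lines in a row before it replaces anything.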

If you're running Java 8 or later, I recommend doing this instead:

replaceAll("(?m)^(?:\\h*(\\R)){2,}", "$1")

\s* is ambiguous, because it can match newlines as well as spaces; \h only matches horizontal whitespace (e.g., spaces and tabs).

\R matches any kind of newline: \r\n, \n, \r, or several other, less common ones. The inner group, (\R), captures the last of the redundant newlines, and "$1" plugs it back in. This way, you don't get any nasty surprises if someone changes the newline format of your documents.
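As a rough check that this version does not care about the newline format, the sketch below runs it on a CRLF sample and an LF-only sample. The class name, helper name, and sample strings are assumptions for illustration; the pattern and replacement are the ones shown above.

```java
public class NewlineAgnosticDemo {
    // Hypothetical helper wrapping the \h / \R pattern (needs Java 8 or later).
    static String collapseBlankLines(String text) {
        return text.replaceAll("(?m)^(?:\\h*(\\R)){2,}", "$1");
    }

    public static void main(String[] args) {
        // CRLF text: one "blank" line holding a space and a tab, then an empty line.
        String windows = "Summary\r\n \t\r\n\r\nNo error message\r\n";
        // LF-only text with three empty lines in a row.
        String unix = "Summary\n\n\n\nNo error message\n";

        // Each run collapses to one blank line, keeping its own newline style.
        System.out.println(collapseBlankLines(windows)
                .equals("Summary\r\n\r\nNo error message\r\n")); // prints true
        System.out.println(collapseBlankLines(unix)
                .equals("Summary\n\nNo error message\n")); // prints true
    }
}
```

Because the replacement is "$1" rather than a hard-coded "\r\n", the blank line that remains uses whatever line break the input already had.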