Jsmith - 2 months ago
Java Question

Java replaceAll: do not replace a specific string

I am parsing through some XML and sanitizing some fields.

I'm trying to do the following in Java:

nameField = nameField.replaceAll("[^a-zA-Z\\d\\s\\.,'&amp;]", "");


I do not want to replace any letter of the alphabet, any digit, any whitespace, any period, any comma, any single quote, or (this is where my issue is) the literal string &amp;.

But I do want to replace occurrences of a single & or a single ; that is not part of that string.

But obviously my regex as it stands won't work. It will leave in every & and every ;.

For example, if the string K&W@#9$9(AR;.0 O&amp; is found, my expected result would be KW99AR.0 O&amp;.

How can I achieve this?

Answer

Why don't you simplify your regular expression and just go with a lookahead/lookbehind:

//                  |"&" not followed by "amp;"
//                  |          | or
//                  |          | ";" not preceded by "&amp"
nameField.replaceAll("&(?!amp;)|(?<!&amp);", "");

The output for "K&W@#9$9(AR;.0 O&amp;" would be:

KW@#9$9(AR.0 O&amp;

Edit

Then, you can chain this with a cleanup that leaves only your desired characters. Here, I added ; and & to the negated character class so that they are kept: any that remain after the previous operation belong to &amp;, since the standalone ones have already been removed.

Also, you don't need to escape the dot in a custom character class.

.replaceAll("[^a-zA-Z\\d\\s.,';&]", "");

The two chained invocations will return:

KW99AR.0 O&amp;
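
For reference, here is a minimal, self-contained sketch of the two chained calls. The class name NameSanitizer and the method name sanitize are made up for illustration. The second pattern also keeps the single quote, because the question asks to preserve it.

```java
public class NameSanitizer {

    // Step 1: match a "&" that does not start "&amp;",
    // or a ";" that does not end "&amp;".
    private static final String STRAY_ENTITY_CHARS = "&(?!amp;)|(?<!&amp);";

    // Step 2: match anything outside the allowed set. "&" and ";" are allowed
    // here because any that survive step 1 belong to "&amp;".
    private static final String DISALLOWED = "[^a-zA-Z\\d\\s.,';&]";

    // Hypothetical helper: applies both replacements in order.
    public static String sanitize(String nameField) {
        return nameField
                .replaceAll(STRAY_ENTITY_CHARS, "")
                .replaceAll(DISALLOWED, "");
    }

    public static void main(String[] args) {
        System.out.println(sanitize("K&W@#9$9(AR;.0 O&amp;")); // KW99AR.0 O&amp;
    }
}
```

The order matters: if the cleanup ran first with & and ; in the allowed set, nothing would change, but running it first without them would destroy the &amp; you want to keep.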

Notes

  • As mentioned by Tushar, a custom character class does not treat its contents as a sequence. Each character in it is an individual alternative, so &amp; inside the brackets means "&, a, m, p or ;".
  • General rule of thumb: be careful about using regex to parse markup. You may very well end up with a bigger mess. Regular expressions are not made to parse markup or languages with a grammar.
  • Your specific case is safe enough, but remember there are other XML entities such as &gt;, &lt; etc.
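
If the other entities matter to you, the same lookahead/lookbehind idea can be extended with an alternation. The following is only a sketch: the class and method names are made up, and it covers just the five entities predefined by XML (amp, lt, gt, quot, apos), not numeric character references such as &#38; or entities declared in a DTD.

```java
public class EntityAwareSanitizer {

    // The five entity names predefined by XML.
    private static final String NAMES = "(?:amp|lt|gt|quot|apos)";

    // A "&" not followed by "<name>;", or a ";" not preceded by "&<name>".
    // The lookbehind alternatives have bounded length, which Java requires.
    private static final String STRAY =
            "&(?!" + NAMES + ";)|(?<!&" + NAMES + ");";

    // Hypothetical helper: removes "&" and ";" that are not part of
    // one of the predefined entities.
    public static String removeStray(String input) {
        return input.replaceAll(STRAY, "");
    }

    public static void main(String[] args) {
        System.out.println(removeStray("a &lt; b & c; d &gt; e &amp; f"));
    }
}
```

You would still chain the character-class cleanup after this call, as in the answer above.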