Arturas M - 7 months ago
Java Question

Replacing with this pattern doesn't work as I would expect it to, what's wrong?

I need help on extracting some words from this sentence:

String keywords = "I like to find something vicious in somewhere bla bla bla.\r\n" +
"https://address.suffix.com/level/somelongurlstuff";


And my matching code looks somewhat like this:

keywords = keywords.toLowerCase();
String regex = "(I like to find )(.*)( in )(.*)(\\.){1}(.*)";
regex = regex.toLowerCase();
keywords = keywords.replaceAll(regex, "$4 $2");


I want to extract the words between "find" and "in", and the words between "in" and the first dot. However, as the URL has multiple dots, some weird stuff starts happening and I get what I need PLUS the URL with its dots replaced by spaces. I want the URL to be gone, because it is supposed to be matched by the final (.*) in my pattern, and I only need one dot after my words, which is what (\\.){1} is for. So what is going wrong there? Any ideas?

Adding (?s), or removing all newline characters from the string before matching, gives something like:
somewhere bla bla bla address suffix something vicious
so the problem persists: the URL, minus its dots, is still left in the result.

This is NOT just about matching multiline text.

Answer

You need to fix two things: 1) add the DOTALL modifier (?s), since your text spans multiple lines, and 2) use lazy dot matching or, more efficiently, a negated character class [^.] to match characters up to the first . after " in ":

(?s)(I like to find )(.*)( in )([^.]*)(\.)(.*)
                               ^^^^^^^

See the regex demo
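As a minimal sketch of how this pattern could be plugged into the code from the question: the class name KeywordExtractor and the method name extract are made up for illustration, and Locale.ROOT is an assumption added so that lowercasing does not depend on the default locale. The pattern is written in lowercase because the input is lowercased first, as in the question.

```java
import java.util.Locale;

// Hypothetical wrapper class; the names are illustrative and not from the question.
class KeywordExtractor {
    // (?s) lets '.' match line breaks; [^.]* stops group 4 at the first dot after " in ".
    static final String REGEX = "(?s)(i like to find )(.*)( in )([^.]*)(\\.)(.*)";

    static String extract(String text) {
        // Lowercase the input as the question does, then keep only groups 4 and 2.
        return text.toLowerCase(Locale.ROOT).replaceAll(REGEX, "$4 $2");
    }

    public static void main(String[] args) {
        String keywords = "I like to find something vicious in somewhere bla bla bla.\r\n"
                + "https://address.suffix.com/level/somelongurlstuff";
        System.out.println(extract(keywords)); // somewhere bla bla bla something vicious
    }
}
```

Since the final (.*) now consumes the line break and the whole URL, nothing of the URL is left over after the replacement.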

However, the best option would be to also make the first .* lazy:

(?s)(I like to find )(.*?)( in )([^.]*)(\.)(.*)

The reluctant (lazy) quantifier makes the engine match as few characters as possible before the next subpattern can match. If we use a greedy .* before ( in ), the regex engine first grabs the whole string after "I like to find " and then backtracks, moving backwards until it finds the last " in ". Using .*? instead matches only up to the first " in ".
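The difference only shows up when " in " occurs more than once, which is not the case in the question's input. Here is a small sketch with a made-up sentence (the class name GreedyVsLazy, the helper groups and the sample text are assumptions for illustration only):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the two patterns from this answer.
class GreedyVsLazy {
    static final Pattern GREEDY =
            Pattern.compile("(?s)(I like to find )(.*)( in )([^.]*)(\\.)(.*)");
    static final Pattern LAZY =
            Pattern.compile("(?s)(I like to find )(.*?)( in )([^.]*)(\\.)(.*)");

    // Returns "group2|group4", or null when the whole text does not match.
    static String groups(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.matches() ? m.group(2) + "|" + m.group(4) : null;
    }

    public static void main(String[] args) {
        // Made-up input with two occurrences of " in ".
        String s = "I like to find bugs in code in production. Done";
        System.out.println(groups(GREEDY, s)); // bugs in code|production
        System.out.println(groups(LAZY, s));   // bugs|code in production
    }
}
```

The greedy version splits at the last " in ", the lazy one at the first.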

Instead of [^.]* you can use a . with the reluctant quantifier *? to match up to the first dot, but this is costlier in terms of performance, since the engine expands the subpattern one character at a time, each time the subsequent subpatterns fail to match.
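To illustrate that both variants capture the same text on the question's input, here is a rough sketch (the class name FirstDotVariants and the helper group4 are hypothetical; it only compares the captured group and does not measure performance):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: negated character class vs. lazy dot for group 4.
class FirstDotVariants {
    static final Pattern NEGATED =
            Pattern.compile("(?s)(I like to find )(.*?)( in )([^.]*)(\\.)(.*)");
    static final Pattern LAZY_DOT =
            Pattern.compile("(?s)(I like to find )(.*?)( in )(.*?)(\\.)(.*)");

    // Returns the text captured by group 4, or null when the text does not match.
    static String group4(Pattern p, String text) {
        Matcher m = p.matcher(text);
        return m.matches() ? m.group(4) : null;
    }

    public static void main(String[] args) {
        String s = "I like to find something vicious in somewhere bla bla bla.\r\n"
                + "https://address.suffix.com/level/somelongurlstuff";
        System.out.println(group4(NEGATED, s));  // somewhere bla bla bla
        System.out.println(group4(LAZY_DOT, s)); // somewhere bla bla bla
    }
}
```

Both stop at the first dot here because the trailing (.*) can always match the rest of the input under (?s), so the lazy dot never has to expand past that dot.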

Check my answer to "Perl regex matching optional phrase in longer sentence" to understand how greedy and lazy (= reluctant) quantifiers work.