user3018793 - 1 month ago
Java Question

Java match the same group multiple times

I need to match the same pattern multiple times in a sequence of characters.

E.g., for the input

Some words <firstMatch> some words <secondMatch> some more words <ThirdMatch>

I would need <firstMatch>, <secondMatch>, <ThirdMatch>


I have tried something like this:

String input = "Some words <firstMatch> some words <secondMatch> some more words <ThirdMatch>";
Pattern pattern = Pattern.compile( ".*(\\<.*\\>).*" );
Matcher m = pattern.matcher( input );
while ( m.find() ) {
   System.out.println( m.group( 1 ) );
}


All I get is
<ThirdMatch>


Any help?

Answer

Why does your pattern fail?

.*(\\<.*\\>).* involves a lot of backtracking. First, .* matches any 0+ chars other than line break characters, which is basically the whole line. Then the regex engine backtracks to accommodate the subsequent pattern, (<.*>).*. When it finds the < (the first one from the end), the inner .* again grabs the rest of the line and backtracks until it finds the >. Once that is found, the last .* just matches the rest of the line. Since this single match consumes the whole line, the next m.find() call has nothing left to match, so you only get the last <...> substring. Note that if the engine fails to find a > after that <, backtracking resumes and the search repeats from an earlier <, which makes this pattern rather inefficient.

Note: < and > do not have to be escaped in a Java regex pattern, as they are not special regex metacharacters.
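To see the behavior described above, here is a minimal sketch that runs the original greedy pattern (written without the optional escapes) and collects group 1 from every match that find() reports. The class name GreedyPatternDemo and the helper greedyMatches are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the original question.
class GreedyPatternDemo {

    // Collects group 1 of every match found by the greedy pattern.
    static List<String> greedyMatches(String input) {
        Matcher m = Pattern.compile(".*(<.*>).*").matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            // The first match already spans the whole line,
            // so this loop body runs only once per line.
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "Some words <firstMatch> some words <secondMatch> some more words <ThirdMatch>";
        System.out.println(greedyMatches(input)); // prints [<ThirdMatch>]
    }
}
```

The list holds a single element because the leading .* swallows everything before the last <, which is why the earlier tags never show up.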

Solution

Use a simpler "<[^>]*>" pattern based on a negated character class:

String input = "Some words <firstMatch> some words <secondMatch> some more words <ThirdMatch>";
Pattern pattern = Pattern.compile( "<[^>]*>" );
Matcher m = pattern.matcher( input );
while ( m.find() ) {
   System.out.println( m.group(0) ); // = m.group(), the whole match value
}


The <[^>]*> pattern matches <, then 0+ chars other than >, and then >. Since you are using Matcher#find() in a while loop, you will find all non-overlapping matches in the input string, but you need to access .group(0) (equal to .group(), the whole match value), not .group(1), because this pattern has no capturing group.
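If it helps, the same approach can be wrapped into a self-contained class that collects the matches into a list. This is only a sketch: the class name TagMatchDemo and both helper methods are invented here, and the capturing group in <([^>]*)> is an addition of mine (not part of the pattern above) to show when .group(1) becomes available, namely for the text between the angle brackets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class illustrating the negated character class approach.
class TagMatchDemo {

    // Assumption: a capturing group is added around [^>]* so group(1) exists.
    private static final Pattern TAG = Pattern.compile("<([^>]*)>");

    // Returns every whole match, angle brackets included (group 0).
    static List<String> wholeMatches(String input) {
        Matcher m = TAG.matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group()); // same as m.group(0)
        }
        return found;
    }

    // Returns only the text between < and > (group 1).
    static List<String> innerTexts(String input) {
        Matcher m = TAG.matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "Some words <firstMatch> some words <secondMatch> some more words <ThirdMatch>";
        System.out.println(wholeMatches(input)); // prints [<firstMatch>, <secondMatch>, <ThirdMatch>]
        System.out.println(innerTexts(input));   // prints [firstMatch, secondMatch, ThirdMatch]
    }
}
```

Without the added parentheses, calling m.group(1) would throw an IndexOutOfBoundsException, since group 1 would not exist.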