MGorgon MGorgon - 1 year ago 66
Java Question

Regex reused word boundary from first match in second match - why?

Given string:

String str = "STACK 2013 OVERFLOW3";

And pattern:

Pattern pattern = Pattern.compile("\\b\\w+\\s\\b");

The output is:


Why? I read that once a character is used in a match, it can't be used again in the next match.

But here, for the first match, we have:

\b used for boundary (before word STACK)

\w+ used for STACK word

\s used for space after STACK

\b used for boundary (before word 2013)

This results, as expected, in match "STACK ".

And then we have second match for "\b\w+\s\b":

\b used for boundary (before word 2013) <--- HERE this boundary is used second time

\w+ used for 2013 word

\s used for space after 2013

\b used for boundary (before word OVERFLOW3)

Why is the word boundary before the word "2013" used twice in these matches?

Full code to reproduce:

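// Requires: import java.util.regex.Matcher; import java.util.regex.Pattern;
public static void main(String[] args) {
    String str = "STACK 2013 OVERFLOW3";
    Pattern pattern = Pattern.compile("\\b\\w+\\s\\b");
    Matcher matcher = pattern.matcher(str);
    while (matcher.find()) {
        // The loop body is cut off in the source; printing each match is an assumption.
        System.out.println(matcher.group());
    }
}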

Answer Source

First of all, there are some good examples and descriptions of a word boundary in this SO post. Exact positions where a word boundary matches are outlined there, and in this regex tutorial.

However, your question is why \b matches at the same location twice.

The answer is that a word boundary is a non-consuming pattern: it does not add any text to the match, and it does not advance the regex index past the place where it matched. It only asserts whether something is present before or after the current position. In other words, it is a zero-width assertion (as already mentioned by Sebastian Proske).

Non-consuming patterns include lookarounds, anchors, and word boundaries.
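A minimal sketch of what "non-consuming" means in practice: for an anchor, a lookahead, and a word boundary, the first match is empty (start equals end), while a consuming pattern such as \w+ yields a non-empty match. The class and method names here are hypothetical, chosen only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ZeroWidthDemo {
    // Returns true if the first match of regex in input is empty,
    // i.e. the match starts and ends at the same index.
    static boolean firstMatchIsEmpty(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() && m.start() == m.end();
    }

    public static void main(String[] args) {
        String str = "STACK 2013 OVERFLOW3";
        // An anchor, a lookahead, a word boundary, and a consuming pattern.
        String[] regexes = {"^", "(?=2013)", "\\b", "\\w+"};
        for (String regex : regexes) {
            System.out.println(regex + " -> " + firstMatchIsEmpty(regex, str));
        }
        // prints:
        // ^ -> true
        // (?=2013) -> true
        // \b -> true
        // \w+ -> false
    }
}
```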

So, what happens when your regex reaches the end of "STACK "? The trailing \b matches at the position before 2013, and because \b consumes nothing, the regex index stays there, before 2013. The first match is returned, and the search for the next match starts at that same position. The leading \b in the pattern then succeeds, because the position before 2013 is a word boundary (after a non-word character and before a word character).
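The index behaviour described above can be checked by printing the start and end offsets of each match. In this sketch (the class and helper names are hypothetical), the first match ends at index 6 and the second one starts at index 6, so no character is shared between the two matches; only the zero-width boundary position is.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatchSpans {
    // Returns one "start-end:[text]" entry per match of regex in input.
    static List<String> spans(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start() + "-" + m.end() + ":[" + m.group() + "]");
        }
        return out;
    }

    public static void main(String[] args) {
        for (String span : spans("\\b\\w+\\s\\b", "STACK 2013 OVERFLOW3")) {
            System.out.println(span);
        }
        // prints:
        // 0-6:[STACK ]
        // 6-11:[2013 ]
    }
}
```

The end offset is exclusive, so the character ranges 0..5 and 6..10 do not overlap. OVERFLOW3 is not matched because no whitespace follows it.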

That \b is a zero-width assertion can also be illustrated by wrapping it in a lookaround: a positive lookbehind or a positive lookahead around it changes nothing, so \b = (?<=\b) = (?=\b). All three match at the same positions.
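As a rough check of the lookaround equivalence, the sketch below collects every position at which a pattern matches in the sample string. A bare \b, a positive lookbehind (?<=\b), and a positive lookahead (?=\b) should produce the same list, whereas a negative lookahead (?!\b) behaves like \B, the non-boundary assertion. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryPositions {
    // Collects the start index of every match of regex in input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.start());
        }
        return out;
    }

    public static void main(String[] args) {
        String str = "STACK 2013 OVERFLOW3";
        // Each of these three lines prints [0, 5, 6, 10, 11, 20].
        System.out.println(positions("\\b", str));
        System.out.println(positions("(?<=\\b)", str));
        System.out.println(positions("(?=\\b)", str));
        // A negative lookahead around \b matches where \B does; prints true.
        System.out.println(positions("(?!\\b)", str).equals(positions("\\B", str)));
    }
}
```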
