JensD - 6 months ago
Java Question

Understanding \b (word boundary) in regex

I am struggling to understand word boundary \b in regex.
I read that there are three conditions for \b.


  • Before the first character in the string, if the first character is a
    word character.

  • After the last character in the string, if the last character is a
    word character.

  • Between two characters in the string, where one is a word character
    and the other is not a word character.



I am trying to find the start index of each match using the Java method start():

import java.util.regex.*;

class Quetico {
    public static void main(String[] args) {
        // args[0] is the regex, args[1] is the string to search
        Pattern p = Pattern.compile(args[0]);
        Matcher m = p.matcher(args[1]);
        System.out.print("match positions: ");
        // find() advances to the next match; start() returns its start index
        while (m.find()) {
            System.out.print(m.start() + " ");
        }
        System.out.println();
    }
}


% java Quetico "\b" "^23 *$76 bc"

//string: ^23 *$76 bc pattern:\b
//index : 01234567890


produces: 1 3 5 6 7 9

I'm having trouble understanding why it produces this result, because I can't see the pattern. I've tried looking at the inverse, \B, which produces 0 2 4 8, but that doesn't make it any clearer for me. If you can help clarify this, it would be appreciated.

ajb
Answer

The issue isn't Java here, it's the Linux/Unix shell. When you put text between double quote marks on the command line, most of the special shell characters, such as * and ?, are no longer special, except for variable interpolation with $. (And some other things, like !, depending on which shell you're using.) Thus, if you say

% command "this $variable is interesting"

and variable is set to value, your command will be called with one argument, this value is interesting. In your case, the shell treats $7 as a positional parameter (the seventh argument of a shell script), even though you're not in a shell script; since it isn't set to anything, it's replaced with an empty string, and the result is the same as if you had run

% java Quetico "\b" "^23 *6 bc"

which gives me 1 3 5 6 7 9 if I use that string literal in a Java program (instead of on the command line).
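
For reference, here is a minimal sketch of that check, with the string the program actually received hard-coded as a Java literal so the shell is out of the picture. The class name BoundaryPositions and the helper positions() are made up for this illustration; only Pattern and Matcher come from java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryPositions {
    // Collects the start index of every match of regex in input.
    static List<Integer> positions(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        // The string as the program saw it, after the shell replaced "$7" with nothing.
        String received = "^23 *6 bc";
        System.out.println("\\b: " + positions("\\b", received)); // [1, 3, 5, 6, 7, 9]
        System.out.println("\\B: " + positions("\\B", received)); // [0, 2, 4, 8]
    }
}
```

Both lists match the output in the question, which is consistent with the program having received the shorter string.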

To prevent $ from being interpreted by the shell, you need to use single quote marks:

% java Quetico "\b" '^23 *$76 bc'
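
Once the string arrives intact, the three rules from the question can be checked by hand. The sketch below is only an illustration: the class BoundaryRules and its methods are hypothetical names, and isWord() assumes ASCII input, where a word character is a letter, a digit, or an underscore. It applies the rules directly and compares the result with what \b reports.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryRules {
    // Assumption for this sketch: ASCII input, so a word character is a
    // letter, a digit, or an underscore.
    static boolean isWord(char c) {
        return Character.isLetterOrDigit(c) || c == '_';
    }

    // Position i is a boundary when exactly one of its two neighbors is a
    // word character. A missing neighbor (before the first or after the
    // last character) counts as a non-word character.
    static List<Integer> byHand(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i <= s.length(); i++) {
            boolean left = i > 0 && isWord(s.charAt(i - 1));
            boolean right = i < s.length() && isWord(s.charAt(i));
            if (left != right) {
                result.add(i);
            }
        }
        return result;
    }

    // The same positions as reported by the regex engine.
    static List<Integer> byRegex(String s) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile("\\b").matcher(s);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        String intended = "^23 *$76 bc"; // what the single-quoted argument delivers
        System.out.println("by hand:  " + byHand(intended));  // [1, 3, 6, 8, 9, 11]
        System.out.println("by regex: " + byRegex(intended)); // [1, 3, 6, 8, 9, 11]
    }
}
```

Reading the result against the string: 1 and 3 surround 23, 6 and 8 surround 76, and 9 and 11 surround bc. Position 11 is the end of the string, which is the second rule.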