Z101 - 1 month ago
Java Question

Java regex line.split("\\s*//")

I came across the following string split, line.split("\\s*//")[0], but can't seem to find documentation on the use of the '/' character in regular expressions.

Here is my code:

String line = "type=rule.composition id=ruleComp";
line = line.split("\\s*//")[0];

Console console = System.console();
System.out.println("This is the line: " + line);


Here is the output:

This is the line: type=rule.composition id=ruleComp


I am wondering what exactly '/' does in the regular expression. Would anybody be able to point me to some documentation and/or an answer explaining what it does?

I also noticed that when I remove the '//' from the regex, the output changes to merely the first character, which I suppose makes sense given that \s* means that the expression splits on zero or more whitespace characters.

This is the line: t


This however raises the question: "what does the '//' add to the regular expression that sees the split occur at the end of the line"?

Any advice would be highly appreciated.

Z

fge
Answer

Consider your input text (type=rule.composition id=ruleComp), and your two regexes:

  • regex 1: \s*//;
  • regex 2: \s*.

When you call .split() with a regular expression, the regex engine tries to match that expression (compiled from the string literal passed as the argument) against the input, and one of two things can happen:

  • the regex cannot match anything (this is what happens with regex 1: '/' has no special meaning in a Java regex, so // simply matches two literal slashes, and the input contains none): the split has nothing to operate on and the 0th element is the whole input text;
  • the regex can match an empty string (this is what happens with regex 2): in this case, the regex engine notices this and cannot let the situation continue, since otherwise it would result in an endless loop. Therefore it forcibly advances by one character before proceeding.
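The two cases can be reproduced with a small sketch. The class and method names (SplitCases, firstPiece) are made up for illustration, and the second result assumes Java 8 or later, where a zero-width match at the very start of the input no longer produces a leading empty string.

```java
public class SplitCases {
    // Returns the 0th element of input.split(regex), as in the question.
    static String firstPiece(String input, String regex) {
        return input.split(regex)[0];
    }

    public static void main(String[] args) {
        String line = "type=rule.composition id=ruleComp";

        // Regex 1: needs two literal slashes, which the input does not contain,
        // so there is no match and the whole line comes back.
        System.out.println(firstPiece(line, "\\s*//"));

        // Regex 2: can match the empty string between any two characters,
        // so the line is cut after the first character (Java 8+ behaviour).
        System.out.println(firstPiece(line, "\\s*"));
    }
}
```

The first call should print the whole line and the second should print just t, matching the two outputs shown in the question.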

Hence your results:

  • with the first regex, nothing is matched;
  • with the second regex, an empty string is matched; the regex engine shifts by one character and treats the text it skipped over (the first character, t) as the 0th element of the result.
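Since '/' is an ordinary character in a Java regex, \s*// only ever matches optional whitespace followed by two literal slashes. The question does not say what the original expression was for; one plausible reading (an assumption on my part) is that it strips a trailing // comment from a line. A minimal sketch of that reading, with a hypothetical helper name:

```java
public class StripComment {
    // Hypothetical helper: keeps the text before the first "//",
    // dropping any whitespace that directly precedes the slashes.
    static String stripLineComment(String line) {
        return line.split("\\s*//")[0];
    }

    public static void main(String[] args) {
        // Line with a trailing comment: the comment and the spaces before it are cut off.
        System.out.println("[" + stripLineComment("type=rule.composition id=ruleComp   // a rule") + "]");

        // Line without "//": no match, so the line comes back unchanged.
        System.out.println("[" + stripLineComment("type=rule.composition id=ruleComp") + "]");
    }
}
```

Both calls should print the same bracketed text, which is why the split looked like a no-op on the input in the question.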