I am trying to learn regular expressions and got confused.
I saw the post "java split () method"
and have some questions regarding the 2nd answer, by Achintya Jha:
str2.split("");
[, 1, 2, 3]
""
(?!^)
a(?!b)
a
b
^
(?!^)
""
^
""
""
""
""
A split happens at every place that matches the regex passed as the argument. You need to know that when a split happens, ONE thing becomes TWO things. Always. There is no exception.
You may doubt it because, for instance, "abc".split("c") returns an array with one element, ["ab"], but that is because this version of split also automatically removes trailing empty strings from the array before returning it.
In other words, "abc".split("c") first creates the array ["ab",""] (yes, there is an empty string, which is the result of splitting "abc" on c), and then removes the trailing empty string, so it returns ["ab"].
Another example would be splitting "abc" on "a". Since a is present at the start, you will get ["", "bc"].
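The two examples above can be checked with a minimal sketch. The class and method names (SplitEdgesDemo, splitDefault, splitKeepingTrailing) are made up for illustration; the only library calls are String.split(regex) and String.split(regex, limit), where a negative limit keeps trailing empty strings.

```java
import java.util.Arrays;

class SplitEdgesDemo {
    // Default split (limit 0): trailing empty strings are removed.
    static String[] splitDefault(String input, String regex) {
        return input.split(regex);
    }

    // Negative limit: trailing empty strings are kept.
    static String[] splitKeepingTrailing(String input, String regex) {
        return input.split(regex, -1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitDefault("abc", "c")));         // [ab]
        System.out.println(Arrays.toString(splitKeepingTrailing("abc", "c"))); // [ab, ]
        System.out.println(Arrays.toString(splitDefault("abc", "a")));         // [, bc]
    }
}
```

The second line of output shows the empty string that the default version silently drops.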
But splitting on an empty string is a little bit more tricky, because there is an empty string before and after each character. I will mark them using a pipe |.
So the empty strings in "abc" can be found at these positions: "|a|b|c|", which means that when you split "abc" on "" the split method first creates the array ["", "a", "b", "c", ""] and then removes the trailing empty string.
That is why "abc".split("") returns the array ["", "a", "b", "c"] (this should answer your question 1).
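To see where the empty strings sit without depending on how a particular Java version's split treats them, one option is to list the match positions with Pattern and Matcher. This is only a sketch; EmptyMatchPositions and matchStarts are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmptyMatchPositions {
    // Collects the start index of every match of regex in input.
    static List<Integer> matchStarts(String regex, String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    public static void main(String[] args) {
        // The empty regex matches before 'a', between the letters, and after 'c'.
        System.out.println(matchStarts("", "abc")); // [0, 1, 2, 3]
    }
}
```

The four positions correspond to the four pipes in "|a|b|c|".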
But what if we want to prevent the first empty string (the one at the start) from being matched by the split method? In other words, what if we don't want to split on
"|a|b|c|"
but only on
"a|b|c|"
We can do it in a few ways. For instance, we can match only the empty strings that have some character before them: a|, b|, c|. To create such regexes we will need look-around mechanisms.
An empty string is represented by "", and the condition "has some character before it" by the positive look-behind (?<=.). If we combine these two points, "(?<=.)" and "", we get "(?<=.)"+"", which is simply "(?<=.)", so "abc".split("(?<=.)") should split only on those empty strings that are preceded by any character (represented in regex by the dot .).
To say that something can't be at the start of the string we can use negative look-behind (?<!...) and ^, which represents the start of the string. So (?<!^) represents the condition "has no start of the string before it". That is why (?<!^) can't match this empty string
 ↓
"|a|b|c|"
since it has the start of the string before it.
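Both look-behind variants can be tried side by side. The sketch below uses made-up names (LookBehindSplit, splitAfterAnyChar, splitNotAtStart) and assumes the input contains no line terminators, since the dot does not match them by default.

```java
import java.util.Arrays;

class LookBehindSplit {
    // Split on empty strings that are preceded by any character.
    static String[] splitAfterAnyChar(String input) {
        return input.split("(?<=.)");
    }

    // Split on empty strings that do not have the start of the string before them.
    static String[] splitNotAtStart(String input) {
        return input.split("(?<!^)");
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(splitAfterAnyChar("abc"))); // [a, b, c]
        System.out.println(Arrays.toString(splitNotAtStart("abc")));   // [a, b, c]
    }
}
```

In both cases the empty string after "c" is still matched, and the resulting trailing empty string is removed by split as usual.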
Actually, there is also one special case, which is the main point of your question: (?!^), which is a negative look-ahead. This regex describes an empty string that does not have the start of the string after it. It is kind of unintuitive, because previously we assumed that the start of the string (represented by ^) is placed here
 ↓
"^|a|b|c|"
but now it looks like it is here:
  ↓
"|^a|b|c|"
So what is going on? How does it work?
As I said earlier, splitting on empty strings is tricky. To understand this you need to look at the string without the marked empty strings, and you will see that the start of the string is here
 ↓
"^abc"
In other words, regex also considers the place right before the first character (in our case "a") as its start, so
  ↓
"|^a|b|c|"
also makes sense and is valid, which is why (?!^) is able to see this empty string
 ↓
"|^a|b|c|"
as being right before the start of the string and will not accept it as a valid place to split.
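A small sketch can confirm that (?!^) rejects only the position at index 0. NotAtStartLookAhead and its two methods are hypothetical helpers written for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NotAtStartLookAhead {
    // Start indexes of every place where (?!^) matches in input.
    static List<Integer> matchStarts(String input) {
        List<Integer> starts = new ArrayList<>();
        Matcher m = Pattern.compile("(?!^)").matcher(input);
        while (m.find()) {
            starts.add(m.start());
        }
        return starts;
    }

    // Split on every empty string except the one at the start.
    static String[] split(String input) {
        return input.split("(?!^)");
    }

    public static void main(String[] args) {
        System.out.println(matchStarts("abc"));            // [1, 2, 3]
        System.out.println(Arrays.toString(split("abc"))); // [a, b, c]
    }
}
```

Index 0 is missing from the list because ^ matches there, so the negative look-ahead fails at that position.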
ANYWAY, since this was causing confusion for developers who were not very familiar with regex, from Java 8 we don't have to use the trick with (?<=.), (?<!^) or (?!^) to avoid creating an empty string at the beginning. As described in the question
Why in Java 8 split sometimes removes empty strings at start of result array?
split now automatically removes the empty string at the start of the generated array as long as the regex used in split matches a zero-length string (like the empty string) there, so you can now use "abc".split("") and get ["a", "b", "c"] as the result.
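As a quick check of this behaviour, here is a sketch that assumes it is run on Java 8 or later (on Java 7 the first result would start with an empty string). Java8SplitDemo and toChars are illustrative names only.

```java
import java.util.Arrays;

class Java8SplitDemo {
    // Splits a string into one-character strings using the empty regex.
    static String[] toChars(String input) {
        return input.split("");
    }

    public static void main(String[] args) {
        // Zero-length match at the start: no leading empty string on Java 8+.
        System.out.println(Arrays.toString(toChars("abc")));    // [a, b, c]
        // Match with non-zero length at the start: the leading empty string stays.
        System.out.println(Arrays.toString("abc".split("a")));  // [, bc]
    }
}
```

The second call shows that only zero-length matches at the start are affected; a regex that consumes a character there still produces a leading empty string.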