Amanda Amanda - 3 months ago 7
R Question

Difference between (^|\\s)([A-Z]{1,3})(\\s|$) and \\b[A-Z]{1,2}\\b regular expressions in R

I'm trying to clean out some short strings (1-3 letters) stored in a column of an R data frame. Specifically, consider the following R script:

df = data.frame( "original" = c("ABCDE FG H",
"IJKL MN OPQRS",
"TUV WX YZ AAAA"))
df$filter1 = gsub("(^|\\s)[A-Z]{1,2}($|\\s)", " ", df$original)
df$filter2 = gsub("\\b[A-Z]{1,2}\\b", " ", df$original)

> df

original | filter1 | filter2 |
1 ABCDE FG H | ABCDE H | ABCDE |
2 IJKL MN OPQRS | IJKL OPQRS | IJKL OPQRS|
3 TUV WX YZ AAAA | TUV YZ AAAA| TUV AAAA |


I don't understand why the first filter
(^|\\s)[A-Z]{1,2}($|\\s)
doesn't replace "H" in the first row or "YZ" in the third one. I would expect the same result as when using
\\b[A-Z]{1,2}\\b
as the filter (the filter2 column). Please don't worry about the multiple spaces; they aren't important to me (unless they are the problem :)).

I thought the problem might be the "globality" of the operation, that is, that once it finds the first match it does not replace the second one. But that isn't true, as the following replacement shows:

> gsub("A", "X", "AAAABBBBCCCDDDDAAAAAAAEEE")
[1] "XXXXBBBBCCCDDDDXXXXXXXEEE"


So, why are the results different?

Answer

The point is that gsub can only find non-overlapping matches. FG is the first expected match and H the second, but these two matches would overlap because both need the space between them. Once "(^|\\s)[A-Z]{1,2}($|\\s)" consumes the trailing space after FG, H no longer matches the pattern.

Look: ABCDE FG H is analyzed from left to right. The expression matches FG together with the spaces on both sides, and the regex index is then right before H. Only this letter is left to match, but (^|\s) requires a whitespace character or the start of the string, and there is neither at this location.
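A minimal sketch of that walk-through, using "_" as the replacement so that the consumed characters become visible. The variable names and the second input (with two spaces before H) are made up for illustration:

```r
# Replace each match with "_" to show exactly what the pattern consumes.
consuming <- "(^|\\s)[A-Z]{1,2}($|\\s)"

# One space between FG and H: the match " FG " takes that space,
# so nothing is left in front of H for (^|\\s) to match.
one_space <- gsub(consuming, "_", "ABCDE FG H")

# Two spaces between FG and H: the first space ends the FG match,
# the second one is still available to start the H match.
two_spaces <- gsub(consuming, "_", "ABCDE FG  H")

print(one_space)   # "ABCDE_H"
print(two_spaces)  # "ABCDE__"
```

With a second space available, both short strings are replaced, which suggests the shared space is the only thing preventing the second match.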

To "fix" this while keeping the same logic, you can use a PCRE regex (perl=TRUE) in gsub with lookarounds:

df$filter1 = gsub("(^|\\s)[A-Z]{1,2}(?=$|\\s)", " ", df$original, perl=TRUE)

or

df$filter1 = gsub("(?<!\\S)[A-Z]{1,2}(?!\\S)", " ", df$original, perl=TRUE)

and if you actually need to consume (remove) the spaces, just add \\s* before and/or after the pattern.
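As a rough illustration of the \\s* variant, the sketch below applies the lookaround pattern to the three strings from the question, once as is and once with \\s* in front. The names x, plain and trimmed are only for this example:

```r
x <- c("ABCDE FG H", "IJKL MN OPQRS", "TUV WX YZ AAAA")

# Lookarounds only assert the surrounding whitespace; they do not consume it,
# so the spaces around each removed token stay in the result.
plain <- gsub("(?<!\\S)[A-Z]{1,2}(?!\\S)", "", x, perl = TRUE)

# Adding \\s* in front also removes the whitespace that precedes each token.
trimmed <- gsub("\\s*(?<!\\S)[A-Z]{1,2}(?!\\S)", "", x, perl = TRUE)

print(plain)    # "ABCDE  "  "IJKL  OPQRS"  "TUV   AAAA"
print(trimmed)  # "ABCDE"    "IJKL OPQRS"   "TUV AAAA"
```

Putting \\s* after the pattern instead would remove the whitespace that follows each token; which side to use depends on where stray spaces are acceptable.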

The second expression, "\\b[A-Z]{1,2}\\b", contains word boundaries, which are zero-width assertions that do not consume text. Thus the regex engine can match both FG and H, since the space between them is not consumed.
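One caveat worth checking on your own data: a word boundary is not the same as "surrounded by whitespace". The sketch below is only an illustration; the "AB-CDE" input is invented, and perl = TRUE is used here just to keep the engine the same for both patterns:

```r
word_boundary        <- "\\b[A-Z]{1,2}\\b"
whitespace_delimited <- "(?<!\\S)[A-Z]{1,2}(?!\\S)"

# On space-separated tokens, \\b matches both FG and H.
wb_spaces <- gsub(word_boundary, "_", "ABCDE FG H", perl = TRUE)

# \\b also matches next to punctuation, because "-" is not a word character.
wb_hyphen <- gsub(word_boundary, "_", "AB-CDE", perl = TRUE)

# The whitespace lookarounds leave the hyphenated token alone.
ws_hyphen <- gsub(whitespace_delimited, "_", "AB-CDE", perl = TRUE)

print(wb_spaces)  # "ABCDE _ _"
print(wb_hyphen)  # "_-CDE"
print(ws_hyphen)  # "AB-CDE"
```

If the column only ever contains letters separated by spaces, the two approaches should give the same replacements.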