 Amanda - 4 years ago 108
R Question

# Difference between (^|\\s)([A-Z]{1,3})(\\s|$) and \\b[A-Z]{1,2}\\b regular expressions in R

I'm trying to clean some short strings (1-3 letters) stored in a column of an R data frame. Specifically, consider the following R script:

``````
df = data.frame( "original" = c("ABCDE FG H",
                                "IJKL MN OPQRS",
                                "TUV WX YZ AAAA"))
df$filter1 = gsub("(^|\\s)[A-Z]{1,2}($|\\s)", " ", df$original)
df$filter2 = gsub("\\b[A-Z]{1,2}\\b", " ", df$original)

> df
        original |     filter1 |      filter2 |
1     ABCDE FG H |     ABCDE H |    ABCDE     |
2  IJKL MN OPQRS |  IJKL OPQRS | IJKL   OPQRS |
3 TUV WX YZ AAAA | TUV YZ AAAA |  TUV   AAAA  |
``````

I don't understand why the first filter, `(^|\\s)[A-Z]{1,2}($|\\s)`, doesn't replace "H" in the first row or "YZ" in the third one. I would expect the same result as with `\\b[A-Z]{1,2}\\b` (the filter2 column). Please don't worry about the multiple spaces; they aren't important to me (unless they are the problem :)).

I thought the problem might be the "globality" of the operation, that is, that once it finds the first match it doesn't replace the second one. But that isn't the case, as the following replacement shows:

``````
> gsub("A", "X", "AAAABBBBCCCDDDDAAAAAAAEEE")
[1] "XXXXBBBBCCCDDDDXXXXXXXEEE"
``````

So, why are the results different?

Wiktor Stribiżew

The point is that `gsub` can only match non-overlapping substrings. The first expected match is ` FG ` and the second is ` H`; these overlap, because they share the space between `FG` and `H`. Once `"(^|\\s)[A-Z]{1,2}($|\\s)"` has consumed the trailing space after `FG`, `H` no longer matches the pattern.

Look: `ABCDE FG H` is scanned from left to right. The expression matches ` FG `, which leaves the regex index right before `H`. Only this letter remains, but `(^|\s)` requires a whitespace character or the start of the string, and neither is present at this location.
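To see the non-overlapping behaviour directly, the matches themselves can be listed instead of replaced. The following is a minimal sketch using base R's `gregexpr` and `regmatches`; the variable names (`x`, `consuming`, `boundary`, `consuming_hits`, `boundary_hits`) are illustrative and not from the question. `perl = TRUE` is an assumption made here only so that both patterns run on the same PCRE engine.

```r
# Sample strings taken from the question's data frame
x <- c("ABCDE FG H", "IJKL MN OPQRS", "TUV WX YZ AAAA")

consuming <- "(^|\\s)[A-Z]{1,2}($|\\s)"  # surrounding spaces are part of each match
boundary  <- "\\b[A-Z]{1,2}\\b"          # \\b matches a position, not a character

# gregexpr() finds every non-overlapping match; regmatches() extracts the text
consuming_hits <- regmatches(x, gregexpr(consuming, x, perl = TRUE))
boundary_hits  <- regmatches(x, gregexpr(boundary,  x, perl = TRUE))

print(consuming_hits[[1]])  # only " FG ": its trailing space is used up, so "H" is skipped
print(boundary_hits[[1]])   # "FG" and "H": no space was consumed
print(consuming_hits[[3]])  # only " WX "
print(boundary_hits[[3]])   # "WX" and "YZ"
```

The consuming pattern returns matches that include the spaces, which is why the next short token has nothing left to anchor on.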

To "fix" this while keeping the same logic, you can use a PCRE regex (`perl=TRUE`) with lookarounds in `gsub`:

``````
df$filter1 = gsub("(^|\\s)[A-Z]{1,2}(?=$|\\s)", " ", df$original, perl=TRUE)
``````

or

``````
df$filter1 = gsub("(?<!\\S)[A-Z]{1,2}(?!\\S)", " ", df$original, perl=TRUE)
``````

If you need to actually consume (and so remove) the spaces, just add `\\s*` before and/or after the pattern.
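As a rough illustration of that last suggestion, the sketch below appends `\\s*` to the lookaround pattern and replaces each match with an empty string, so a short token disappears together with the whitespace that follows it. The names `x`, `pattern` and `cleaned` are hypothetical, and the final `trimws()` call is an extra step assumed here (not part of the answer) to drop a space left at the end of a string.

```r
# Sample strings taken from the question's data frame
x <- c("ABCDE FG H", "IJKL MN OPQRS", "TUV WX YZ AAAA")

# 1-2 uppercase letters not adjacent to non-whitespace, plus any trailing whitespace
pattern <- "(?<!\\S)[A-Z]{1,2}(?!\\S)\\s*"

# Remove the token and its trailing spaces, then trim what is left at the edges
cleaned <- trimws(gsub(pattern, "", x, perl = TRUE))

print(cleaned)  # "ABCDE" "IJKL OPQRS" "TUV AAAA"
```

Because the lookbehind `(?<!\\S)` only inspects the preceding character without consuming it, the next token still matches even after the space before it has been removed as part of the previous match.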

The second expression, `"\\b[A-Z]{1,2}\\b"`, uses word boundaries. They are zero-width assertions that do not consume text, so the regex engine can match both `FG` and `H`: the spaces around them are never consumed.
