brittenb brittenb - 8 months ago 50
R Question

How to convert character to upper case using perl regex and | operator with gsub in R

Let's say I have the following strings:

x = c("123 w. main ave., city, st", "mr. smith", "456 main st.")

I want to capitalize certain portions of each string that I know should be capitalized. I thought I could achieve this using gsub with the following approach:

gsub("(m)(rs?\\. )|( a)(ve\\.[\\s,])|( s)(t\\.[\\s,$])", "\\U\\1\\L\\2", x, perl=T)

However, this results in the following:

# [1] "123 w. main city, st" "Mr. smith" "456 main"

In the first string, the matched text was removed because the groups that matched there were 3 and 4, while the replacement only refers to groups 1 and 2, which were empty. In the second string it works as intended, since it matched groups 1 and 2. In the third string it did the same as the first for the same reason.

My desired outcome would be the following:

# [1] "123 w. main Ave., city, st" "Mr. smith" "456 main St."

My question, then, is how do you tell regex to replace with the groups that it found? Do I have to do a different regex for each instance?


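I suggest using a branch reset group ((?|...|...)). Also, since the $ seems to be meant as the end of the string, you need an alternation group (?:[\s,]|$) rather than the [\s,$] character class: inside a character class, $ matches a literal dollar sign, not the end of the string.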


x = c("123 w. main ave., city, st", "mr. smith", "456 main st.")
gsub("(?|(m)(rs?\\. )|( a)(ve\\.[\\s,])|( s)(t\\.(?:[\\s,]|$)))", "\\U\\1\\L\\2", x, perl=T)
## => [1] "123 w. main Ave., city, st" "Mr. smith" "456 main St." 
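
To illustrate the point about $ inside a character class, here is a small sketch. The second test string, with a literal dollar sign after "st.", is a made-up example and not part of the question's data; the variable names are only for illustration.

```r
# Inside a character class, `$` is a literal dollar sign, not an anchor.
tests <- c("456 main st.",    # "st." is the last thing in the string
           "456 main st.$")   # hypothetical: a literal "$" follows "st."

in_class    <- "t\\.[\\s,$]"        # needs one more character: whitespace, "," or "$"
alternation <- "t\\.(?:[\\s,]|$)"   # whitespace or ",", or the end of the string

class_hits <- grepl(in_class, tests, perl = TRUE)
alt_hits   <- grepl(alternation, tests, perl = TRUE)

print(class_hits)  # FALSE  TRUE
print(alt_hits)    # TRUE  FALSE
```

The character class only matches when a real character follows "st.", which is why the alternation is needed to cover the end of the string.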


Thanks to the branch reset group, the capturing groups are numbered starting from 1 in each separate branch, so \1 and \2 in the replacement always refer to the groups of whichever branch matched.
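
To see the group numbering directly, here is a rough sketch that prints the groups in brackets instead of changing case. It uses a shortened, two-branch version of the pattern, and the bracketed replacement strings are just a debugging aid I made up for this illustration.

```r
s <- "123 w. main ave., city, st"

# Two-branch versions of the pattern, without and with branch reset
plain <- "(m)(rs?\\. )|( a)(ve\\.[\\s,])"
reset <- "(?|(m)(rs?\\. )|( a)(ve\\.[\\s,]))"

# Without branch reset the " ave.," match fills groups 3 and 4;
# groups 1 and 2 did not take part and come back empty.
plain_groups <- gsub(plain, "[\\1|\\2|\\3|\\4]", s, perl = TRUE)
print(plain_groups)  # "123 w. main[|| a|ve.,] city, st"

# With branch reset the same match fills groups 1 and 2.
reset_groups <- gsub(reset, "[\\1|\\2]", s, perl = TRUE)
print(reset_groups)  # "123 w. main[ a|ve.,] city, st"

# So the original replacement string now works for every branch.
fixed <- gsub(reset, "\\U\\1\\L\\2", s, perl = TRUE)
print(fixed)         # "123 w. main Ave., city, st"
```

So a single regex is enough; there is no need to run a separate gsub for each case.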