Tom Bailey - 13 days ago
R Question

Extract string between multiple words, using gsub

I am trying to isolate words from a string in R using gsub. I want to extract the last name, which sits between the parenthesized first name and either "(m)" (for males) or "(f)" (for females). I am struggling to do this in one line of code.

name<-c("Dr. T. (Tom) Bailey (m), UCL- Physics" , "Dr. B.K. (Barbara) Blue (f), Oxford - Political Science")

malename<-gsub(".*\\) (.*) \\(m).*", "\\1", name)
femname<-gsub(".*\\) (.*) \\(f).*", "\\1", name)


The code above gives me the names for males and females separately, but ideally I want to obtain their last names in one variable. This would require some kind of OR (so (m) OR (f)), but I don't know how to incorporate it.

Answer

If you need to match either m or f, the best way to match them is a character class (or, in POSIX terminology, a bracket expression): [mf].

Your regex will look like

".*\\)\\s+(.*)\\s+\\([mf]\\).*"
                     ^^^^
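
As a minimal sketch of what the bracket expression contributes, the snippet below compares it with an equivalent alternation group. The input vector x and the variable names are made up for illustration and are not part of the question.

```r
# Hypothetical inputs, chosen only to illustrate the two patterns
x <- c("Bailey (m)", "Blue (f)", "Grey (x)")

# Bracket expression: exactly one character, either m or f
by_class <- grepl("\\([mf]\\)", x)

# Alternation group: same matches here, but more verbose
by_alt <- grepl("\\((m|f)\\)", x)

by_class
## => [1]  TRUE  TRUE FALSE
identical(by_class, by_alt)
## => [1] TRUE
```

For single-character alternatives the bracket expression is usually the simpler choice; alternation becomes necessary only when the alternatives are longer than one character.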

Use this regex with sub rather than gsub, so that at most one match and replacement is performed per string:

name<-c("Dr. T. (Tom) Bailey (m), UCL- Physics" , "Dr. B.K. (Barbara) Blue (f), Oxford - Political Science")
res <- sub(".*\\)\\s+(.*)\\s+\\([mf]\\).*", "\\1", name)
res
## => [1] "Bailey" "Blue"  
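
One caveat worth sketching: sub returns its input unchanged when the pattern does not match, so an entry without an (m) or (f) marker would come back as the full original string. The example below is one possible way to handle that, assuming NA is an acceptable placeholder; the third entry and the variable names pattern and lastname are hypothetical additions for illustration.

```r
name <- c("Dr. T. (Tom) Bailey (m), UCL- Physics",
          "Dr. B.K. (Barbara) Blue (f), Oxford - Political Science",
          "Prof. A. Nother, no gender marker")  # hypothetical entry without a marker

pattern <- ".*\\)\\s+(.*)\\s+\\([mf]\\).*"

# Extract the captured group only where the pattern matches;
# mark everything else as NA instead of keeping the full string.
lastname <- ifelse(grepl(pattern, name),
                   sub(pattern, "\\1", name),
                   NA_character_)
lastname
## => [1] "Bailey" "Blue"   NA
```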