tia_0 - 9 months ago
R Question

R - finding ordered patterns in strings with grep

I want to search for particular patterns in a set of strings.

Given these two vectors of strings:

actions <- c("taking","using")

nouns <- c("medication","prescription")

I want to find any combination of action + noun, in this particular order, not noun + action. For example, I want to detect combinations such as:

  • using medication

  • taking medication

  • using prescription

Using the following text:

phrases <- c("he was using medication",
"medication using it",
"finding medication",
"taking the left",
"using prescription medication",
"taking medication drug")

I have tried using
grep("\\b(taking|using+medication|prescriptio)\\b",phrases,value = FALSE)
but it's clearly wrong.

Answer

You may build the alternation groups using your actions and nouns values and put them into a bigger regular expression:

actions <- c("taking", "using")
nouns <- c("medication", "prescription")
phrases <- c("he was using medication", "medication using it", "finding medication",
             "taking the left", "using prescription medication", "taking medication drug")
# Join each vector into an alternation group; \\s+ requires whitespace between the two words
pattern <- paste0("(", paste(actions, collapse = "|"), ")\\s+(", paste(nouns, collapse = "|"), ")")
grep(pattern, phrases, value = FALSE)
## => [1] 1 5 6
## and a visual check
grep(pattern, phrases, value = TRUE)
## => [1] "he was using medication" "using prescription medication" "taking medication drug"
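Note that the pattern above matches substrings, so a phrase such as "misusing medication" would also be reported. If whole words are required, as the \\b in the question's attempt suggests, one possible sketch is to wrap both groups in word boundaries. The helper name build_pattern and the extra test phrases are assumptions made for illustration; they are not part of the original answer.

```r
actions <- c("taking", "using")
nouns   <- c("medication", "prescription")

# Hypothetical helper (name is an assumption): joins each vector into an
# alternation group and requires whole-word matches via \\b on both sides.
build_pattern <- function(actions, nouns) {
  paste0("\\b(", paste(actions, collapse = "|"), ")\\s+(",
         paste(nouns, collapse = "|"), ")\\b")
}

pattern <- build_pattern(actions, nouns)

# Illustrative phrases, not taken from the question
extra <- c("misusing medication",
           "using medications",
           "she is taking   prescription pills")

# perl = TRUE selects the PCRE engine for \\b and \\s
grepl(pattern, extra, perl = TRUE)
## => FALSE FALSE TRUE
```

The first two phrases are rejected because "using" sits inside "misusing" and "medication" is followed by another word character; the third still matches because \\s+ accepts any run of whitespace.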


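The resulting regex will look like (taking|using)\s+(medication|prescription); inside an R string the backslash is doubled, giving \\s.
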
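To see which action + noun pair was found in each phrase, rather than only the index or the whole phrase, the same pattern can be passed to regexpr() and regmatches() from base R. This is a minimal sketch that reuses the question's data; the variable names m and hits are arbitrary.

```r
actions <- c("taking", "using")
nouns   <- c("medication", "prescription")
phrases <- c("he was using medication", "medication using it",
             "finding medication", "taking the left",
             "using prescription medication", "taking medication drug")

pattern <- paste0("(", paste(actions, collapse = "|"), ")\\s+(",
                  paste(nouns, collapse = "|"), ")")

# regexpr() gives the position of the first match in each phrase (-1 if none)
m <- regexpr(pattern, phrases)

# regmatches() returns the matched text, only for the phrases that matched
hits <- regmatches(phrases, m)
hits
## => "using medication" "using prescription" "taking medication"

# Indices of the matching phrases, equivalent to grep(pattern, phrases)
which(m != -1)
## => 1 5 6
```

Only the first match per phrase is returned, so "using prescription medication" yields "using prescription".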