Fisseha Berhane - 1 month ago
R Question

Using regular expressions to combine words in R

I have unstructured text and want to join certain words so that the concept they express is preserved for my text-mining task. For example, in the strings below, I want to change "High pressure" into "High_pressure", "not working" into "not_working", and "No air" into "No_air".

Example text

c(" High pressure was the main problem in the machine","the system is not working right now","No air in the system")


List of words

c('low', 'high', 'no', 'not')


Desired output

# [1] " High_pressure was the main problem in the machine"
# [2] "the system is not_working right now"
# [3] "No_air in the system"

Answer

First, save the text input and the list of modifier words you want to join to the following word:

textIn <- 
  c(" High pressure was the main problem in the machine","the system is not working right now","No air in the system")

prefix <- c("high", "low", "no", "not")

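Then build a regex that captures any of those words followed by a space, and replace the space with an underscore. The \b word boundary ensures the pattern does not match the same letters at the end of a longer word, e.g. the "low" in "slow". Setting ignore.case = TRUE lets the lowercase list match "High" and "No" as well.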

# Pattern: \b(high|low|no|not) followed by a single space
pattern <- paste0("\\b(", paste(prefix, collapse = "|"), ") ")
gsub(pattern, "\\1_", textIn, ignore.case = TRUE)

returns

[1] " High_pressure was the main problem in the machine"
[2] "the system is not_working right now"
[3] "No_air in the system"
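
To see why the \b matters, here is a small sketch that runs the same substitution with and without the word boundary. The vector check and the names alt, without_b and with_b are made up for this illustration and are not part of the question's data.

```r
prefix <- c("high", "low", "no", "not")
alt    <- paste(prefix, collapse = "|")

# Hypothetical test strings: "slow" ends in one of the listed words
check <- c("a slow machine", "low pressure")

# Without \b, "low" also matches inside "slow"
without_b <- gsub(paste0("(", alt, ") "), "\\1_", check, ignore.case = TRUE)
without_b
# [1] "a slow_machine" "low_pressure"

# With \b, only whole words from the list are joined to the next word
with_b <- gsub(paste0("\\b(", alt, ") "), "\\1_", check, ignore.case = TRUE)
with_b
# [1] "a slow machine" "low_pressure"
```

Note that the pattern expects exactly one space after the word; if your text may contain repeated spaces or tabs, you would probably want to normalise whitespace first.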