Pierre Laurent - 11 days ago
R Question

Using R, how do I use str_extract properly in this case?

I've learned from Ronak Shah and akrun (in this post) how to construct a regular expression that excludes every term from a data frame (alldata in my example) except these words:


^\BWORD1|WORD2|WORD3|WORD4|WORD5\>


but for some reason I can't figure out why it is giving me


"WORD2", "WORD3", NA


instead of


"WORD1 WORD2 WORD5", "WORD3", NA


Here is my example:

library(stringr)
alldata <- data.frame(toupper(c("word1 anotherword word2 word5", "word3", "none")))
names(alldata)<-"columna"
removeex <- c("word1" , "word2" ,"word3" ,"word4", "word5")
regularexprex <- toupper(paste0("^\\b",paste0(removeex, collapse = "|"), "\\>"))
alldata$columnb <- str_extract(alldata$columna, regularexprex)


I've tried adding + or * at the end of the regular expression, but without any effect.

I'm a beginner with regex, so I am surely missing something. Could someone guide me on this?
Regards,

Answer

You need to replace the last two lines of your code above with

> regularexprex <- paste0("(?i)\\s*\\b(?!(?:",paste0(removeex, collapse = "|"), ")\\b)\\w+")
## => "(?i)\\s*\\b(?!(?:word1|word2|word3|word4|word5)\\b)\\w+"
> str_replace_all(alldata$columna, regularexprex, "")
[1] "WORD1 WORD2 WORD5" "WORD3"             ""   
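
The replacement above returns "" for the third element, where the question asked for NA, and removing the first word of a string can leave a leading space behind. A minimal sketch of a follow-up step, assuming stringr's str_trim() is acceptable here; the fourth row ("other word2") is a hypothetical input added only to show the trimming:

```r
library(stringr)

# Same data as in the question, plus one hypothetical row ("other word2")
alldata <- data.frame(
  columna = toupper(c("word1 anotherword word2 word5", "word3", "none", "other word2")),
  stringsAsFactors = FALSE
)
removeex <- c("word1", "word2", "word3", "word4", "word5")
regularexprex <- paste0("(?i)\\s*\\b(?!(?:", paste0(removeex, collapse = "|"), ")\\b)\\w+")

cleaned <- str_replace_all(alldata$columna, regularexprex, "")
cleaned <- str_trim(cleaned)   # drop the space left when the first word is removed
cleaned[cleaned == ""] <- NA   # nothing from the list left: report NA instead of ""
alldata$columnb <- cleaned
print(alldata)
```

Whether an empty result should become NA is a choice for the caller; the line that does it can simply be dropped.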

First, toupper() turned \b into \B (a non-word boundary). You only need case-insensitive matching, so I added the (?i) modifier instead. Second, the anchors were not applied to the whole group of alternatives, only to the items at either end: ^\B applied to WORD1 and \> to WORD5.
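
Both problems can be seen by printing the pattern strings. The sketch below uses base R only; the names alternatives, broken and grouped are illustrative and do not appear in the question:

```r
removeex <- c("word1", "word2", "word3", "word4", "word5")
alternatives <- paste0(removeex, collapse = "|")

# toupper() upper-cases the whole string, escapes included: \b becomes \B
broken <- toupper(paste0("^\\b", alternatives, "\\>"))
cat(broken, "\n")

# Without a group, ^\B binds only to WORD1 and \> only to WORD5.
# A non-capturing group makes the word boundaries apply to every alternative,
# and (?i) handles case without touching the escapes.
grouped <- paste0("(?i)\\b(?:", alternatives, ")\\b")
cat(grouped, "\n")
```

The first cat() call prints the same pattern shown in the question, with \B in place of \b.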

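Also, str_extract returns only the first match in each string, so it cannot give you several words at once. That is why the code above removes every word that is not in the list with str_replace_all: the pattern matches the unwanted words rather than the wanted ones.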
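
For comparison, an extraction-based route is also possible. This is a sketch of an alternative, not the method used in this answer: it relies on str_extract_all(), which returns every match per string, and pastes the matches back together. The names keep, hits and columnb are illustrative.

```r
library(stringr)

columna  <- toupper(c("word1 anotherword word2 word5", "word3", "none"))
removeex <- c("word1", "word2", "word3", "word4", "word5")

# Positive pattern: any listed word as a whole word, case-insensitive
keep <- paste0("(?i)\\b(?:", paste0(removeex, collapse = "|"), ")\\b")

first_only <- str_extract(columna, keep)   # one match per string at most
hits <- str_extract_all(columna, keep)     # list: all matches per string

# Collapse each set of matches into one string; use NA when there is none
columnb <- vapply(
  hits,
  function(h) if (length(h) == 0) NA_character_ else paste(h, collapse = " "),
  character(1)
)
print(columnb)
```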

The final regex for replacing looks like

(?i)\s*\b(?!(?:word1|word2|word3|word4|word5)\b)\w+

The s modifier only affects what . matches. If you extend the pattern with a dot and your entries contain newlines, add it alongside i: (?i) -> (?is).

Details:

  • (?i) - case insensitive modifier (works with PCRE and ICU regexes)
  • \s* - 0+ whitespaces
  • \b - a leading word boundary
  • (?!(?:word1|word2|word3|word4|word5)\b) - a negative lookahead: the word that follows cannot equal word1, etc.
  • \w+ - 1+ word chars (letters, digits or underscores).
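
The pieces listed above can be checked on a small input. The sketch below uses a made-up string (not from the question) to show why the \b inside the lookahead matters: word10 is not treated as word1, so it is matched and removed, while Word2 is kept thanks to (?i).

```r
library(stringr)

removeex <- c("word1", "word2", "word3", "word4", "word5")
regularexprex <- paste0("(?i)\\s*\\b(?!(?:", paste0(removeex, collapse = "|"), ")\\b)\\w+")

x <- "word1 word10 Word2"   # hypothetical test string

# What the pattern matches, i.e. what will be removed (leading space included)
removed <- str_extract_all(x, regularexprex)[[1]]
print(removed)

# What is left after the replacement
kept <- str_replace_all(x, regularexprex, "")
print(kept)
```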