Dmitry Leykin - 3 months ago
R Question

gsub with exception in R

I'm removing English characters from Hebrew text but would like to keep a short list of English words that I want, e.g.

words2keep <- c("ok", "hello", "yes*")

My current regex is

text <- gsub("[A-Z,a-z]", "", text)

but the question is how to add an exception so that it does not remove all English words.

Reproducible example:

text = "ok אני מסכים איתך Yossi Cohen"


Desired result after gsub with the exception:

text = "ok אני מסכים איתך"


Thank you for all suggestions

Answer

This is a tricky one. I think we can do it by matching whole words with the \b word boundary assertion, and placing a negative lookahead assertion just before the match that rejects the words (again, whole words) you want to keep. In other words, words2keep acts as a whitelist for preservation, and every other run of English letters is removed. Lookaheads need the PCRE engine, hence perl=TRUE. This appears to be working:

pattern <- paste0('(?!\\b', paste(words2keep, collapse = '\\b|\\b'), '\\b)\\b[A-Za-z]+\\b')
gsub(pattern, '', text, perl = TRUE)
[1] "ok אני מסכים איתך  "
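
The result above still carries the spaces that surrounded the removed words. A minimal sketch of one way to wrap the same lookahead idea in a helper and tidy the leftover whitespace follows. The function name remove_english_except and its argument names are hypothetical, chosen only for illustration, and the whitespace cleanup is an addition that is not part of the original answer.

```r
# Hypothetical helper: remove runs of ASCII letters except whitelisted words.
# Each entry of `keep` is treated as a regular expression, as in the answer above.
remove_english_except <- function(text, keep) {
  # Wrap every whitelisted word in word boundaries and join with "|"
  keep_alt <- paste0("\\b", keep, "\\b", collapse = "|")
  # The negative lookahead rejects any position where a whitelisted word starts
  pattern <- paste0("(?!", keep_alt, ")\\b[A-Za-z]+\\b")
  # Lookaheads need the PCRE engine, hence perl = TRUE
  stripped <- gsub(pattern, "", text, perl = TRUE)
  # Assumption: collapsing and trimming leftover whitespace is wanted
  trimws(gsub("\\s+", " ", stripped))
}

words2keep <- c("ok", "hello", "yes*")
text <- "ok אני מסכים איתך Yossi Cohen"
remove_english_except(text, words2keep)
# [1] "ok אני מסכים איתך"
```

Because the entries are regular expressions, "yes*" matches "ye", "yes", "yess" and so on, but not longer words such as "yesterday". If the asterisk was meant as a wildcard for any word starting with "yes", an entry like "yes[A-Za-z]*" would probably be closer to the intent. The match is also case-sensitive, so "Ok" would be removed unless it is listed as well.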