I'm removing English characters from Hebrew text, but I would like to keep a short list of English words, e.g.:
words2keep <- c("ok", "hello", "yes*")
text <- gsub("[A-Z,a-z]", "", text)
Input:
text = "ok אני מסכים איתך Yossi Cohen"
Desired output:
text = "ok אני מסכים איתך"
This is a tricky one. I think we can do it by matching whole words with the \b word boundary assertion, and by placing a negative lookahead assertion just before the match that rejects the words (again, as whole words) that you want to preserve. In effect, the lookahead whitelists those words so they are never matched for removal. This appears to be working:
gsub(paste0('(?!\\b', paste(collapse='\\b|\\b', words2keep), '\\b)\\b[A-Za-z]+\\b'), '', text, perl=TRUE)
# "ok אני מסכים איתך "
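For reuse, the same idea can be wrapped in a small function. This is only a sketch: the names `remove_english` and `keep` are hypothetical, and the whitespace cleanup at the end is an extra step that is not part of the one-liner above. Note that each entry in `keep` is pasted into the pattern as a regex fragment, so "yes*" keeps "ye", "yes", "yess", and so on.

```r
# Hypothetical helper: remove ASCII-letter words except those listed in `keep`.
remove_english <- function(text, keep) {
  # Negative lookahead that fails whenever a whole kept word starts here.
  guard <- paste0("(?!\\b(?:", paste(keep, collapse = "|"), ")\\b)")
  # Any other whole word made of ASCII letters is matched and removed.
  pattern <- paste0(guard, "\\b[A-Za-z]+\\b")
  out <- gsub(pattern, "", text, perl = TRUE)
  # Extra step (an assumption about what is wanted): collapse leftover
  # runs of spaces and trim both ends.
  trimws(gsub(" +", " ", out))
}

words2keep <- c("ok", "hello", "yes*")
text <- "ok אני מסכים איתך Yossi Cohen"
remove_english(text, words2keep)
```

With the cleanup step, the result for the example input should match the desired output from the question, without the trailing space. Matching is case-sensitive, so "OK" would still be removed unless it is added to the list.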