syre syre - 14 days ago 7
R Question

Grep in R to find words with custom "extended" boundaries

I'm looking for a regular expression to grep whole words, including words separated by digits or underscore.

\\b considers digits and underscore as parts of words, not as boundaries.

For example, I'd like to catch MOUSE in "DOG MOUSE CAT", in "DOG MOUSE:CAT" but also in "DOG_MOUSE9CAT" and at the end or the beginning of an expression, as in "MOUSE9CAT" and "DOG_MOUSE". Basically, the boundary I'm looking for is any non-uppercase-alpha character plus beginning and end of line/expression (maybe missing some other cases caught by \\b here).

I've tried:

"[[0-9_]\\b]MOUSE[[0-9_]\\b]"
"[[0-9_]|\\b]MOUSE[[0-9_]|\\b]"
"[$|[^A-Z]]MOUSE[^|[^A-Z]]"
"[?<=^|[^A-Z]]MOUSE[?=$|[^A-Z]]"


None of them work.

I'm actually looking for several words (based on a long vector of values), so the final result should look something like

grep(paste0("\\b", paste(searchwords, collapse = "\\b|\\b"), "\\b"), targettext)


(with a different delimiter because \\b is too restrictive for me).

(This is similar to the question asked by user Nick Sabbe in a comment here: Using grep in R to find strings as whole words (but not strings as part of words))

Answer

Use PCRE regex with lookarounds:

grep("(?<![A-Z])MOUSE(?![A-Z])", targettext, perl=TRUE)

See the regex demo

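The negative lookbehind (?<![A-Z]) fails the match if the word is immediately preceded by an uppercase ASCII letter, and the negative lookahead (?![A-Z]) fails the match if the word is immediately followed by one. Both lookarounds succeed at the start or end of the string, where there is no adjacent character, so no separate ^ or $ alternative is needed.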
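As a quick check of that behaviour, here is a minimal sketch. The test strings are made up for illustration and are not from the question; only the pattern comes from the answer.

```r
# Illustrative inputs (assumed, not from the original post)
x <- c("DOG_MOUSE9CAT", "MOUSE", "MOUSETRAP", "DORMOUSE", "mouseMOUSE")

# Pattern from the answer: MOUSE not adjacent to an uppercase ASCII letter
pattern <- "(?<![A-Z])MOUSE(?![A-Z])"

# perl = TRUE is required: lookarounds are a PCRE feature
result <- grepl(pattern, x, perl = TRUE)
print(result)
# TRUE TRUE FALSE FALSE TRUE
```

"MOUSETRAP" is rejected by the lookahead and "DORMOUSE" by the lookbehind, while "mouseMOUSE" still matches because a lowercase letter counts as a boundary here.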

To apply the lookarounds to all the alternatives you have, use an outer grouping (?:...|...).

See the R online demo:

> targettext <- c("DOG MOUSE CAT","DOG MOUSE:CAT","DOG_MOUSE9CAT","MOUSE9CAT","DOG_MOUSE")
> searchwords <- c("MOUSE","FROG")
> grep(paste0("(?<![A-Z])(?:", paste(searchwords, collapse = "|"), ")(?![A-Z])"), targettext, perl=TRUE)
[1] 1 2 3 4 5
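
To see why the outer (?:...) group matters, the following sketch compares the grouped pattern with an ungrouped one. The vector x and the variable names ungrouped and grouped are assumptions made up for this illustration. Without the group, the alternation splits the pattern into "(?<![A-Z])MOUSE" or "FROG(?![A-Z])", so each lookaround guards only one alternative.

```r
searchwords <- c("MOUSE", "FROG")

# Illustrative inputs (assumed): only the last one contains a whole word
x <- c("MOUSETRAP", "BULLFROG", "A_FROG_B")

# Ungrouped: "(?<![A-Z])MOUSE|FROG(?![A-Z])"
ungrouped <- paste0("(?<![A-Z])", paste(searchwords, collapse = "|"), "(?![A-Z])")

# Grouped: "(?<![A-Z])(?:MOUSE|FROG)(?![A-Z])"
grouped <- paste0("(?<![A-Z])(?:", paste(searchwords, collapse = "|"), ")(?![A-Z])")

res_ungrouped <- grepl(ungrouped, x, perl = TRUE)
res_grouped   <- grepl(grouped, x, perl = TRUE)

print(res_ungrouped)  # TRUE TRUE TRUE   (false positives on the first two)
print(res_grouped)    # FALSE FALSE TRUE
```

The ungrouped pattern accepts "MOUSETRAP" because nothing checks what follows MOUSE, and accepts "BULLFROG" because nothing checks what precedes FROG.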