Tyler Rinker - 2 months ago
R Question

regex match substring unless another substring matches

I'm trying to dig deeper into regexes and want to match a condition unless some other substring is also found in the same string. I know I can use two grepl statements (as seen below), but I want a single regex to test for this condition, as I'm pushing my understanding. Let's say I want to match strings containing both "dog" and "man" using "(dog.*man|man.*dog)" (taken from here) but not if the string contains the substring "park". I figured I could use (*SKIP)(*FAIL) to negate the "park", but this does not cause the string to fail (shown below).


  • How can I match the logic of find "dog" & "man" but not "park" with 1 regex?

  • What is wrong with my understanding of (*SKIP)(*FAIL)| ?



The code:

x <- c(
"The dog and the man play in the park.",
"The man plays with the dog.",
"That is the man's hat.",
"Man I love that dog!",
"I'm dog tired",
"The dog park is no place for man.",
"Park next to this dog's man."
)

# Could do this but want one regex
grepl("(dog.*man|man.*dog)", x, ignore.case=TRUE) & !grepl("park", x, ignore.case=TRUE)

# Thought this would work, it does not
grepl("park(*SKIP)(*FAIL)|(dog.*man|man.*dog)", x, ignore.case=TRUE, perl=TRUE)

Answer

You can use the anchored look-ahead solution (requiring Perl-style regexp):

grepl("^(?!.*park)(?=.*dog.*man|.*man.*dog)", x, ignore.case=TRUE, perl=TRUE)

Here is an IDEONE demo

  • ^ - anchors the pattern at the start of the string
  • (?!.*park) - fail the match if park is present
  • (?=.*dog.*man|.*man.*dog) - fail the match unless both man and dog are present (in either order).
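
The look-ahead pattern can be checked against the two-grepl baseline from the question. The following is a minimal sketch using the question's sample vector; the variable names baseline, lookahead and skip_fail are chosen here for illustration only.

```r
x <- c(
  "The dog and the man play in the park.",
  "The man plays with the dog.",
  "That is the man's hat.",
  "Man I love that dog!",
  "I'm dog tired",
  "The dog park is no place for man.",
  "Park next to this dog's man."
)

# Two-step baseline from the question
baseline <- grepl("(dog.*man|man.*dog)", x, ignore.case = TRUE) &
  !grepl("park", x, ignore.case = TRUE)

# Single anchored look-ahead pattern from the answer
lookahead <- grepl("^(?!.*park)(?=.*dog.*man|.*man.*dog)", x,
                   ignore.case = TRUE, perl = TRUE)

# The (*SKIP)(*FAIL) attempt from the question, kept for comparison
skip_fail <- grepl("park(*SKIP)(*FAIL)|(dog.*man|man.*dog)", x,
                   ignore.case = TRUE, perl = TRUE)

# Do the baseline and the look-ahead agree on every element?
identical(baseline, lookahead)

# Indices of strings the (*SKIP)(*FAIL) attempt accepts but the baseline rejects
which(skip_fail & !baseline)
```

On this data the (*SKIP)(*FAIL) attempt still accepts the three strings that contain "park". The verbs only discard the text matched by "park" itself; the engine then resumes searching after it (or has already matched earlier in the string), so the second alternative is free to match somewhere else. They never fail the string as a whole, which is what the anchored negative look-ahead does.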

Another version (more scalable) with 3 look-aheads:

^(?!.*park)(?=.*dog)(?=.*man)
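
One way to see why this form scales: each required or forbidden substring gets its own look-ahead, so the pattern can be assembled from word lists. The helper below is a hypothetical sketch (build_pattern is not an existing function) and assumes the words contain no regex metacharacters.

```r
# Hypothetical helper: one negative look-ahead per forbidden word,
# one positive look-ahead per required word, all anchored at the start.
# Assumes the words contain no regex metacharacters.
build_pattern <- function(required, forbidden = character(0)) {
  # sprintf() returns character(0) for empty input, so an empty list adds nothing
  neg <- paste(sprintf("(?!.*%s)", forbidden), collapse = "")
  pos <- paste(sprintf("(?=.*%s)", required), collapse = "")
  paste0("^", neg, pos)
}

pattern <- build_pattern(required = c("dog", "man"), forbidden = "park")

samples <- c(
  "The dog and the man play in the park.",
  "The man plays with the dog.",
  "I'm dog tired",
  "Park next to this dog's man."
)

# Look-aheads need the PCRE engine, hence perl = TRUE
result <- grepl(pattern, samples, ignore.case = TRUE, perl = TRUE)
```

Adding a third required word or a second forbidden one only lengthens the input vectors; with the alternation form, every ordering of the required words would have to be spelled out.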