John Chrysostom - 2 months ago 12
R Question

Invalid regexp in R

I'm trying to use this regexp in R:

\?(?=([^'\\]*(\\.|'([^'\\]*\\.)*[^'\\]*'))*[^']*$)


I'm escaping like so:

\\?(?=([^'\\\\]*(\\\\.|'([^'\\\\]*\\\\.)*[^'\\\\]*'))*[^']*$)


I get an invalid regexp error.

Regexpal has no problem with the regex, and I've checked that the regex R echoes back in its error message is exactly the one I'm using in Regexpal, so I'm at a loss. I don't think the escaping is the problem.

Code:

output <- sub("\\?(?=([^'\\\\]*(\\\\.|'([^'\\\\]*\\\\.)*[^'\\\\]*'))*[^']*$)", "!", "This is a test string?")

Answer

R by default uses the POSIX (Portable Operating System Interface) standard of regular expressions (see these SO posts [1,2] and ?regex [caveat emptor: machete-level density ahead]).

Look-ahead ((?=...)), look-behind ((?<=...)) and their negations ((?!...) and (?<!...)) are probably the most salient examples of PCRE-specific (Perl-Compatible Regular Expressions) forms, which are not compatible with POSIX.
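
As a minimal sketch of the difference, the snippet below runs one small look-ahead pattern through both engines. The pattern and the variable names (pattern, default_result, perl_result) are made up for illustration and are not from the question; the tryCatch wrapper is only there so the script keeps running when the default engine rejects the pattern.

```r
# Hypothetical pattern: match "a" only when it is followed by "b".
pattern <- "a(?=b)"

# Default engine: the look-ahead is expected to be rejected as an
# invalid regular expression, so the error is caught and recorded.
default_result <- tryCatch(
  sub(pattern, "X", "ab"),
  error = function(e) "error"
)

# PCRE engine: the look-ahead is understood, so only the "a" is replaced.
perl_result <- sub(pattern, "X", "ab", perl = TRUE)

print(default_result)
print(perl_result)
```

The look-ahead itself consumes nothing, which is why the "b" survives the replacement in the PCRE case.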

R can be trained to understand your regex by setting the perl argument to TRUE; this argument is available in the base pattern-matching functions (sub, gsub, grepl, regexpr, gregexpr, etc.):

output <- sub("\\?(?=([^'\\\\]*(\\\\.|'([^'\\\\]*\\\\.)*[^'\\\\]*'))*[^']*$)",
              "!", "This is a test string?", perl = TRUE)