DV Hughes - 1 month ago
R Question

Regex for sub-string capture, with exception that sub-string may or may not be bound by escaped double-quotes

Suppose I have the following strings:


  1. "LAW Nº 1234/1998 - DATE 01/01/1998\"LAW TITLE HERE\"."

  2. "LAW Nº 1234/1998 - DATE 01/01/1998LAW TITLE HERE MAY CONTAIN 4|_|D4NUM3R!C OR P_NC7U@7|()N"



Explanation:


  • A. I have some unique identifier in the sub-string ("LAW Nº
    NNNN/YYYY") followed by " - "

  • B. Then the DATE identifier, preceded by the word "DATE"

  • C. Then a standard continental-format date ("DD/MM/YYYY")

  • D. Finally a sub-string containing a document title



Note: The exception is that title sub-strings may or may not be contained in double-quotes.

Note: Titles may begin with (or contain) alphabetic and numeric characters as well as punctuation: they may end with a full stop (period) and may contain commas, semicolons, or colons, among other punctuation.

My question: How can I most efficiently modify the Perl-like regular expression below so that it also handles titles that are not enclosed in double quotes? In short, I want to extract the title sub-string from both types of strings listed above, whether or not it is enclosed in double quotes.

Current, Perl-like, regular expression:


'(?<=DATE \d{2}/\d{2}/\d{4}(\"|\S))(.*)$'


Sample code & Data:

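# Sample strings: the title is quoted in s1 and unquoted in s2
s1 <- "LAW Nº 1234/1998 - DATE 01/01/1998\"LAW TITLE HERE MAY CONTAIN 4|_|D4NUM3R!C OR P_NC7U@7|()N\"."
s2 <- "LAW Nº 1234/1998 - DATE 01/01/1998LAW TITLE HERE MAY CONTAIN 4|_|D4NUM3R!C OR P_NC7U@7|()N"

# Lookbehind for the date followed by a double quote or a non-whitespace character
p <- '(?<=DATE \\d{2}/\\d{2}/\\d{4}(\"|\\S))(.*)$'

# Locate the matches with the PCRE engine
m1 <- regexpr(p, s1, perl = TRUE)
m2 <- regexpr(p, s2, perl = TRUE)

# Extract the matched text
t1 <- regmatches(s1, m1)
t2 <- regmatches(s2, m2)

print(t1)
print(t2)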


Returns:



  1. "LAW Nº 1234/1998 - DATE 01/01/1998\"LAW TITLE HERE MAY CONTAIN
    4|_|D4NUM3R!C OR P_NC7U@7|()N\"."


  2. "AW TITLE HERE MAY CONTAIN 4|_|D4NUM3R!C OR P_NC7U@7|()N"



Current implementation problems, fixes needed:


  1. String 1 has a final '\"' which is an escaped double-quote that
    needs to be excluded from final output.

  2. The current regular expression drops the first
    non-whitespace character after the date.



Desired output (same output from both sub-strings):



  1. "LAW TITLE HERE MAY CONTAIN 4|_|D4NUM3R!C OR P_NC7U@7|()N."

  2. "LAW TITLE HERE MAY CONTAIN 4|_|D4NUM3R!C OR P_NC7U@7|()N."




R Session-Info (base R, no additional packages):


R version 3.2.4 (2016-03-10) Platform: x86_64-apple-darwin13.4.0
(64-bit) Running under: OS X 10.11.5 (El Capitan)

Answer

The best approach with PCRE here is to use a branch reset group, (?|...|...), with a capturing group in each branch, so that the title always lands in Group 1 whichever branch matches.

However, the regexec function, which helps extract captured group values in R, does not accept a perl=TRUE argument (at least in the R version shown in the question), nor can a branch reset be used with the ICU regex flavor behind stringr's str_match / str_match_all.
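One possible base R workaround, offered as a sketch and not as part of the original answer: regexpr() called with perl=TRUE attaches "capture.start" and "capture.length" attributes to its result, and these can be used to cut Group 1 out of each string. The shortened sample strings and the names x, p, st, len and titles are illustrative assumptions.

```r
# \u00ba is the º character, written as an escape so the sketch does not
# depend on the encoding of the source file.
x <- c(
  "LAW N\u00ba 1234/1998 - DATE 01/01/1998\"LAW TITLE HERE\".",
  "LAW N\u00ba 1234/1998 - DATE 01/01/1998LAW TITLE HERE",
  "Something that does not match"
)

# Branch reset: both alternatives write into Group 1.
p <- "DATE \\d{2}/\\d{2}/\\d{4}(?|\"(.*)\"|(.*))"

m <- regexpr(p, x, perl = TRUE)

# With perl = TRUE, regexpr() reports per-group positions as matrices
# (one row per input string, one column per capturing group).
st  <- attr(m, "capture.start")[, 1]
len <- attr(m, "capture.length")[, 1]

# Elements without a match have start -1; map those to NA.
titles <- ifelse(st > 0, substring(x, st, st + len - 1), NA_character_)
print(titles)
```

Unlike the sub approach, this leaves non-matching elements as NA instead of turning them into empty strings, which may or may not be what is wanted.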

The most convenient way to use a branch reset here is via sub:

> p <- "(?s).*DATE \\d{2}/\\d{2}/\\d{4}(?|\"(.*)\".*|(.*))|.+"
> x <- c("LAW Nº 1234/1998 - DATE 01/01/1998\"LAW TITLE HERE MAY CONTAIN 4|_|D4NUM3R!C OR P_NC7U@7|()N\".","LAW Nº 1234/1998 - DATE 01/01/1998LAW TITLE HERE MAY CONTAIN 4|_|D4NUM3R!C OR P_NC7U@7|()N", "Something that does not match")
> sub(p, "\\1", x, perl=TRUE)
[1] "LAW TITLE HERE MAY CONTAIN 4|_|D4NUM3R!C OR P_NC7U@7|()N"
[2] "LAW TITLE HERE MAY CONTAIN 4|_|D4NUM3R!C OR P_NC7U@7|()N"
[3] "" 


The first top-level alternative matches a string that contains the pattern of interest and captures the title into Group 1. The second alternative (.+) matches any other non-empty string in full, so that replacing the match with \1 (empty in that case) removes the string entirely.
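To see what the trailing |.+ alternative contributes, the following sketch compares the pattern with and without it. The shortened input strings and the variable names are illustrative assumptions, not taken from the original data.

```r
x <- c(
  "LAW 1234/1998 - DATE 01/01/1998TITLE",
  "Something that does not match"
)

with_fallback    <- "(?s).*DATE \\d{2}/\\d{2}/\\d{4}(?|\"(.*)\".*|(.*))|.+"
without_fallback <- "(?s).*DATE \\d{2}/\\d{2}/\\d{4}(?|\"(.*)\".*|(.*))"

# With the fallback, a non-matching string is consumed by .+ and replaced
# by the (unset, hence empty) Group 1.
res_with <- sub(with_fallback, "\\1", x, perl = TRUE)

# Without it, sub() finds nothing to replace and returns the input unchanged.
res_without <- sub(without_fallback, "\\1", x, perl = TRUE)

print(res_with)
print(res_without)
```

Whether an empty string or the untouched input is preferable for non-matching elements depends on how the result is processed afterwards.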

Pattern details:

  • (?s) - enable . to match line break characters
    • .* - matches any 0+ chars, as many as possible, up to the last occurrence of the subpattern that follows
    • DATE \d{2}/\d{2}/\d{4} - DATE followed by a space, 2 digits, /, 2 digits, /, 4 digits
    • (?|"(.*)".*|(.*)) - branch reset group matching either
      • "(.*)".* - ", any 0+ chars as many as possible (Group 1), ", then any 0+ chars (the rest of the string)
      • | - or
      • (.*) - Group 1 capturing any 0+ chars
  • | - or
    • .+ - just match a non-empty string, any 1 or more chars.
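The greedy behaviour described in the breakdown can be checked with a small sketch. The two input strings are made-up edge cases (a title with inner quotes, and a string with two DATE stamps), not data from the question.

```r
p <- "(?s).*DATE \\d{2}/\\d{2}/\\d{4}(?|\"(.*)\".*|(.*))|.+"

# A quoted title that itself contains double quotes: the greedy (.*) runs
# up to the LAST double quote, so the inner quotes stay in Group 1.
inner <- "LAW 1/1998 - DATE 01/01/1998\"A \"B\" C\"."
res_inner <- sub(p, "\\1", inner, perl = TRUE)

# Two DATE stamps: the leading greedy .* skips to the LAST one, so only
# the text after the second date is captured.
twice <- "DATE 01/01/1998X DATE 02/02/1999Y"
res_twice <- sub(p, "\\1", twice, perl = TRUE)

print(res_inner)  # A "B" C
print(res_twice)  # Y
```

If the first date or the first closing quote should win instead, lazy quantifiers (.*?) would be the usual adjustment, at the cost of different behaviour for titles with inner quotes.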