Jason Morgan Jason Morgan - 2 months ago 6
SQL Question

Removing SQL Comments with Regular Expressions in R, while Preserving Special Tags

I need to remove comments beginning with

--
from strings of SQL statements in R. I am trying to do this with with regular expressions and
gsub
(though I am open to other suggestions). The complication is that the strings can contain special tags that begin with
--->>>
and end with
<<<---
, which I need to preserve for further processing.

Using lookaheads and lookbehinds, I have made some progress for those comments/tags that start at the beginning of the line:

> re <- "^(?!=<<<-){1}--(?!->>>){1}.*$"
>
> gsub(re, "", "-- test", perl=TRUE) # should be ""
[1] ""
> gsub(re, "", "--->>> test", perl=TRUE) # should be "--->>> test"
[1] "--->>> test"
> gsub(re, "", "<<<--- test", perl=TRUE) # should be "<<<--- test"
[1] "<<<--- test"
> gsub(re, "", "--->>>->>> test", perl=TRUE) # should be --->>>->>> test
[1] "--->>>->>> test"
> gsub(re, "", "---->>> test", perl=TRUE) # should be ""
[1] ""
> gsub(re, "", "test --->>> test", perl=TRUE) # should be "test --->>> test"
[1] "test --->>> test"
> gsub(re, "", "test --->>> test <<<---", perl=TRUE) # should be "test --->>> test <<<---"
[1] "test --->>> test <<<---"


But this doesn't, obviously, work with comments elsewhere in the string:

> gsub(re, "", "test1 -- test", perl=TRUE) # should be "test1"
[1] "test1 -- test" # WRONG


Removing the
^
at the beginning of the regular expression breaks most of test cases:

> re <- "(?!=<<<-){1}--(?!->>>){1}.*$"
> gsub(re, "", "-- test", perl=TRUE) # should be ""
[1] ""
> gsub(re, "", "test1 -- test", perl=TRUE) # should be "test1"
[1] "test1 "
> gsub(re, "", "--->>> test", perl=TRUE) # should be "--->>> test"
[1] "-" # WRONG
> gsub(re, "", "<<<--- test", perl=TRUE) # should be "<<<--- test"
[1] "<<<" # WRONG
> gsub(re, "", "--->>>->>> test", perl=TRUE) # should be --->>>->>> test
[1] "-" # WRONG
> gsub(re, "", "---->>> test", perl=TRUE) # should be ""
[1] ""
> gsub(re, "", "test --->>> test", perl=TRUE) # should be "test --->>> test"
[1] "test -" # WRONG
> gsub(re, "", "test --->>> test <<<---", perl=TRUE) # should be "test --->>> test <<<---"
[1] "test -" # WRONG


Anyone have suggestions how I can do this? I am open to any suggestions, but am constrained to R and have to keep the special tags,
--->>>
and
<<<---
.

Edit

As mentioned in the comments, this is also a test case:

> gsub(re, "", "-->>> test", perl=TRUE) # should be ""

Answer

I am posting the expression I shared in the comments since it turned out helpful. The idea behind it is that we can match specific substrings and then discard them from the match, and only match and keep something we want to remove using the part of the pattern after (*SKIP)(*FAIL) verbs.

Using

(?:(?<!-)--->>>|<<<---(?!-))(*SKIP)(*FAIL)|--.*

we match the following:

  • (?:(?<!-)--->>>|<<<---(?!-))(*SKIP)(*FAIL) - Either of the 2 sequences:
    • (?<!-)--->>> - --->>> not preceded with -
    • | - or
    • <<<---(?!-) - <<<--- not followed with -
    • (*SKIP)(*FAIL) - discard what was matched so far and proceed to the next match
  • | - or
  • --.* - 2 hyphens followed with 0+ chars other than a newline

See the regex demo at regex101.com.

Note that a dedicated parser would work much safer.

Comments