Nick Gammon Nick Gammon - 2 months ago 16
PHP Question

Catastrophic backtracking when using PCRE regexp to validate SQL in PHP

I'm trying to validate some SQL in the general form:

UPDATE `mytable` SET `keyname` = 'keyvalue',
`a` = 'somestring',
`b` = 123,
`c` = NULL
WHERE `keyname` = 'keyvalue'


There are more fields than this. The values will be strings, integers or NULL.

My original regexp is this:

(?ix)

^

\s*
UPDATE \s+ `mytable` \s+
SET \s+ `keyname` \s = \s 'keyvalue'
(, \s+
`[A-Z_]+` (?# field name)
\s+ = \s+ (?# equals value)
(
-?[0-9]+ (?# an integer, possibly negative)
|
'(\\.|''|[^'])*' (?# a string in single quotes)
|
NULL (?# NULL)
)
)+ (?# one or more such assignments)

\s+ WHERE \s+ `keyname` \s+ = \s+ 'keyvalue'

$


This works up to a point. According to https://regex101.com/ it matches in 180 steps.

Unfortunately the real SQL is lengthier than that, for example:

UPDATE `mytable`
SET `keyname` = 'keyvalue',
`Markup` =
'Lorem ipsum dolor sit amet, consectetur adipiscing elit.
''Quisque vel mattis odio, quis iaculis sem.''
Nulla facilisi.
Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere
cubilia Curae; Fusce ut dui venenatis, maximus lorem eget, ornare ex.
Aenean tempus pulvinar est, id fringilla enim sagittis id. Mauris finibus
cursus commodo.\r\n\r\n
Vestibulum ante ipsum primis in faucibus orci luctus et ultrices posuere
cubilia Curae; Fusce ut dui venenatis, maximus lorem eget, ornare ex.
Aenean tempus pulvinar est, id fringilla enim sagittis id. Mauris finibus
cursus commodo.\r\n\r\n\r\n
\'Aenean in augue a est vulputate accumsan.\'
Phasellus nulla diam, laoreet a elit non, mattis finibus magna. Phasellus
faucibus iaculis mi sed pulvinar.\r\n
Aliquam non nisl ultricies, aliquam augue vitae, efficitur sapien.
Etiam viverra, magna a laoreet sollicitudin, ipsum erat tincidunt sem, nec
faucibus enim tortor eget massa.
Nunc nisi orci, lacinia vitae dictum et, vestibulum sed metus. ',
`From_Date` = NULL,
`To_Date` = NULL,
`Foo` = '',
`Box_Colour` = NULL,
`Modification_Date` = '2016-09-08 12:30:47',
`Modified_User` = 1,
`Modified_IP` = '192.168.1.1'
WHERE `keyname` = 'keyvalue'


That now takes 4301 steps. In fact if you make the Lorem Ipsum larger we reach over 20000 steps.

Also if we introduce an error (to make it not match), for example changing:

`Foo` = '',


to

`Foo` = ''


It now crashes with Catastrophic backtracking.

I can get rid of the Catastrophic backtracking (to an extent) by making the inner group (the key/value pairs) an atomic group. That is, change:

SET \s+ `keyname` \s = \s 'keyvalue'
(, \s+


to

SET \s+ `keyname` \s = \s 'keyvalue'
(?>, \s+


The 20000+ steps on real data are causing the PHP script to crash when run on my target web server. I need to get the steps down to some more realistic value. I can't quite see why there is so much backtracking when it looks like the regexp is reasonably explicit. Fiddling with possessive quantifiers, or atomic groups seems to either do nothing, or cause either the "pass" or "fail" SQL to not match correctly.

sln sln
Answer

edit: For the string sub-expression:
Using the unrolled loop version by @NikiC,
replacing \s with \s* where needed,
and adding an additional atomic group,
it gets down to a reasonable amount of steps.

https://regex101.com/r/yV5xI7/3

 (?si)

 ^ 

 \s* 
 UPDATE \s+ `mytable` \s+ 
 SET \s+ `keyname` \s* = \s* 'keyvalue'

 (?# one or more such key = value  )
 (?>
      \s* , \s* 

      (?# field name )
      ` [A-Z_]+ `  

      (?# equals )
      \s* = \s*  

      (?# value )
      (?>
           (?# an integer, possibly negative )
           -? [0-9]+         
        |  
           (?# or, a string )
           '
           [^'\\]* 
           (?:
                (?: \\ . | '' )
                [^'\\]* 
           )*
           '

           # '
           # (?: [^'\\] | '' | \\ . )*
           # ' 

        |  
           (?# or, literal NULL )
           NULL             
      )
 )+

 \s+ WHERE \s+ `keyname` \s* = \s* 'keyvalue'

 $

You could try to wrap the quantified core in an atomic group.
Also, you probably need the dot-all modifier for any reference to dot ..

And, you should exclude the escape if you are matching it in one of the
alternations in your string sub-expression.
Engines have been known to take the path that matches.
In this case, [^'] also matches the escape and it looks like
the escape should not be allowed to exist by itself.
And, there may actually be a valid escape + newline sequence, hence the (?s).
I.e. use (?: \\ . | '' | [^'\\] )*.

Putting it all together, here is one for ideone

/
     (?si)
     ^ \s* UPDATE \s+ `mytable` \s+ SET \s+ `keyname` \s = \s 'keyvalue'
     (?>
          , \s+ ` [A-Z_]+ `
          (?# field name )
          \s+ = \s+ 
          (?# equals value )
          (?:
               -? [0-9]+ 
               (?# an integer, possibly negative )
            |  '
               (?: \\ . | '' | [^'\\] )*
               '
               (?# a string in single quotes )
            |  NULL
               (?# NULL )
          )
     )+
     (?# one or more such assignments )
     \s+ WHERE \s+ `keyname` \s+ = \s+ 'keyvalue' $
/x

And one for php

'/
     (?si)
     ^ \s* UPDATE \s+ `mytable` \s+ SET \s+ `keyname` \s = \s \'keyvalue\'
     (?>
          , \s+ ` [A-Z_]+ `
          (?# field name )
          \s+ = \s+ 
          (?# equals value )
          (?:
               -? [0-9]+ 
               (?# an integer, possibly negative )
            |  \'
               (?: \\\ . | \'\' | [^\'\\\] )*
               \'
               (?# a string in single quotes )
            |  NULL
               (?# NULL )
          )
     )+
     (?# one or more such assignments )
     \s+ WHERE \s+ `keyname` \s+ = \s+ \'keyvalue\' $
/x'
Comments