mseifert - 1 month ago
Javascript Question

Javascript regex - why a terminating space is needed to match the whole string

I am parsing SQL WHERE clauses and I have the following JavaScript regex:

(?:(?:(between )(['"]?)(.*?)(\2)( and )(['"]?)(.*?)(\6)))


I am matching it against this string:

id BETWEEN 3 and 10


In order for this regex to work, I have to add \s or \s+ at the end of the regex and include a space at the end of the string being matched.

Can someone explain why matching this extra space is necessary to match the 10 part of the string (in capturing group 7)?

Note that this regex is extracted from a larger regex which is used to parse an SQL filter:

(\(*)([\w][\w\d.]*)\s*([<>!=]{1,2}|like|not like|is null|is not null|in\s*\()?\s*(?!and|or)(?:(?:(between )(['"]?)(.*?)(\5)( and )(['"]?)(.*?)(\9))|(?:(['"]?)(.*?)(\12)))\s*(\)*)\s+(?!'|")\s*(and|or)?\s*

Answer

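In (?:(?:(between )(['"]?)(.*?)(\2)( and )(['"]?)(.*?)(\6))), the 6th group, (['"]?), matches an empty string because 10 is not quoted, so the backreference \6 that follows also matches an empty string. As a result, .*? (the 7th group) is effectively the last part of the pattern that can consume text and, being a lazy pattern, it matches the least amount of characters it can match, that is, zero.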
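The behaviour described above can be reproduced with a short sketch. It assumes the i flag is set (the sample string uses upper-case BETWEEN while the pattern is lower-case); the variable names are only illustrative.

```javascript
// Assumption: the "i" flag is in use, because the sample text says BETWEEN.
const betweenRe = /(?:(?:(between )(['"]?)(.*?)(\2)( and )(['"]?)(.*?)(\6)))/i;

// Without anything after group 7, the lazy .*? stops at zero characters.
const unanchored = "id BETWEEN 3 and 10".match(betweenRe);
console.log(JSON.stringify(unanchored[0])); // "BETWEEN 3 and "
console.log(JSON.stringify(unanchored[7])); // ""

// With \s appended (and a trailing space in the input), group 7 has to
// expand until \s can match, so it ends up holding "10".
const withSpaceRe = /(?:(?:(between )(['"]?)(.*?)(\2)( and )(['"]?)(.*?)(\6)))\s/i;
const withSpace = "id BETWEEN 3 and 10 ".match(withSpaceRe);
console.log(JSON.stringify(withSpace[7])); // "10"
```

Group 3 captures 3 in both cases because the literal " and " after it forces the lazy pattern to expand; group 7 has no such subpattern after it unless \s is added.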

Consider a regex like /I have a .*?/ tried against the string I have a cat (see demo here). The regex finds I have a and then the .*? part, which matches any zero or more chars other than line break chars, as few as possible, matches the empty string right before cat. That is how lazy quantifiers work: rather than match eagerly, they first let the subsequent patterns try to match, and only when those fail does the lazy pattern "expand", i.e. consume one more char and try again. With nothing after it that can fail, a lazy pattern at the end of a regex matches the minimal amount of chars it needs to match: .+? will match only 1 char, and .*? will match 0.
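A minimal sketch of the I have a cat example follows. The last two patterns (a greedy .* and a $ anchor) are not part of the answer; they are included only for contrast.

```javascript
const sentence = "I have a cat";

// Lazy quantifiers at the very end of a pattern take the minimum they can.
const lazyStar = sentence.match(/I have a .*?/)[0];
const lazyPlus = sentence.match(/I have a .+?/)[0];
console.log(JSON.stringify(lazyStar)); // "I have a "
console.log(JSON.stringify(lazyPlus)); // "I have a c"

// For contrast (not from the answer): a greedy quantifier, or something
// after the lazy one that must still match, makes the match reach "cat".
const greedyStar = sentence.match(/I have a .*/)[0];
const lazyAnchored = sentence.match(/I have a .*?$/)[0];
console.log(JSON.stringify(greedyStar));   // "I have a cat"
console.log(JSON.stringify(lazyAnchored)); // "I have a cat"
```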

See Greedy vs. Reluctant vs. Possessive Quantifiers for more information on how lazy quantifiers work.

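As a backreference to an empty string cannot act as a boundary, you will need to use alternation: capture the " or ' delimited substrings into one capturing group, and a sequence of non-whitespace chars into another. The quote group then becomes (['"]) without the ?, so it only participates when a quote is actually present.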

Besides, the \s+ near the end of the pattern needs to be changed to \s* so that the string is not required to end with whitespace.

(\(*)(\w[\w.]*)\s*([<>!=]{1,2}|like|not like|is null|is not null|in\s*\()?\s*(?!and|or)(?:(?:(between )(?:(['"])(.*?)(\5)|(\S+))( and )(?:(['"])(.*?)(\10)|(\S+)))|(?:(['"])(.*?)(\14)|(\S+)))\s*(\)*)\s*(?!'|")\s*(and|or)?\s*

See this regex demo
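As a rough check of the alternation-based pattern, the sketch below runs it against the string from the question and against a quoted variant. The i flag, the helper name betweenBounds and the quoted sample string are assumptions made for illustration; only the regex itself comes from the answer.

```javascript
// The pattern from the answer, copied verbatim; the "i" flag is an assumption.
const filterRe = /(\(*)(\w[\w.]*)\s*([<>!=]{1,2}|like|not like|is null|is not null|in\s*\()?\s*(?!and|or)(?:(?:(between )(?:(['"])(.*?)(\5)|(\S+))( and )(?:(['"])(.*?)(\10)|(\S+)))|(?:(['"])(.*?)(\14)|(\S+)))\s*(\)*)\s*(?!'|")\s*(and|or)?\s*/i;

// Hypothetical helper: returns the two BETWEEN bounds, or null if the
// clause is not a BETWEEN clause. Quoted values land in groups 6 and 11,
// unquoted values in groups 8 and 13.
function betweenBounds(clause) {
  const m = clause.match(filterRe);
  if (!m || !m[4]) return null;
  return { low: m[6] ?? m[8], high: m[11] ?? m[13] };
}

const plain = betweenBounds("id BETWEEN 3 and 10");
const quoted = betweenBounds("d between '2020-01-01' and '2020-12-31'");
const notBetween = betweenBounds("id = 3");

console.log(plain);      // { low: '3', high: '10' }
console.log(quoted);     // { low: '2020-01-01', high: '2020-12-31' }
console.log(notBetween); // null
```

No trailing space is needed in the input any more: the unquoted bound is captured by the greedy \S+, and the quoted bound is delimited by a backreference to a quote that was actually matched.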