Ahmad Hajjar - 3 months ago
PHP Question

RegEx BackReference to Match Different Values

I have a regex that I use to match expressions of the form

(val1 operator val2)


The regex looks like this:

(\(\s*([a-zA-Z]+[0-9]*|[0-9]+|\'.*\'|\[.*\])\s*(ni|in|\*|\/|\+|\-|==|!=|>|>=|<|<=)\s*([a-zA-Z]+[0-9]*|[0-9]+|\'.*\'|\[.*\])\s*\))


This works well and matches what I want, as you can see in this demo.

BUT :D (here comes the butter)

I want to optimise the regex itself by making it more readable and compact. I searched for how to do that and found something called a back-reference, which lets you name your capturing groups and reference them later, like so:

(\(\s*(?P<Val>[a-zA-Z]+[0-9]*|[0-9]+|\'.*\'|\[.*\])\s*(ni|in|\*|\/|\+|\-|==|!=|>|>=|<|<=)\s*(\g{Val})\s*\))


where I named the group that captures the left side of the expression Val and later referenced it as (\g{Val}). Now the problem is that this expression, as you can see here, only matches the case where the left side of the expression is exactly the same as the right side, e.g. (a==a) or (1==1), and does not match expressions such as (a==b)!

Now the question is: is there a way to reference the pattern instead of the matched value?!

Answer

Note that \g{N} is equivalent to \N: it is a backreference that matches the same text that the corresponding capturing group matched, not its pattern. The \g syntax is a bit more flexible, though, since you can refer to a capture group relative to the current position by putting - before the number (e.g. with \g{-2}, (\p{L})(\d)\g{-2} will match a1a).
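To make the "same value, not the pattern" point concrete, here is a minimal PHP sketch of the relative backreference from the example above. The variable names are only illustrative, and the anchors ^ and $ are added here so that the whole string has to match.

```php
<?php
// \g{-2} points two groups back from the current position, i.e. at
// group 1 here, and must repeat the exact text that group captured.
$pattern = '/^(\p{L})(\d)\g{-2}$/u';

$sameLetter  = preg_match($pattern, 'a1a'); // the letter is repeated
$otherLetter = preg_match($pattern, 'a1b'); // "b" is a letter, but not the captured one

var_dump($sameLetter, $otherLetter); // int(1), int(0)
```

The second call fails even though b satisfies \p{L}, which is exactly the behaviour described in the question.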

The PCRE engine supports subroutine calls, which re-run (recurse into) a group's subpattern rather than repeating the text it matched. To repeat the pattern of Group 1, use (?1); to repeat the pattern of the named group Val, use (?&Val).
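As a rough illustration of the difference, the sketch below compares a subroutine call with a backreference on a deliberately simplified operand pattern ([a-z]+|[0-9]+). That simplified pattern and the variable names are assumptions made for the example, not the regex from the question.

```php
<?php
// (?&Val) and (?1) re-run the group's pattern; \g{Val} repeats its captured text.
$byName   = '/^(?P<Val>[a-z]+|[0-9]+)==(?&Val)$/';
$byNumber = '/^([a-z]+|[0-9]+)==(?1)$/';
$backref  = '/^(?P<Val>[a-z]+|[0-9]+)==\g{Val}$/';

$results = [
    'subroutine by name, a==b'   => preg_match($byName, 'a==b'),
    'subroutine by number, a==7' => preg_match($byNumber, 'a==7'),
    'backreference, a==b'        => preg_match($backref, 'a==b'),
    'backreference, a==a'        => preg_match($backref, 'a==a'),
];

print_r($results); // 1, 1, 0, 1 in that order
```

Note that the subroutine call may even pick a different alternative than the original group did, as in a==7.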

Also, you can use character classes to match single-character alternatives, and consider using the ? quantifier to make parts of the regex optional (e.g. [><]=? covers >, >=, < and <=):

(\(\s*(?P<Val>[a-zA-Z]+[0-9]*|[0-9]+|\'.*\'|\[.*\])\s*(ni|in|[*\/+-]|[=!><]=|[><])\s*((?&Val))\s*\))

See the regex demo
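For readers without access to the demo, here is a minimal sketch of how the pattern above could be used from PHP. parseExpression() is a hypothetical helper name, and the ~ delimiter and the sample inputs are choices made for this example.

```php
<?php
// The pattern from the answer, kept verbatim in a nowdoc so that PHP
// does not touch any of the backslashes. "~" is the regex delimiter.
$pattern = <<<'REGEX'
~(\(\s*(?P<Val>[a-zA-Z]+[0-9]*|[0-9]+|\'.*\'|\[.*\])\s*(ni|in|[*\/+-]|[=!><]=|[><])\s*((?&Val))\s*\))~
REGEX;

// parseExpression() is a hypothetical helper, not part of any library.
function parseExpression(string $pattern, string $input): ?array
{
    if (preg_match($pattern, $input, $m) !== 1) {
        return null;
    }
    // Group "Val" is the left operand, group 3 the operator, group 4 the right operand.
    return [$m['Val'], $m[3], $m[4]];
}

foreach (['(a==b)', '(x1 >= 10)', "('foo' in [1,2,3])", '(a = b)'] as $input) {
    $parts = parseExpression($pattern, $input);
    echo $input, ' => ', $parts === null ? 'no match' : implode(' | ', $parts), PHP_EOL;
}
```

The last input is rejected because a single = is not among the listed operators.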

Note that \'.*\' and \[.*\] can match too much; consider replacing them with \'[^\']*\' and \[[^][]*\].
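The over-matching is easiest to see on input that contains more than one quoted or bracketed value. The sample strings below are made up for this sketch.

```php
<?php
$quoted    = "('a' == 'b') and ('c' == 'd')";
$bracketed = '[1] + [2]';

// Greedy .* runs to the last closing delimiter in the subject.
preg_match("~'.*'~", $quoted, $greedyQuote);
preg_match('~\[.*\]~', $bracketed, $greedyBracket);

// A negated character class stops at the first closing delimiter.
preg_match("~'[^']*'~", $quoted, $tightQuote);
preg_match('~\[[^][]*\]~', $bracketed, $tightBracket);

echo $greedyQuote[0], PHP_EOL;   // 'a' == 'b') and ('c' == 'd'
echo $tightQuote[0], PHP_EOL;    // 'a'
echo $greedyBracket[0], PHP_EOL; // [1] + [2]
echo $tightBracket[0], PHP_EOL;  // [1]
```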