Asenar - 2 months ago
PHP Question

preg_match a PHP string with single or double quotes escaped inside

I want to parse some PHP files containing something like this:

// form 1
__('some string');
// form 2
__('an other string I\'ve written with a quote');
// form 3
__('an other one
multiline');
// form 4
__("And I want to handle double quotes too !");
// form 5
__("And I want to handle double quotes too !", $second_parameter_may_happens);


The following regex matches every form except the second one:

preg_match_all('#__\((\'|")(.*)\1(?:,.*){0,1}\)#smU', $file_content);

Answer

You can use this pattern:

$pattern = '~__\((["\'])(?<param1>(?>[^"\'\\\]++|\\\.|(?!\1)["\'])*)\1(?:,\s*(?<param2>\$[a-z0-9_-]+))?\);~si';

if (preg_match_all($pattern, $data, $matches, PREG_SET_ORDER))
    print_r($matches);

But as Jon points out, this kind of pattern may be difficult to maintain. That is why I suggest changing the pattern to this:

$pattern = <<<'LOD'
~
## definitions
(?(DEFINE)
    (?<sqc>        # content between single quotes
        (?> [^'\\]+  | \\. )*
    )
    (?<dqc>        # content between double quotes
        (?> [^"\\]+  | \\. )*
    )
    (?<var>        # variable
        \$ [a-zA-Z0-9_-]+
    )
)

## main pattern
__\(
(?| " (?<param1> \g<dqc> ) " | ' (?<param1> \g<sqc> ) ' )

(?:, \s* (?<param2> \g<var> ) )?
\);
~xs
LOD;

This simple change makes your pattern more readable and editable.
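
As a quick check, the readable pattern can be run against the five forms from the question. This is only a minimal sketch: the sample input in `$data`, the `PHP_SAMPLE` nowdoc label, and the `$found` array are illustrative names, and the comments inside the pattern have been shortened.

```php
<?php
// Sample input: the five forms from the question. A nowdoc is used so that
// backslashes and quotes reach the regex engine exactly as written.
$data = <<<'PHP_SAMPLE'
__('some string');
__('an other string I\'ve written with a quote');
__('an other one
multiline');
__("And I want to handle double quotes too !");
__("And I want to handle double quotes too !", $second_parameter_may_happens);
PHP_SAMPLE;

// Same pattern as above, with shorter comments.
$pattern = <<<'LOD'
~
(?(DEFINE)
    (?<sqc> (?> [^'\\]+ | \\. )* )   # content between single quotes
    (?<dqc> (?> [^"\\]+ | \\. )* )   # content between double quotes
    (?<var> \$ [a-zA-Z0-9_-]+ )      # variable
)
__\(
(?| " (?<param1> \g<dqc> ) " | ' (?<param1> \g<sqc> ) ' )
(?:, \s* (?<param2> \g<var> ) )?
\);
~xs
LOD;

$count = preg_match_all($pattern, $data, $matches, PREG_SET_ORDER);

// Collect [param1, param2] pairs; param2 is normalised to '' when absent.
$found = [];
foreach ($matches as $m) {
    $found[] = [$m['param1'], $m['param2'] ?? ''];
}

printf("%d calls found\n", $count);
```

Note that `param1` keeps the backslashes exactly as they appear in the source file, so form 2 is captured with `\'` still in it.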

The subpatterns for the content between quotes are designed to deal with escaped quotes. The idea is to consume every backslash together with the character that follows it (which can be a backslash itself), so that escaped backslashes and escaped quotes are told apart correctly:

\'           # an escaped quote 
\\'          # an escaped backslash and a quote
\\\'         # an escaped backslash and an escaped quote
\\\\'        # two escaped backslashes and a quote
...

subpattern details:

(?>            # open an atomic group (inside which backtracking is forbidden)
    [^'\\]+    # all that is not a quote or a backslash
  |            # OR
    \\.        # an escaped character
)*             # repeat the group zero or more times
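
To see the subpattern pick the right closing quote, it can be tried on the four escape sequences listed earlier. This is a small sketch under assumptions: the sample lines, the `$re` and `$samples` variables, and the `RE`/`TXT` nowdoc labels are made up for illustration, and the subpattern is wrapped in literal single quotes.

```php
<?php
// The single-quote content subpattern, between two literal quotes.
$re = <<<'RE'
~'((?>[^'\\]+|\\.)*)'~s
RE;

// One sample per line; the text after the closing quote must not be matched.
$samples = <<<'TXT'
'a\'b' tail
'a\\' tail'
'a\\\'b' tail
'a\\\\' tail'
TXT;

$results = [];
foreach (preg_split('/\R/', $samples) as $line) {
    if (preg_match($re, $line, $m)) {
        $results[] = $m[0];   // the full quoted string that was matched
    }
}

print_r($results);
```

In each case the match stops at the first quote that is not preceded by an unpaired backslash, so ` tail` is left out of the result.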