greduan - 1 month ago
PHP Question

Matching all three kinds of PHP comments with REGEX

I'm new to REGEX and I need some help.

I need to match all three types of comments that PHP might have:

# Single line comment


// Single line comment


/* Multi-line comments */



/**
* And all of its possible variations
*/


Something I should mention: I am doing this so I can recognize whether a PHP closing tag (?>) is inside a comment or not. If it is, ignore it; if not, count it as a closing tag. This is going to be used inside an XML document to improve Sublime Text's recognition of the closing tag (because it's driving me nuts!). I tried for a couple of hours but couldn't get it working, so if you could adapt it to work in XML I would appreciate it. :)

So if you could also include the if-then-else logic I would really appreciate it. BTW, I really need it to be a pure REGEX, no language features or anything. :)

Like Eicon reminded me, I need all of them to be able to match at the start of the line, or at the end of a piece of code, so I also need the following with all of them:

<?php
echo 'something'; # this is a comment
?>


Any help would be appreciated. :)

Answer

Parsing a programming language seems too much for regexes to do. You should probably look for a PHP parser.

But these would be the regexes you are looking for. I assume for all of them that you use the DOTALL or SINGLELINE option (although the first two would work without it as well):

~#[^\r\n]*~
~//[^\r\n]*~
~/\*.*?\*/~s
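
To see the three patterns in action, here is a minimal sketch in Python (my choice for illustration, not something the question asked for). The delimiters are dropped, `re.DOTALL` stands in for the trailing `s`, and the `SAMPLE` text and variable names are made up for this example:

```python
import re

# The three patterns from above, without the ~ delimiters.
# re.DOTALL plays the role of the trailing "s" modifier on the third one.
HASH_COMMENT = re.compile(r"#[^\r\n]*")
SLASH_COMMENT = re.compile(r"//[^\r\n]*")
BLOCK_COMMENT = re.compile(r"/\*.*?\*/", re.DOTALL)

# Hypothetical input built from the comment styles in the question.
SAMPLE = """<?php
echo 'something'; # this is a comment
// Single line comment
/**
 * A multi-line
 * comment
 */
?>"""

hash_matches = HASH_COMMENT.findall(SAMPLE)
slash_matches = SLASH_COMMENT.findall(SAMPLE)
block_matches = BLOCK_COMMENT.findall(SAMPLE)

print(hash_matches)
print(slash_matches)
print(block_matches)
```

Each list should hold exactly one comment, and the block match spans several lines because the dot is allowed to cross line breaks.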

Note that any of these will cause problems if the comment-delimiting characters appear in a string or anywhere else where they do not actually open a comment.
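
A quick sketch of that failure mode, again in Python and with invented input lines: the patterns have no idea what a string literal is, so they start matching inside one.

```python
import re

SLASH_COMMENT = re.compile(r"//[^\r\n]*")
HASH_COMMENT = re.compile(r"#[^\r\n]*")

# Hypothetical PHP lines where the delimiters sit inside string literals.
url_line = "$url = 'http://example.com'; // the real comment"
color_line = "$color = '#fff';"

# The first "//" is the one inside the URL, so the match starts there.
slash_hit = SLASH_COMMENT.search(url_line).group(0)
# There is no comment on this line at all, yet the pattern still matches.
hash_hit = HASH_COMMENT.search(color_line).group(0)

print(slash_hit)
print(hash_hit)
```

Both results are false positives, which is the main reason a real parser is the safer route.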

You can also combine all of these into one regex:

~(?:#|//)[^\r\n]*|/\*.*?\*/~s
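
As a rough illustration, the combined pattern can be run once over a snippet that contains all three styles. This is a Python sketch with the delimiters removed; the `SAMPLE` string is invented for the example:

```python
import re

# Combined pattern from above; re.DOTALL replaces the trailing "s" modifier.
COMBINED = re.compile(r"(?:#|//)[^\r\n]*|/\*.*?\*/", re.DOTALL)

# Hypothetical snippet containing all three comment styles.
SAMPLE = "<?php\n$a = 1; # one\n$b = 2; // two\n/* three\n   lines */\n?>"

# The pattern has no capturing groups, so findall returns the full matches
# in the order they appear in the text.
comments = COMBINED.findall(SAMPLE)

for comment in comments:
    print(repr(comment))
```

One pass finds every comment in source order, which is usually more convenient than running three separate searches and merging the results.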

If you use some tool or language that does not require delimiters (like Java or C#), remove the ~ delimiters. In this case you will also have to apply the DOTALL option differently. But without knowing where you are going to use this, I cannot tell you how.

If you cannot/do not want to set the DOTALL option, this would be equivalent (I also left out the delimiters to give an example):

(?:#|//)[^\r\n]*|/\*[\s\S]*?\*/
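
Since the target tool is unknown, here is just one example of how this plays out in a delimiter-free environment. The sketch assumes Python's `re` module, where DOTALL can be passed as a flag or written inline as `(?s)`; the `SAMPLE` string and the names are hypothetical:

```python
import re

PATTERN = r"(?:#|//)[^\r\n]*|/\*.*?\*/"

WITH_FLAG = re.compile(PATTERN, re.DOTALL)      # DOTALL as a compile flag
WITH_INLINE = re.compile("(?s)" + PATTERN)      # DOTALL as an inline modifier
WITHOUT_DOTALL = re.compile(r"(?:#|//)[^\r\n]*|/\*[\s\S]*?\*/")  # [\s\S] variant
NO_FLAG = re.compile(PATTERN)                   # dot cannot cross newlines

# Hypothetical input with a block comment that spans two lines.
SAMPLE = "x(); /* first\nsecond */ y(); # tail"

flag_result = WITH_FLAG.findall(SAMPLE)
inline_result = WITH_INLINE.findall(SAMPLE)
class_result = WITHOUT_DOTALL.findall(SAMPLE)
no_flag_result = NO_FLAG.findall(SAMPLE)

print(flag_result)
print(no_flag_result)
```

The first three variants should agree with each other, while the last one misses the multi-line block comment because `.` stops at the line break.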

See here for a working demo.

Now if you also want to capture the contents of the comments in a group, then you could do this:

(?|(?:#|//)([^\r\n]*)|/\*([\s\S]*?)\*/)

Regardless of the type of comment, the comment's content (without the syntax delimiters) will be found in capture group 1.
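
The `(?|...)` branch-reset group is a PCRE feature and is not available everywhere; as far as I know, Python's standard `re` module does not support it. A possible workaround, sketched here with a hypothetical helper `comment_contents`, is to use two ordinary groups and take whichever one took part in the match:

```python
import re

# Same alternation as above, but with plain numbered groups:
# group 1 holds single-line content, group 2 holds block content.
CAPTURING = re.compile(r"(?:#|//)([^\r\n]*)|/\*([\s\S]*?)\*/")


def comment_contents(source):
    """Return the text of each comment without its delimiters."""
    contents = []
    for match in CAPTURING.finditer(source):
        # Exactly one of the two groups participates in any given match;
        # the other one is None.
        if match.group(1) is not None:
            contents.append(match.group(1))
        else:
            contents.append(match.group(2))
    return contents


# Hypothetical input for illustration.
found = comment_contents("a(); # one\nb(); // two\n/* three */")
print(found)
```

Note that the surrounding whitespace is kept, so the contents come back as `" one"`, `" two"` and `" three "`; strip them if that is not wanted.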

Another working demo.
