hrsetyono hrsetyono - 11 months ago 79
PHP Question

PHP Regex - Remove a specific string after certain pattern

I'm making column shortcodes in WordPress and it always adds </p> after the tag.

So the raw HTML result from dumping the variable looks like this:

<column class="size-5"></p>

I want to delete that lone </p> with regex, so I made this:

$content = preg_replace("/(?!<column[^<]+)<\/p>/", '', $content);

I matched </p> while excluding the column tag. Here's the Regexr link.

In Regexr (which I assume uses JS syntax), it works perfectly. But in PHP, it matches every single </p> and removes it.

I have tried many variations of lookbehind, but none of them work.

Has anyone experienced this same problem before?


Answer Source

First of all, you should know that manipulating HTML with regex is fragile and may not work in 100% of cases with arbitrary HTML code. You should only use it when you know what you are doing (you generate the HTML yourself in a consistent way, or the HTML provider is known and uses a consistent approach to HTML escaping, etc.).

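Next, you do not need a negative lookahead here. A lookahead is checked at the current position, so (?!<column[^<]+)<\/p> means "match </p> if the text starting at this very position does not match <column[^<]+". Text that starts with </p> can never start with <column, so the check always succeeds and you effectively match every </p>.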
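To see this behavior, here is a minimal sketch that runs the pattern from the question against a made-up input string (the sample HTML is an assumption for illustration, not taken from the question):

```php
<?php
// Hypothetical sample input: one stray </p> after the column tag,
// plus an ordinary paragraph that should be left alone.
$content = '<column class="size-5"></p><p>Some text</p>';

// Pattern from the question. The lookahead is tested at the position
// where "</p>" begins, so it can never see "<column" there.
$result = preg_replace("/(?!<column[^<]+)<\/p>/", '', $content);

// Both closing tags are removed, not just the stray one.
echo $result, PHP_EOL; // <column class="size-5"><p>Some text
```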

If you want to remove text that appears in a specific, known context, you can capture the part you need to keep and simply match the part you want to remove. Enclose the part of the pattern you need to keep in (...) and use a backreference to that group in the replacement pattern:


$content = preg_replace('/(<column\b[^<]*>)<\/p>/', '$1', $content);
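As a quick check of the capturing approach, the sketch below applies that call to a made-up input (the sample HTML is an assumption, not from the question):

```php
<?php
// Hypothetical sample input: a stray </p> right after the column tag,
// followed by a normal paragraph.
$content = '<column class="size-5"></p><p>Some text</p>';

// Group 1 captures the opening column tag; the </p> that directly
// follows it is matched but not captured, so '$1' puts back the tag only.
$result = preg_replace('/(<column\b[^<]*>)<\/p>/', '$1', $content);

// Only the </p> directly after the column tag is gone.
echo $result, PHP_EOL; // <column class="size-5"><p>Some text</p>
```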

Alternatively, in PCRE, you may use the \K operator, which drops all text matched so far from the reported match, like this:

$content = preg_replace('/<column\b[^<]*>\s*\K<\/p>/', '', $content);

This way, you won't have to use any backreferences in the replacement pattern.
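Here is a small sketch of the \K variant on a made-up input where a newline sits between the column tag and the stray </p> (the sample HTML is an assumption for illustration):

```php
<?php
// Hypothetical sample input: WordPress-style output with a newline
// between the column tag and the stray </p>.
$content = "<column class=\"size-5\">\n</p><p>Some text</p>";

// Everything before \K (the tag and any whitespace) must be present,
// but only the part after \K is replaced.
$result = preg_replace('/<column\b[^<]*>\s*\K<\/p>/', '', $content);

// The tag and the newline are kept; only the stray </p> is removed.
var_dump($result === "<column class=\"size-5\">\n<p>Some text</p>"); // bool(true)
```

Note that the whitespace matched by \s* stays in the output, because it is part of what \K discards from the match.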

I added \b (a word boundary) to make sure column is matched as a whole word. Since \b also matches between column and a hyphen, the pattern can still match column in a tag like <column-editor>, so you might want to replace <column\b[^<]*> with <column(?:\s[^<]*)?>.
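The difference between the two tag patterns can be sketched as follows. The <column-editor> tag and the variable names are hypothetical, chosen only to show the edge case:

```php
<?php
// Hypothetical input with a look-alike tag followed by a real column tag.
$content = '<column-editor></p><column class="size-5"></p>';

// \b is satisfied between "n" and "-", so <column-editor> also matches.
$loose = preg_replace('/<column\b[^<]*>\K<\/p>/', '', $content);

// Here the tag name must be followed by whitespace or by ">" directly.
$strict = preg_replace('/<column(?:\s[^<]*)?>\K<\/p>/', '', $content);

echo $loose, PHP_EOL;  // <column-editor><column class="size-5">
echo $strict, PHP_EOL; // <column-editor></p><column class="size-5">
```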