Igor Yavych Igor Yavych - 5 months ago 10
PHP Question

Complex RegEx replacement

I'm trying to write a highlighting feature. There are two types of highlighting: positive and negative. Positive is done first. The highlighting itself is very simple: it just wraps a keyword/phrase in a span with a specific class, which depends on the type of highlighting.

Problem:

Sometimes a negative phrase can contain a positive one.

Example:

Original text:


some data from blahblah test was not statistically valid


After text passes through positive highlighting "filter", it'll end up like this:

some data from <span class="positive">blahblah test</span> was not <span class="positive">statistically valid</span>


or

some data from <span class="positive">blahblah test</span> was not <span class="positive">statistically <span class="positive">valid</span></span>


Then, in the negative list, we have the phrase not statistically valid.

In both cases, the resulting text after passing through both "filters" should look like:

some data from <span class="positive">blahblah test</span> was <span class="negative">not statistically valid</span>


Conditions:

- The number of span tags and their location within a keyword/phrase from the negative "filter" list are unknown.

- A keyword/phrase must be matched even if it includes span tags (including right before and right after the keyword/phrase). These span tags have to be removed.

- If any span tags are detected, the number of opening and closing span tags removed has to be equal.

Questions:

- How can these span tags be detected, if there are any?

- Is this even possible with RegEx alone?

Answer

I don't think this can be done with a single regular expression, and if it can, honestly I'm too lazy to rack my brain building it.

How were you thinking of approaching it?

I came up with a solution that takes 4 steps to achieve what you want:

  1. Extract all words from the HTML and store them with their corresponding positions.
  2. Split each value in the negative list into words.
  3. Match those words, consecutively and in order, against the words extracted in step 1, and store the matches.
  4. Replace each match, by position, with its new HTML wrapper (<span class="negative">...</span>).

I have also made a detailed flowchart (I'm not good at flowcharts, sorry) to make things easier to follow. It may help to look at the code first.

[Image: flowchart of the steps]

Here is what we have:

$HTML = <<< HTML
some data from <span class="positive">blahblah test</span> was not <span class="positive">statistically <span class="positive">valid</span></span>
HTML;

$listOfNegatives = ['not statistically valid'];

To extract the real words (text outside of tags), I used a RegEx that is good enough for this step:

~\b(?<![</])\w+\b(?![^<>]+>)~

To get the position of each word as well, pass the PREG_OFFSET_CAPTURE flag to preg_match_all():
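As a quick illustration of what that pattern picks up, here is a minimal sketch run against a made-up sample string ($sample and $found are names chosen for this example, not part of the answer). Tag names and attribute values are skipped, and the offsets reported by PREG_OFFSET_CAPTURE are byte offsets into the input.

```php
<?php
// Hypothetical sample input, chosen only for illustration.
$sample = 'was <span class="positive">not</span> valid';

// Same pattern as above; single quotes keep the backslashes literal.
preg_match_all('~\b(?<![</])\w+\b(?![^<>]+>)~', $sample, $m, PREG_OFFSET_CAPTURE);

// Each match is a [text, byteOffset] pair; index the words by offset.
$found = [];
foreach ($m[0] as [$word, $offset]) {
    $found[$offset] = $word;
}

print_r($found);
// Array
// (
//     [0] => was
//     [27] => not
//     [38] => valid
// )
```

The words span, class and positive never show up, because each one is either preceded by < or / or followed by text that runs into a closing >.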

/**
 * Extract all words and their corresponding positions
 * @param  [string] $HTML
 * @return [array]  $HTMLWords
 */
function extractWords($HTML) {
    $HTMLWords = [];
    preg_match_all("~\b(?<![</])\w+\b(?![^<>]+>)~", $HTML, $words, PREG_OFFSET_CAPTURE);
    foreach ($words[0] as $word) {
        $HTMLWords[$word[1]] = $word[0];
    }
    return $HTMLWords;
}

This function's output is something like this:

Array
(
    [0] => some
    [5] => data
    [10] => from
    [38] => blahblah
    [47] => test
    [59] => was
    [63] => not
    [90] => statistically
    [127] => valid
)

What we need to do now is match the words of each list value, consecutively, against the words we just extracted. Our first list value, not statistically valid, has three words (not, statistically and valid), and these words must appear one after another in the extracted words array (which they do).

To handle this I wrote a function:

/**
 * Check if any of our defined list values can be found in an ordered array of extracted words
 * @param  [array] $HTMLWords
 * @param  [array] $listOfNegatives
 * @return [array] $subStrings
 */
function checkNegativesExistence($HTMLWords, $listOfNegatives) {
    $counter = 0;
    $subStrings = [];

    foreach ($listOfNegatives as $i => $string) {
        $stringWords = explode(" ", $string);
        $wordIndex = 0;
        // End offset of the preceding word; reset for every phrase
        $previousWordOffset = null;
        foreach ($HTMLWords as $offset => $HTMLWord) {
            // The previous iteration completed a phrase: start collecting a new match
            if ($wordIndex > count($stringWords) - 1) {
                $wordIndex = 0;
                $counter++;
            }

            // The sequence broke: drop the partial match and retry this word as a phrase start
            if ($wordIndex > 0 && $stringWords[$wordIndex] !== $HTMLWord) {
                unset($subStrings[$counter]);
                $wordIndex = 0;
            }

            if ($stringWords[$wordIndex] === $HTMLWord) {
                $subStrings[$counter][] = [$HTMLWord, $offset, $previousWordOffset];
                $wordIndex++;
            }
            $previousWordOffset = $offset + strlen($HTMLWord);
        }
        // Drop a partial match left over at the end of the text
        if ($wordIndex > 0 && $wordIndex < count($stringWords)) {
            unset($subStrings[$counter]);
        }
        $counter++;
    }

    return $subStrings;
}

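Which has an output like below (the third element of each entry is the end offset of the preceding word):

Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => not
                    [1] => 63
                    [2] => 62
                )

            [1] => Array
                (
                    [0] => statistically
                    [1] => 90
                    [2] => 66
                )

            [2] => Array
                (
                    [0] => valid
                    [1] => 127
                    [2] => 103
                )
        )
)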

As you can see, we have the complete phrase split into words, together with their offsets. We need them later.

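The next thing to do is replace this occurrence, from offset 63 to 127 + strlen('valid'), with <span class="negative">not statistically valid</span> and forget about everything else in that range. (The code actually starts at the end of the preceding word, so that tags directly before the phrase are covered as well.)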

/**
 * Substitute newly matched strings with negative HTML wrapper
 * @param  [array]  $subStrings
 * @param  [string] $HTML
 * @return [string] $HTML
 */
function negativeHighlight($subStrings, $HTML) {
    $offset = 0;
    $HTMLLength = strlen($HTML);

    foreach ($subStrings as $key => $value) {
        $first = reset($value);
        $last = end($value);
        // If an earlier replacement moved this phrase, shift its offsets by the net length change
        $shift = substr($HTML, $first[1], strlen($first[0])) === $first[0] ? 0 : $offset;
        // Start right after the preceding word so tags directly before the phrase are removed too
        $start = (int) $first[2] + $shift;
        $length = $last[1] + $shift + strlen($last[0]) - $start;
        // Keep the non-tag text (usually a space) between the preceding word and the phrase
        $gap = strip_tags(substr($HTML, $start, $first[1] + $shift - $start));
        $string = implode(" ", array_column($value, 0));
        $HTML = substr_replace($HTML, "{$gap}<span class=\"negative\">{$string}</span>", $start, $length);

        if ($HTMLLength > strlen($HTML)) {
            $offset = -($HTMLLength - strlen($HTML));
        } elseif ($HTMLLength < strlen($HTML)) {
            $offset = strlen($HTML) - $HTMLLength;
        }
    }

    return $HTML;
}

One important thing to note: the first substitution may shift the offsets of other extracted values (we have none here), so the change in HTML length has to be tracked:

if ($HTMLLength > strlen($HTML)) {
    $offset = -($HTMLLength - strlen($HTML));
} elseif ($HTMLLength < strlen($HTML)) {
    $offset = strlen($HTML) - $HTMLLength;
}
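A different way to sidestep this bookkeeping, shown here only as a sketch and not as what the answer's code does, is to apply the replacements from the highest offset to the lowest, so that text before each edit never moves. The helper name replaceRangesRightToLeft and the sample data are hypothetical, and the ranges are assumed not to overlap.

```php
<?php
// Hypothetical helper, not part of the original answer.
// $ranges: list of [start, length, replacement]; spans are assumed not to overlap.
function replaceRangesRightToLeft(string $text, array $ranges): string {
    // Highest start offset first, so each edit leaves all lower offsets untouched.
    usort($ranges, fn($a, $b) => $b[0] <=> $a[0]);
    foreach ($ranges as [$start, $length, $replacement]) {
        $text = substr_replace($text, $replacement, $start, $length);
    }
    return $text;
}

$result = replaceRangesRightToLeft('aa bb cc', [
    [0, 2, '<b>aa</b>'],
    [6, 2, '<b>cc</b>'],
]);
echo $result, PHP_EOL;
// prints: <b>aa</b> bb <b>cc</b>
```

With this ordering every offset can be used exactly as it was captured, at the cost of sorting the matches first.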

Putting it all together:

$newHTML = negativeHighlight(checkNegativesExistence(extractWords($HTML), $listOfNegatives), $HTML);

Output:

some data from <span class="positive">blahblah test</span> was <span class="negative">not statistically valid</span></span></span>

But there is a problem with this output: unmatched tags.

I'm sorry, I lied when I said this takes 4 steps; there is one more. I wrote another RegEx that matches all properly nested tags as well as the stray ones left behind:

~(<span[^>]+>([^<]*+<(?!/)(?:([a-zA-Z0-9]++)[^>]*>[^<]*</\3>|(?2)))*[^<]*</span>|(?'single'</[^>]+>|<[^>]+>))~

With preg_replace_callback() I replace only the tags in the group named single with nothing:

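// Note the doubled backslash in \\3: inside a double-quoted PHP string, "\3" would be read as an octal escape
echo preg_replace_callback("~(<span[^>]+>([^<]*+<(?!/)(?:([a-zA-Z0-9]++)[^>]*>[^<]*</\\3>|(?2)))*[^<]*</span>|(?'single'</[^>]+>|<[^>]+>))~",
    function ($match) {
        // Stray tag: drop it
        if (isset($match['single'])) {
            return '';
        }
        // Properly nested span: keep it unchanged
        return $match[1];
    },
    $newHTML
);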

and we get the right output:

some data from <span class="positive">blahblah test</span> was <span class="negative">not statistically valid</span>
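To sanity-check the cleanup, a small helper can count opening and closing span tags before and after it. This is only a sketch: spanBalance is a hypothetical name, and equal counts show that the totals agree, not that the tags are correctly nested.

```php
<?php
// Hypothetical helper, not part of the original answer.
function spanBalance(string $html): array {
    // preg_match_all() returns the number of full matches.
    $open  = preg_match_all('~<span\b[^>]*>~i', $html);
    $close = preg_match_all('~</span\s*>~i', $html);
    return ['open' => $open, 'close' => $close];
}

// Output of negativeHighlight() before the cleanup step.
$before = 'some data from <span class="positive">blahblah test</span> was <span class="negative">not statistically valid</span></span></span>';
// Output after the cleanup step.
$after = 'some data from <span class="positive">blahblah test</span> was <span class="negative">not statistically valid</span>';

$balanceBefore = spanBalance($before); // ['open' => 2, 'close' => 4]
$balanceAfter  = spanBalance($after);  // ['open' => 2, 'close' => 2]
```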

Failing cases

My solution does not produce the right HTML in the following situations:

1- If a word like <was> is between other words:

<span class="positive">blahblah test</span> <was> not

Why?

  • Because my last RegEx treats <was> as an unmatched tag and removes it.

2- If a word like not (which is part of a value in our negative list) is enclosed in <>, as in <not>. This outputs:

some data from <span class="positive">blahblah test</span> was <not> <span class="positive">statistically <span class="positive">valid</span></span>

Why?

  • Because my first RegEx only picks up words that are not between the tag delimiters <>.

3- If the list has values where one is a substring of another:

$listOfNegatives = ['not statistically valid', 'not statistically'];

Why?

  • Because they overlap.
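One possible way to deal with the third case, sketched here under the assumption that the longest phrase should win, is to filter the matched ranges before replacing anything. The helper dropOverlappingRanges is hypothetical and is not wired into the functions above; the offsets come from the example text, where not starts at 63, statistically ends at 103 and valid ends at 132.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Each range is [start, end) in byte offsets; longer ranges win over shorter ones.
function dropOverlappingRanges(array $ranges): array {
    // Longest range first.
    usort($ranges, fn($a, $b) => ($b[1] - $b[0]) <=> ($a[1] - $a[0]));
    $kept = [];
    foreach ($ranges as [$start, $end]) {
        foreach ($kept as [$keptStart, $keptEnd]) {
            if ($start < $keptEnd && $keptStart < $end) {
                continue 2; // overlaps a range that was already kept
            }
        }
        $kept[] = [$start, $end];
    }
    return $kept;
}

$kept = dropOverlappingRanges([
    [63, 103], // "not statistically"
    [63, 132], // "not statistically valid"
]);
// $kept is [[63, 132]]
```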

Working demo