Al-Jazary - 5 months ago
HTML Question

PHP preg_split on spaces, but not within tags

I am using

preg_split("/\"[^\"]*\"(*SKIP)(*F)|\x20/", $input_line);
and running it on phpliveregex.com.
It produces this array:

array(10
0=><b>test</b>
1=>or
2=><em>oh
3=>yeah</em>
4=>and
5=><i>
6=>oh
7=>yeah
8=></i>
9=>"ye we 'hold' it"
)


This is NOT what I want. It should split on spaces only outside HTML tags, like this:

array(6
0=><b>test</b>
1=>or
2=><em>oh yeah</em>
3=>and
4=><i>oh yeah</i>
5=>"ye we 'hold' it"
)


With this regex I can only add an exception for "double quotes", but I really need help adding more, such as these tags:
<img/><a></a><pre></pre><code></code><strong></strong><b></b><em></em><i></i>


Any explanation of how that regex works would also be appreciated.

Answer

It's easier to use DOMDocument, since you don't need to describe what an HTML tag is or what it looks like. You only need to check the nodeType. When the node is a text node, extract its parts with preg_match_all (that is handier than designing a pattern for preg_split):

$html = 'spaces in a text node <b>test</b> or <em>oh yeah</em> and <i>oh yeah</i>
"ye we \'hold\' it"
"unclosed double quotes at the end';

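$dom = new DOMDocument;
// Wrap the fragment in a <div> so it has a single root element;
// LIBXML_HTML_NOIMPLIED stops libxml from adding <html> and <body>.
$dom->loadHTML('<div>' . $html . '</div>', LIBXML_HTML_NOIMPLIED);

$nodeList = $dom->documentElement->childNodes;

$results = [];

foreach ($nodeList as $childNode) {
    if ($childNode->nodeType == XML_TEXT_NODE) {
        // Text node: collect bare words and double-quoted parts.
        // Whitespace-only text nodes produce no match and are skipped.
        if (preg_match_all('~[^\s"]+|"[^"]*"?~', $childNode->nodeValue, $m))
            $results = array_merge($results, $m[0]);
    } else {
        // Any other node (an element, for example) is kept as whole markup.
        $results[] = $dom->saveHTML($childNode);
    }
}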

print_r($results);

Note: I have chosen a default behaviour for a double-quoted part that is left unclosed (no closing quote): it is kept as a single item running to the end of the text node. Feel free to change it.
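As one possible alternative, here is a sketch (not part of the original answer) in which an unclosed quote no longer swallows the rest of the text node. The helper name splitTextLoose is made up for illustration. The pattern tries a properly closed quoted part first and otherwise falls back to plain words:

```php
<?php
// Hypothetical helper, for illustration only.
function splitTextLoose(string $text): array
{
    // Alternatives, tried in this order:
    //   "[^"]*"      a properly closed double-quoted part, kept whole
    //   "?[^\s"]+    a bare word, optionally preceded by a stray opening quote
    //   "            a lone quote with nothing usable after it
    preg_match_all('~"[^"]*"|"?[^\s"]+|"~', $text, $m);
    return $m[0];
}

print_r(splitTextLoose('say "hello world" now "open end'));
```

With this input the closed part "hello world" stays together, while the unclosed tail is split into "open and end.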

Note 2: The LIBXML_ constants are sometimes not defined. You can work around this by checking first and defining the constant when needed:

if (!defined('LIBXML_HTML_NOIMPLIED'))
    define('LIBXML_HTML_NOIMPLIED', 8192);
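On how the pattern in the question works: in PCRE, (*SKIP)(*F) placed after an alternative means "if this part matched, discard the match and do not retry inside it". So "[^"]*"(*SKIP)(*F) consumes each double-quoted part without producing a split point, and only the second alternative, \x20, actually splits. The same idea can be extended to simple tags. The sketch below is an assumption for illustration, not a full HTML parser: it does not handle nested tags of the same name or attribute values containing >, which is why the DOMDocument approach is more robust.

```php
<?php
// Parts to protect from splitting, tried in this order:
//   <(\w+)[^>]*>.*?</\1>   a simple paired tag such as <em>oh yeah</em>
//   <[^>]*>                any single tag such as <img/>
//   "[^"]*"                a double-quoted part
// Whatever one of them matches is skipped; only \x20 then yields a split.
$pattern = '~(?:<(\w+)[^>]*>.*?</\1>|<[^>]*>|"[^"]*")(*SKIP)(*F)|\x20~s';

$input = '<b>test</b> or <em>oh yeah</em> and <i>oh yeah</i> "ye we \'hold\' it"';

$parts = preg_split($pattern, $input);
print_r($parts);
```

For this input the result has six items, with <em>oh yeah</em>, <i>oh yeah</i> and the quoted part each kept whole.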