Shawn - 4 months ago
PHP Question

PHP - dealing with HTML entities that are missing semicolon

I'm trying to write a script to parse a remote RSS feed, and output the result in JSON format.

The raw RSS feed contains numeric HTML entities such as &#8211;.


I use html_entity_decode() on the raw content first, so that json_encode() will generate correct output:

$rss = new DOMDocument();
$rss->load($url); // $url is a placeholder for the feed address, which is not shown in the post
$feed = array();
foreach ($rss->getElementsByTagName('item') as $node) {
    // Decode entities in the text fields; link and date are used as-is
    $item = array(
        'title' => html_entity_decode($node->getElementsByTagName('title')->item(0)->nodeValue, ENT_COMPAT, 'UTF-8'),
        'desc' => html_entity_decode($node->getElementsByTagName('description')->item(0)->nodeValue, ENT_COMPAT, 'UTF-8'),
        'link' => $node->getElementsByTagName('link')->item(0)->nodeValue,
        'date' => $node->getElementsByTagName('pubDate')->item(0)->nodeValue,
    );
    $feed[] = $item;
}
$data = array();
foreach ($feed as $item) {
    $data[] = array(
        'url' => $item['link'],
        'date' => date('l, F d, Y g:i A', strtotime($item['date'])),
        'title' => $item['title'],
        'desc' => $item['desc'],
    );
}
echo json_encode($data);

It works well except for some HTML entities that are missing semicolons. html_entity_decode() won't recognize them.
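To illustrate the problem, here is a minimal sketch. The sample strings are made up for illustration and are not taken from the actual feed; the point is only that the same numeric entity is decoded with its semicolon and passed through unchanged without it.

```php
<?php
// Sample inputs (hypothetical, chosen for illustration).
$terminated   = '&#8211;'; // well-formed numeric entity for an en dash
$unterminated = '&#8211';  // same entity, trailing semicolon missing

$decodedOk  = html_entity_decode($terminated, ENT_COMPAT, 'UTF-8');
$decodedBad = html_entity_decode($unterminated, ENT_COMPAT, 'UTF-8');

// The terminated entity becomes the UTF-8 bytes of U+2013 (en dash).
var_dump($decodedOk === "\xE2\x80\x93");

// The unterminated entity comes back exactly as it went in.
var_dump($decodedBad === $unterminated);
```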

I'm thinking maybe I can use regex to find and fix those entities without semicolons. But I don't know how to write such code. Any idea?

Or is there any other way to deal with this?


It seems you just want to match &# followed by 4 digits that are not followed by ;. Use

&#\d{4}(?!;)

and replace with $0;.


  • &# - literal sequence &#
  • \d{4} - 4 digits
  • (?!;) - a negative lookahead that fails the match if there is a ; right after the 4 digits.

The $0 in the replacement pattern is the backreference to the whole match value.

PHP snippet:

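$re = '~&#\d{4}(?!;)~';   // &# plus 4 digits, not followed by ;
$str = '&#8211&#8210––';
$subst = '$0;';           // the whole match plus the missing semicolon
$result = preg_replace($re, $subst, $str);
echo $result;             // &#8211;&#8210;––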
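One caveat with a fixed \d{4}: on a correctly terminated five-digit entity such as &#12345; it matches the first four digits and inserts a semicolon in the middle. If the feed may contain entities of other lengths, or hexadecimal ones, a variation along these lines might be safer. This is only a sketch: the function name terminate_numeric_entities and the sample string are assumptions for illustration, not part of the answer above.

```php
<?php
/**
 * Hypothetical helper: append a semicolon to numeric entities
 * (decimal or hexadecimal) that lack one.
 */
function terminate_numeric_entities(string $text): string
{
    // x[0-9a-f]++  hexadecimal form, e.g. &#x2013
    // \d++         decimal form, e.g. &#8211
    // The possessive ++ consumes every digit and never backtracks,
    // so the lookahead is only tested after the complete number.
    $fixed = preg_replace('~&#(?:x[0-9a-f]++|\d++)(?!;)~i', '$0;', $text);

    // preg_replace() returns null on error; fall back to the input.
    return $fixed ?? $text;
}

$raw   = '&#8211&#12345 &#8211; &#x2013'; // made-up sample input
$fixed = terminate_numeric_entities($raw);

echo $fixed, "\n"; // &#8211;&#12345; &#8211; &#x2013;
echo html_entity_decode($fixed, ENT_COMPAT, 'UTF-8'), "\n";
```

Note that a hexadecimal entity followed directly by one of the letters a to f remains ambiguous, since that letter is consumed as part of the number; no regex can resolve that case reliably.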