Ivan Zhivolupov - 5 months ago
PHP Question

No semicolon in encoding

I'm trying to decode text that I believe is encoded in Windows-1251.
The string looks like this:

&#1040&#1075&#1077&#1085&#1090


This should represent the word "Agent" in Russian. Here is the problem:


  1. I'm not able to convert this string unless I add a semicolon after each number.

  2. I can't do it manually, because I have about 10,000 lines of text to convert.



So the question is: what is this encoding (without semicolons), and how can I add the semicolons automatically to each line (with a regex, maybe?) without breaking the surrounding text?

So far, I've been trying to do this with the following code:

App Logic

public function parseSentence(array $sentences, $sentence, $i) {
// Split on the first delimiter found, then append ';' after every 6 characters.
// This only works while every reference is exactly 6 characters long (e.g. &#1040).
if (strstr($sentence, '-')) {
$sentences[$i] = $this->explodeAndSplit('-', $sentence);
} else if (strstr($sentence, "'")) {
$sentences[$i] = $this->explodeAndSplit("'", $sentence);
} else if (strstr($sentence, "(")) {
$sentences[$i] = $this->explodeAndSplit("(", $sentence);
} else if (strstr($sentence, ")")) {
$sentences[$i] = $this->explodeAndSplit(")", $sentence);
} else {
if (strstr($sentence, '#')) {
$sentences[$i] = chunk_split($sentence, 6, ';');
}
}
return $sentences;
}

/**
* Explode and Split
* @param string $explodeBy
* @param string $string
*
* @return string
*/
private function explodeAndSplit($explodeBy, $string) {
$exp = explode($explodeBy, $string);
for ($j = 0; $j < count($exp); $j++) {
$exp[$j] = chunk_split($exp[$j], 6, ';');
}
return implode($explodeBy, $exp);
}


Obviously, this approach is incorrect, because it doesn't account for many other "special" characters that can appear between the codes. How can it be fixed?

Update:

I'm using Lumen for the backend and AngularJS for the frontend. Lumen parses all the data (database, text files, etc.) and exposes API routes that AngularJS calls to retrieve it. This semicolon-less encoding works fine in any browser when accessed directly, but fails to display in Angular because of the missing semicolons.

Answer

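These are HTML numeric character references (decimal) for Cyrillic letters. The numbers are Unicode code points rather than Windows-1251 byte values: &#1040 is U+0410, the Cyrillic capital letter A. Browsers tolerate the missing semicolons, which is why the text displays when accessed directly. To ensure the characters are displayed properly, the page also needs an appropriate content type: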

<meta http-equiv="content-type" content="text/html;charset=utf-8" />

To add the semicolons, you can preg_split() the string of HTML codes on the &# prefixes:

array_filter(preg_split("/[&#]+/", $str));

The array_filter() call simply removes any empty values. You could also use explode() with "&#" as the delimiter to do the same thing.

This returns an array of the numbers. From there, an implode() with the required &# prepended and ; appended to each number rebuilds the string:

echo '&#' .implode( ";&#", array_filter(preg_split("/[&#]+/", $str) )) . ';';

Which returns:

&#1040;&#1075;&#1077;&#1085;&#1090;
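The split-and-implode approach assumes the string contains nothing but numeric references. For lines that mix references with hyphens, quotes, parentheses, or spaces (the cases the question's code tries to handle), a single preg_replace() may be simpler. The following is a minimal sketch, not part of the original answer; the function name addMissingSemicolons is hypothetical.

```php
<?php
// Hypothetical helper (name is an assumption, not from the original post).
function addMissingSemicolons(string $text): string
{
    // Match "&#" plus a run of digits and an optional existing semicolon,
    // then rewrite the match with exactly one semicolon.
    // Fall back to the input if preg_replace() reports an error (null).
    return preg_replace('/&#(\d+);?/', '&#$1;', $text) ?? $text;
}

// Only references: same result as the implode() approach.
echo addMissingSemicolons('&#1040&#1075&#1077&#1085&#1090'), "\n";

// Mixed text: the other characters are left untouched, and a reference
// that already ends in ';' does not get a second one.
echo addMissingSemicolons('&#1040&#1075 - (&#1077&#1085;)'), "\n";
```

One limitation: a reference followed directly by a literal digit is ambiguous without the semicolon, so the digit would be absorbed into the number.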

When rendered as HTML, this displays the following Russian text:

Агент

Which translates to "Agent".
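Since the question's update mentions that the data is served from a Lumen API to AngularJS, another option may be to decode the references into real UTF-8 text on the server, so the frontend never sees entities at all. This is a sketch under that assumption; the function name decodeNumericReferences is hypothetical, and it relies only on PHP's built-in html_entity_decode().

```php
<?php
// Hypothetical helper (name is an assumption, not from the original post).
function decodeNumericReferences(string $text): string
{
    // Step 1: make sure every numeric reference ends with a semicolon,
    // because html_entity_decode() only decodes well-formed references.
    $fixed = preg_replace('/&#(\d+);?/', '&#$1;', $text) ?? $text;

    // Step 2: decode the references into UTF-8 characters.
    // Note: this also decodes named entities such as &amp;.
    return html_entity_decode($fixed, ENT_QUOTES | ENT_HTML5, 'UTF-8');
}

echo decodeNumericReferences('&#1040&#1075&#1077&#1085&#1090'), "\n"; // Агент
```

The decoded string can then be returned in the API response as ordinary UTF-8 text.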
