André Cardoso - 3 months ago
PHP Question

Accents become question marks in PHP when parsing HTML

I'm fetching PT-BR text automatically by downloading an HTML page, and the accented characters become question marks when I use utf8_decode. This is my function:

function pegaMsg($string)
{
    $bot_url = "http://website.com";
    //&rnd=&msg="
    $rand_msg = rand(0,100);
    $url = $bot_url . $rand_msg . "&msg=" . $string;
    $url = str_replace(" ", "%20", $url);
    //echo "\n" . $url;
    $download = http_get($url, $referer="");
    $download['FILE'] = utf8_decode($download['FILE']);
    $download['FILE'] = str_replace("var resp = ", "", $download['FILE']);
    $download['FILE'] = str_replace("\\r\\n", "", $download['FILE']);
    $download['FILE'] = str_replace(";", "", $download['FILE']);
    $download['FILE'] = str_replace("\'", "", $download['FILE']);

    $download['FILE'] = trim($download['FILE']);
    return $download['FILE'];
}


This is the expected output:


VOCÊ TINHA DUAS ESCOLHAS:


And this is what I get:


'VOC? TINHA DUAS ESCOLHAS:


What can I do? I want the circumflex (^) displayed! Thanks, and sorry for the bad English.

Answer

utf8_decode replaces invalid code unit sequences with ?. The reason you're getting a ? is likely that the text you're passing to utf8_decode was not in UTF-8 to begin with.
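One way to test that diagnosis is to check whether the downloaded bytes are well-formed UTF-8 before decoding them. The following is only a sketch: looksLikeUtf8 is a hypothetical helper name, the sample strings are hand-written stand-ins for the real contents of $download['FILE'], and it assumes the mbstring extension is available.

```php
<?php
// Hypothetical sample bytes: "VOCÊ" encoded as ISO-8859-1 and as UTF-8.
$latin1 = "VOC\xCA";
$utf8   = "VOC\xC3\x8A";

// Hypothetical helper: report whether a byte string is well-formed UTF-8.
// Requires the mbstring extension.
function looksLikeUtf8(string $bytes): bool
{
    return mb_check_encoding($bytes, 'UTF-8');
}

var_dump(looksLikeUtf8($latin1)); // a lone 0xCA byte is not valid UTF-8
var_dump(looksLikeUtf8($utf8));   // 0xC3 0x8A is the UTF-8 encoding of Ê

// A hex dump shows the raw bytes regardless of how the terminal renders them.
echo bin2hex($latin1), "\n";
```

If the check fails on the real download, the page was probably not served as UTF-8, and utf8_decode is the wrong tool for it.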

In fact, it's possible it was already in ISO-8859-1, which is the encoding of the string returned by utf8_decode. In that case, your solution would be to just omit the call to utf8_decode.

If the original text was in neither UTF-8 nor ISO-8859-1 (the latter being what I assume you want, since you're calling utf8_decode), you have to use iconv or mb_convert_encoding.
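A minimal sketch of such a conversion with mb_convert_encoding follows. The helper name convertEncoding and the Windows-1252 source encoding are assumptions made for illustration; the real source encoding has to be taken from the page itself (its Content-Type header or meta tag).

```php
<?php
// Hypothetical helper: convert bytes from a known source encoding to a target one.
// Requires the mbstring extension; $from must be the page's real encoding.
function convertEncoding(string $bytes, string $from, string $to = 'ISO-8859-1'): string
{
    // Note the argument order: string, target encoding, source encoding.
    return mb_convert_encoding($bytes, $to, $from);
}

// Assumed example: the page turned out to be Windows-1252 and we want UTF-8.
$asUtf8 = convertEncoding("VOC\xCA", 'Windows-1252', 'UTF-8');

// The default target is ISO-8859-1, which is what utf8_decode would produce.
$asLatin1 = convertEncoding("VOC\xC3\x8A", 'UTF-8');

// iconv does the same job with a different argument order:
// iconv($from, $to, $bytes)
```

Passing the source encoding explicitly avoids relying on a guess, which is the usual cause of the ? substitutions.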

A final possibility is that whatever is interpreting the script output assumes an encoding different from the one the output actually uses, and that it also converts invalid code unit sequences to ?.
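If the output is viewed in a browser, one way to rule this out is to declare the charset explicitly. This is a sketch only: contentTypeHeader is a hypothetical helper, and it assumes the script really emits ISO-8859-1 bytes. For a terminal, the terminal's own encoding setting matters instead.

```php
<?php
// Hypothetical helper: build a Content-Type header naming the charset actually used.
function contentTypeHeader(string $charset, string $mime = 'text/html'): string
{
    return sprintf('Content-Type: %s; charset=%s', $mime, $charset);
}

// Assumption: the script emits ISO-8859-1 bytes (e.g. the result of utf8_decode).
// Headers must be sent before any output, hence the guard.
if (!headers_sent()) {
    header(contentTypeHeader('ISO-8859-1'));
}
```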