I'm scraping a UTF-8 site with Goutte, which uses Guzzle internally. The site declares UTF-8 in a meta tag:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
However, the Content-Type response header it sends does not specify a charset:
Content-Type: text/html
rather than:
Content-Type: text/html; charset=utf-8
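The scraping script:
<?php
require_once realpath(__DIR__ . '/..') . '/vendor/goutte/goutte.phar';
use Goutte\Client;
// Local test page that sends an incomplete Content-Type header
$url = 'http://crawler-tests.local/utf-8.php';
$client = new Client();
$crawler = $client->request('GET', $url);
// text() returns the text content of the parsed document
$text = $crawler->text();
echo 'Whole page: ' . $text . "\n";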
The test page (utf-8.php):
<?php
// Correct
#header('Content-Type: text/html; charset=utf-8');
// Incorrect
header('Content-Type: text/html');
?>
<!DOCTYPE html>
<html>
<head>
<title>UTF-8 test</title>
<meta charset="utf-8" />
</head>
<body>
<p>When the Content-Header header is incomplete, the pound sign breaks:
£15,216</p>
</body>
</html>
Output:
Whole page: UTF-8 test
When the Content-Header header is incomplete, the pound sign breaks: £15,216
The issue is actually with symfony/browser-kit and symfony/dom-crawler. BrowserKit's Client does not examine the HTML meta tags to determine the charset; it looks only at the Content-Type header. When the response body is handed over to the DomCrawler, it is treated as being in the default charset, ISO-8859-1. Once the meta tags have been examined, that decision should be reverted and the DOMDocument rebuilt, but that never happens.
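To see why the pound sign breaks, here is a minimal sketch that reproduces the effect outside of Goutte. It assumes the mechanism is as described above (UTF-8 bytes read as ISO-8859-1) and that the mbstring extension is available; the variable names are only illustrative.

```php
<?php
// "£15,216" as raw UTF-8 bytes: the pound sign is the two bytes C2 A3.
$utf8Bytes = "\xC2\xA3" . '15,216';

// Read every byte as one ISO-8859-1 character and re-encode as UTF-8,
// which is what happens when the charset is wrongly assumed.
$misdecoded = mb_convert_encoding($utf8Bytes, 'UTF-8', 'ISO-8859-1');

echo $misdecoded, "\n"; // prints "Â£15,216"
```

Each of the two bytes of the pound sign becomes its own character, which is where the stray "Â" comes from.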
The easy workaround is to wrap $crawler->text() with utf8_decode():
$text = utf8_decode($crawler->text());
This works if the input is UTF-8. For other encodings, something similar can probably be achieved with iconv(). However, you have to remember to do that every time you call text().
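As a rough sketch of how that workaround might be generalised to other encodings: the helper below is hypothetical (repairMisdecodedText() is not part of Goutte or Symfony) and assumes the text was wrongly read as ISO-8859-1, as described above. It uses mb_convert_encoding() instead of utf8_decode(), since utf8_decode() is deprecated as of PHP 8.2.

```php
<?php
/**
 * Hypothetical helper: repair text whose bytes were wrongly read as ISO-8859-1.
 *
 * @param string $text          Text as returned by the crawler (UTF-8 encoded)
 * @param string $actualCharset The charset the page was really served in
 */
function repairMisdecodedText(string $text, string $actualCharset = 'UTF-8'): string
{
    // Step 1: undo the wrong ISO-8859-1 assumption to recover the raw bytes.
    $rawBytes = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

    // Step 2: if the page really was UTF-8, the raw bytes are already correct.
    if (strcasecmp($actualCharset, 'UTF-8') === 0) {
        return $rawBytes;
    }

    // Otherwise convert from the real charset to UTF-8.
    $converted = iconv($actualCharset, 'UTF-8', $rawBytes);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $actualCharset to UTF-8");
    }

    return $converted;
}
```

With the crawler from the question, usage would look like `$text = repairMisdecodedText($crawler->text());`. The drawback is unchanged: it has to be applied to every call to text().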
A more generic approach is to make the DomCrawler believe that it is dealing with UTF-8. To that end, I've come up with a Guzzle plugin that overwrites (or adds) the charset in the Content-Type response header. You can find it at https://gist.github.com/pschultz/6554265. Usage is like this:
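<?php
use Goutte\Client;
// ForceCharsetPlugin is the class from the gist; load it before this point
$plugin = new ForceCharsetPlugin();
$plugin->setForcedCharset('utf-8');
$client = new Client();
// Register the plugin on the underlying Guzzle client
$client->getClient()->addSubscriber($plugin);
// $url is the page to scrape, as in the question
$crawler = $client->request('GET', $url);
echo $crawler->text();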
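The header rewrite itself can be sketched as a small standalone function. This is not the gist's code, only an illustration of the idea of overwriting or adding the charset parameter; forceCharset() is a hypothetical name, and falling back to text/html for an empty header is an assumption.

```php
<?php
/**
 * Hypothetical sketch: return a Content-Type value with the charset forced.
 */
function forceCharset(string $contentType, string $charset): string
{
    // Drop any existing charset parameter (case-insensitive, optional quotes).
    $stripped = preg_replace('/;\s*charset\s*=\s*"?[^";\s]*"?/i', '', $contentType);

    // Assumption: treat a missing media type as text/html.
    $mediaType = trim((string) $stripped);
    if ($mediaType === '') {
        $mediaType = 'text/html';
    }

    // Append the forced charset.
    return $mediaType . '; charset=' . $charset;
}

echo forceCharset('text/html', 'utf-8'), "\n";
echo forceCharset('text/html; charset=ISO-8859-1', 'utf-8'), "\n";
```

Both calls yield `text/html; charset=utf-8`, which is the header value the crawler needs to see in order to parse the body as UTF-8.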