Steve Steve - 4 months ago 47x
PHP Question

utf8 encoding breaks when upgrading from php5.6 to php7.0

I have a simple (custom) CMS accepting markdown and displaying it in in a web page. Works fine in php5.6 (using the ondrej/php5 ppa on ubuntu 15.10). Mysql collation set to utf8 everywhere.

Upgrade the server to php7.0 (ondrej/php) and it displays garbage characters. I tried migrating the relevant mysql tables and fields to utf8mb4 / utf8mb4_unicode_ci with no luck.

Downgrade to php5.6 and it all works fine.
I have a hunch it is some strange php setting I don't know about? php.ini default_collation=UTF-8. Couldn't find anything else that worked. phpMyAdmin shows garbage no matter what version of php or server settings, so it is not much help.
What could i try next?

Source text (copied from php5.6 rendered page)

아동 보호 정책에 대한 규정
This Code is part of the

Rendered output (from php7 and phpMyAdmin)

ì•„ë™ ë³´í˜¸ ì •ì±…ì— ëŒ€í•œ ê·œì •
This Code is part of the


Use this to change a table to utf8mb4:

ALTER TABLE tbl CONVERT TO CHARACTER SET utf8mb4 COLLATION utf8mb4_unicode_520_ci;

However, if the table was already messed up, then this won't fix it. Do the following to verify:

SELECT col, HEX(col) FROM tbl WHERE ...

For example, 아동 보호 정책에 대한 규정 will show a hex of EC9584 EB8F99 EBB3B4 ED98B8 ECA095 ECB185 EC9790 EB8C80 ED959C EAB79C ECA095. (Please ignore the spaces.)

For Korean text, you should see (mostly) groups of 3 hex bytes of the form Ewxxyy, where w is A or B or C or D, as shown in the example above. Hex 20 (only 1 byte) represents a space.

ì•„ë™ ë³´í˜¸ ì •ì±…ì— ëŒ€í•œ ê·œì • is the Mojibake for it. This implies that somewhere latin1 was erroneously involved, probably when you INSERTed the text. In that case, you will see something like C3AC E280A2 E2809E C3AB C28F E284A2 C3AB C2B3 C2B4 C3AD CB9C C2B8 ... -- mostly 2-byte Cwxx hex.

If you see that, an UPDATE of something like this will repair the data: CONVERT(BINARY(CONVERT(CONVERT(col USING utf8mb4) USING latin1)) USING utf8mb4) (Edit: removed call to UNHEX.)