Luke Luke - 1 year ago 56
MySQL Question

UTF8 data stored from a Latin1 Connection, how to check if my data is OK?

Latin1 Connection, instead of UTF8

I've recently had reports from people using my website that they are unable to create content in Chinese or Arabic. This led me to try creating content on my site using Chinese characters, and I observed that the data stored for these characters is a question mark.

I realised from other questions and articles that I've read that I'm probably in 'character set hell'.

It seems that I've been connecting to the database through Entity Framework over a Latin1 connection, as this is the default for MySQL, but the columns in my database are UTF8.

How is my data encoded?

From what I have read of the process, I think that the data in my database is UTF8 data from the browser, encoded as Latin1 during transit to the database, which is then encoded again as UTF8 just before it is stored in the database.
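
The suspected chain can be reproduced offline as a rough sketch. The Python below is an illustration only, not the site's code; the sample string and variable names are assumptions. It takes UTF8 bytes, misreads them as Latin1, and encodes the result as UTF8 again, which is the pattern usually called double encoding.

```python
# Hypothetical sample text; any string with non-ASCII characters would do.
text = "île"

# Step 1: the browser sends the text as UTF-8 bytes.
utf8_bytes = text.encode("utf-8")

# Step 2: a Latin1 connection treats each byte as one Latin1 character.
misread = utf8_bytes.decode("latin-1")

# Step 3: those "characters" are converted to UTF-8 for a utf8 column.
double_encoded = misread.encode("utf-8")

print(utf8_bytes.hex())      # c3ae6c65
print(double_encoded.hex())  # c383c2ae6c65
```

If the stored hex for an accented character looks like the second line rather than the first, the data went through this chain.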

I have been using a PHP script to determine whether the data comes out correctly when I set my connection to use UTF8, and when I select data from my database, it doesn't look like there is any difference.

<!DOCTYPE html>
<meta charset="utf-8" />

<?php
// Make the connection to the database
$link = mysqli_connect('localhost', 'root', '', 'mydatabase');

if (!$link) {
    // mysqli_connect_error() is the mysqli call for connection failures;
    // mysql_error() belongs to the old mysql extension
    die('Could not connect to MySQL: ' . mysqli_connect_error());
}

// Set connection character set to UTF8
$link->set_charset('utf8');

echo '<p>Connection OK</p>';

// Request the string from the database
$result = $link->query("SELECT questiontext FROM question WHERE id = 101");

$row = $result->fetch_assoc();

// Display the data
echo "Result: " . $row['questiontext'] . '<br/>';



  • If I request a string of what looks like [A-Z 0-9] characters from existing data over a UTF8 connection, using this PHP script, it displays on the screen just as it did over the Latin1 connection when using Entity Framework. I can't tell the difference, and there doesn't appear to be any issue with the data.

  • If I request a string of Chinese characters that appear to have been entered into the database as question marks, it displays as question marks when retrieved over a UTF8 connection.

I expected that when I connected to the database using a UTF8 connection, the data would be displayed as garbage, as I was previously using a Latin1 connection, but it isn't.

I used Entity Framework to interrogate the MySQL variables that were being used for the connection before and after adding a charset setting to my connection string. Hopefully this gives you an idea of how the connection was being established before and how it is now:

Connection before:

[Image: Connection before]

Connection with connection string charset updated:

[Image: Connection after connection string update]

How can I determine whether the data in the database is encoded incorrectly (for example, Latin1 data encoded as UTF8), so that I can decide whether simply changing my connection string to use UTF8 will make everything work?


I've been experimenting by switching the connection type between Latin1 and UTF8, and these are my findings...

If I set my connection type to Latin1 and output the characters, I end up getting something like this:

Tu es dans une �le d�serte
HEX (bin2hex): 54752065732064616e7320756e6520 ee 6c652064 e9 7365727465203a

If I set my connection to UTF8:

Tu es dans une île déserte
HEX (bin2hex): 54752065732064616e7320756e6520 c3ae 6c652064 c3a9 7365727465203a

(bold and spacing added by Rick James)
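
These two hex dumps are what you would expect if the stored text is fine and only the connection charset differs. A small Python sketch (illustrative only; the helper name `hex_in` is made up for this example) encodes the same phrase both ways:

```python
def hex_in(text: str, charset: str) -> str:
    """Return the hex of `text` encoded in `charset` (hypothetical helper)."""
    return text.encode(charset).hex()


phrase = "île déserte"

latin1_hex = hex_in(phrase, "latin-1")
utf8_hex = hex_in(phrase, "utf-8")

print(latin1_hex)  # î and é take one byte each: ee, e9
print(utf8_hex)    # î and é take two bytes each: c3ae, c3a9
```

The single bytes `ee` and `e9` are not valid UTF8 on their own, which is why a page declared as UTF8 shows them as black diamonds.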

When using a UTF8 connection, there aren't any dodgy-looking characters at all; they appear only when I set my connection type to Latin1. This leads me to believe that the encoding of my data is OK and that it is presumably just straightforward UTF8.

The only conclusion I can draw from this is that Entity Framework has been communicating over a UTF8 connection all along, but I don't know how I can confirm that the data is stored correctly.

Answer Source

For Chinese, you need to tell MySQL to use utf8mb4, not just utf8.
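
The reason is byte length: MySQL's utf8 character set stores at most three bytes per character, while utf8mb4 allows four. A quick Python check (an illustrative sketch; the sample characters are arbitrary) suggests that common Chinese characters fit in three bytes, but rarer ones and emoji need four:

```python
samples = {
    "\u65b0": "common Chinese character (U+65B0)",
    "\U00020000": "CJK Extension B character (U+20000)",
    "\U0001F600": "emoji (U+1F600)",
}

byte_lengths = {}
for ch, label in samples.items():
    encoded = ch.encode("utf-8")
    byte_lengths[ch] = len(encoded)
    # Four-byte sequences need utf8mb4; MySQL's utf8 stops at three bytes.
    print(f"{label}: {encoded.hex()} ({len(encoded)} bytes)")
```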

When trying to use utf8/utf8mb4, if you see Question Marks (regular ones, not black diamonds) (? is hex 3F), the likely causes are:

  • The bytes to be stored are not encoded as utf8. Fix this.
  • The column in the database is not CHARACTER SET utf8mb4. Fix this.
  • Also, check that the connection during reading is utf8mb4.
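
Question marks usually mean the data was lost at write time. The Python sketch below is only an analogy for what a lossy charset conversion does (it is not MySQL's own conversion code): a character with no Latin1 equivalent is replaced by `?`, and the original cannot be recovered afterwards.

```python
original = "\u65b0\u6d6a"  # two Chinese characters

# Latin1 has no Chinese characters, so a lossy conversion substitutes "?".
stored = original.encode("latin-1", errors="replace")

print(stored)        # b'??'
print(stored.hex())  # 3f3f

# Reading the bytes back as UTF-8 still yields question marks.
recovered = stored.decode("utf-8")
print(recovered == original)  # False
```

So if HEX(col) shows 3F where a Chinese character should be, the original text is gone and has to be re-entered after the charset settings are fixed.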

æ–°æµªæ–°é—» is Mojibake for 新浪新闻

When trying to use utf8/utf8mb4, if you see Mojibake, check the following. This discussion also applies to Double Encoding, which is not necessarily visible.

  • The bytes to be stored need to be utf8-encoded.
  • The connection when INSERTing and SELECTing text needs to specify utf8mb4. (set_charset)
  • The column needs to be declared CHARACTER SET utf8mb4. (Check with SHOW CREATE TABLE.)
  • HTML should start with <meta charset=UTF-8>. (You did this.)

To verify, do SELECT col, HEX(col) FROM .... If the hex output for 新 is E696B0, then it is correctly encoded with utf8/utf8mb4. If you get C3A6E28093C2B0, it is "double-encoded". Usually, if the hex begins with E or F, it is probably correctly encoded. Also, the hex for a single Chinese character will be 6 or 8 digits long in all cases.
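
The hex check can be automated roughly as follows. This Python sketch is a heuristic, not a definitive test: the function name `classify` is hypothetical, and it assumes the double encoding went through cp1252 (which MySQL calls latin1). It takes the hex string returned by HEX(col) and tries to undo one layer of encoding.

```python
def classify(hex_string: str) -> str:
    """Guess whether stored bytes are plain UTF-8 or double-encoded."""
    raw = bytes.fromhex(hex_string)
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        return "not valid utf8"
    try:
        # Undo one layer: map the characters back to cp1252 bytes, then
        # see whether those bytes are themselves valid UTF-8.
        repaired = text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return "utf8"
    return "double-encoded" if repaired != text else "utf8"


print(classify("E696B0"))          # utf8
print(classify("C3A6E28093C2B0"))  # double-encoded
```

Pure ASCII text comes back as "utf8", since undoing a layer leaves it unchanged. A few byte values have no cp1252 mapping in Python, so some double-encoded strings may be reported as "utf8"; treat the result as a hint and confirm by eye.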