John Poe - 7 months ago
Scala Question

Scala convert string between two charsets

I have a malformed UTF-8 string that should read "Michèle Huà" but is output as "Michèle HuÃ".

According to this table, it is a mix-up between Windows-1252 and UTF-8.

How do I do the conversion?

scala> scala.io.Source.fromBytes("Michèle HuÃ".getBytes(), "ISO-8859-1").mkString
res25: String = Michèle HuÃ

scala> scala.io.Source.fromBytes("Michèle HuÃ".getBytes(), "UTF-8").mkString
res26: String = Michèle HuÃ

scala> scala.io.Source.fromBytes("Michèle HuÃ".getBytes(), "Windows-1252").mkString
res27: String = Michèle HuÃ

Thank you


You don't actually have the complete string there, because one character prints as a blank. "Michèle Huà" encoded as UTF-8 but read as Windows-1252 is actually "Michèle HuÃ" followed by one more character: "à" is the two bytes 0xC3 0xA0 in UTF-8, which Windows-1252 decodes as "Ã" and then 0xA0, a non-breaking space (which typically pastes as 0x20, an ordinary space).

If you can include that character, you can convert successfully.

scala> val fixed = new String("Michèle HuÃ\u00A0".getBytes("Windows-1252"), "UTF-8")
fixed: String = Michèle Huà
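
To see why the trailing 0xA0 matters, here is a minimal sketch of the round trip. The names MojibakeDemo, garble and repair are made up for illustration, and it assumes the JVM provides the "windows-1252" charset, which standard JDKs do. Non-ASCII characters are written as Unicode escapes so the result does not depend on the source file's encoding.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object MojibakeDemo {
  // Assumption: the running JVM ships the windows-1252 charset.
  val cp1252: Charset = Charset.forName("windows-1252")

  // Reproduce the damage: encode as UTF-8, then wrongly decode as Windows-1252.
  def garble(s: String): String =
    new String(s.getBytes(StandardCharsets.UTF_8), cp1252)

  // Undo it: re-encode as Windows-1252 to get the original bytes back,
  // then decode those bytes as UTF-8.
  def repair(s: String): String =
    new String(s.getBytes(cp1252), StandardCharsets.UTF_8)

  // Show code points so that the invisible U+00A0 becomes visible.
  def codePoints(s: String): String =
    s.map(c => f"U+${c.toInt}%04X").mkString(" ")

  def main(args: Array[String]): Unit = {
    val garbled = garble("Hu\u00E0")           // "Huà"
    println(codePoints(garbled))               // the last code point is U+00A0
    println(repair(garbled) == "Hu\u00E0")     // true: the full string can be repaired
    println(repair("Hu\u00C3") == "Hu\u00E0")  // false: the 0xA0 was lost, so the UTF-8 sequence is incomplete
  }
}
```

The repair only works if no bytes were lost or changed along the way, which is exactly what happens when the non-breaking space is pasted as an ordinary space. It also assumes every character in the string went through the same mis-decoding; a string that is only partly garbled would need the repair applied to just the damaged part.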