mat kelcey - 5 months ago
Java Question

Dealing with incorrectly encoded UTF-16 (?) in Java

I'm doing some work on the Common Crawl dataset (a large web crawl) and I keep seeing a strange encoding scheme that I can't work out how to deal with.

The pattern I'm seeing again and again is something like the sequence of bytes

50 6f 6b e9 6d 6f 6e
which I'm guessing is meant to represent Pokémon.

Now encoding schemes aren't my strongest point, but I don't know of any encoding where it's valid to represent the é as just
e9

It's a bit like UTF-16 [1], which would be
fe ff 00 50 00 6f 00 6b 00 e9 00 6d 00 6f 00 6e

And it's definitely not UTF-8, which would be
50 6f 6b c3 a9 6d 6f 6e

So I'm just after a way in Java to decode these bytes into a string; a library would be ideal.

new String(bytes)
understandably doesn't work, and rightly converts the e9
to the replacement character
ef bf bd
(aka the dreaded �)
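
For anyone wanting to reproduce this, here is a minimal sketch. It assumes the platform default charset is UTF-8, so it passes UTF-8 explicitly to make the behaviour the same everywhere. The class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class ReplacementCharDemo {
    // The bytes from the question: "Pok", 0xE9, "mon".
    static final byte[] BYTES = {0x50, 0x6f, 0x6b, (byte) 0xe9, 0x6d, 0x6f, 0x6e};

    // Decode as UTF-8. Malformed input is silently replaced with U+FFFD
    // instead of raising an error.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Render bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = decodeAsUtf8(BYTES);
        // 0xE9 starts a three-byte UTF-8 sequence, but 0x6D is not a
        // continuation byte, so the decoder substitutes U+FFFD for 0xE9.
        System.out.println(s.indexOf('\uFFFD'));                      // prints 3
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8))); // prints 50 6f 6b ef bf bd 6d 6f 6e
    }
}
```

The original e9 byte is gone after this round trip, so any fix has to happen when the bytes are decoded, not afterwards on the string.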

Any ideas on how to handle these?


I've ended up using the character set encoding detector provided in Apache Tika [2]. Works well.




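That's either ISO-8859-1 or Windows-1252. The latter is essentially a superset of the former: it assigns printable characters to most of the 0x80-0x9F range, which ISO-8859-1 reserves for control codes. Use either new String(bytes, "ISO-8859-1") or new String(bytes, "Windows-1252").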
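
A rough sketch of how that might look in practice. The String constructor that takes a charset name throws the checked UnsupportedEncodingException, so this version passes Charset objects instead. The decode helper, which tries strict UTF-8 first and only then falls back, is my own assumption about what a mixed crawl needs, not part of the suggestion above. It also assumes the JDK in use ships windows-1252, which common ones do but the spec does not guarantee.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class LegacyDecode {
    // Assumption: this charset is available on the running JDK.
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    // Hypothetical helper: try strict UTF-8 first; if the bytes are not
    // valid UTF-8, decode them with the given single-byte fallback charset.
    static String decode(byte[] bytes, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return new String(bytes, fallback);
        }
    }

    public static void main(String[] args) {
        byte[] legacy = {0x50, 0x6f, 0x6b, (byte) 0xe9, 0x6d, 0x6f, 0x6e};
        byte[] utf8 = {0x50, 0x6f, 0x6b, (byte) 0xc3, (byte) 0xa9, 0x6d, 0x6f, 0x6e};

        // Direct decoding, as suggested above: 0xE9 maps to U+00E9 (e acute).
        String direct = new String(legacy, StandardCharsets.ISO_8859_1);
        System.out.println(direct.equals("Pok\u00e9mon")); // prints true

        // Both byte sequences come out as the same string via the helper.
        System.out.println(decode(legacy, WINDOWS_1252).equals(decode(utf8, WINDOWS_1252))); // prints true

        // Where the two charsets differ: 0x80 is the euro sign in
        // windows-1252 but a C1 control character in ISO-8859-1.
        byte[] euro = {(byte) 0x80};
        System.out.println(new String(euro, WINDOWS_1252).equals("\u20ac"));               // prints true
        System.out.println(new String(euro, StandardCharsets.ISO_8859_1).equals("\u0080")); // prints true
    }
}
```

If the pages come from the web, windows-1252 is probably the safer fallback of the two, since bytes in the 0x80-0x9F range are far more likely to be curly quotes or a euro sign than control codes.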