mat kelcey - 7 months ago
Java Question

Dealing with incorrectly encoded UTF-16 (?) in Java

I'm doing some work on the Common Crawl dataset (a large web crawl) and I keep seeing a strange encoding scheme that I just can't work out how to deal with.

The pattern I'm seeing again and again is something like the sequence of bytes

50 6f 6b e9 6d 6f 6e

which I'm guessing is meant to represent Pokémon.

Encoding schemes aren't my strongest point, and I don't know of any encoding where it's valid to represent the é as just e9.

It's a bit like [UTF-16][1] which would be
fe ff 00 50 00 6f 00 6b 00 e9 00 6d 00 6f 00 6e


And it's definitely not UTF-8 which would be
50 6f 6b c3 a9 6d 6f 6e
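To check the byte sequences quoted above, here is a minimal sketch that encodes the string in a few charsets and prints the bytes as hex. The class name `EncodingBytes` and the helper `hex` are made up for illustration. Java's `UTF-16` encoder writes a big-endian byte order mark first, which is where the leading `fe ff` comes from.

```java
import java.nio.charset.StandardCharsets;

class EncodingBytes {
    // Hypothetical helper: formats bytes as space-separated lowercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // \u00e9 is é; the escape avoids depending on the source file encoding.
        String s = "Pok\u00e9mon";
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16)));
        // fe ff 00 50 00 6f 00 6b 00 e9 00 6d 00 6f 00 6e
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));
        // 50 6f 6b c3 a9 6d 6f 6e
        System.out.println(hex(s.getBytes(StandardCharsets.ISO_8859_1)));
        // 50 6f 6b e9 6d 6f 6e
    }
}
```

The last line matches the bytes seen in the crawl, which suggests a single-byte Latin encoding rather than a damaged UTF-16.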


So I'm just after a way in Java to decode these bytes into a string; a library would be ideal.

new String(bytes) justifiably doesn't work and is rightly converting the e9 to the replacement character ef bf bd (aka the dreaded �).
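A small sketch of that behaviour, assuming the platform default charset in the question was UTF-8 (`new String(bytes)` uses the default, so the charset is passed explicitly here to make the result repeatable). The class and method names are illustrative only. In UTF-8, `e9` announces a three-byte sequence, but the next byte `6d` is not a continuation byte, so the decoder substitutes U+FFFD.

```java
import java.nio.charset.StandardCharsets;

class ReplacementDemo {
    // The bytes from the question.
    static final byte[] RAW = {0x50, 0x6f, 0x6b, (byte) 0xe9, 0x6d, 0x6f, 0x6e};

    // Decodes as UTF-8; malformed input is silently replaced with U+FFFD.
    static String decodeAsUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = decodeAsUtf8(RAW);
        System.out.println(s.equals("Pok\uFFFDmon")); // true
        // Re-encoding the replacement character gives the ef bf bd seen above.
        byte[] again = "\uFFFD".getBytes(StandardCharsets.UTF_8);
        for (byte b : again) {
            System.out.printf("%02x ", b & 0xff); // ef bf bd
        }
        System.out.println();
    }
}
```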

Any ideas on how to handle these?

update

I've ended up using the character set encoding detector provided in Apache Tika [2]. Works well.

[1] http://www.fileformat.info/info/unicode/char/e9/index.htm

[2] http://tika.apache.org/0.8/api/org/apache/tika/parser/txt/CharsetDetector.html

Answer

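That's either ISO-8859-1 or Windows-1252, the latter being essentially a superset of the former: the two differ only in the byte range 0x80-0x9F, where Windows-1252 defines printable characters and ISO-8859-1 has control codes. Both map the byte e9 to é. Use either new String(bytes, "ISO-8859-1") or new String(bytes, "Windows-1252").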
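As a rough sketch of the suggestion, the following decodes the bytes from the question with both charsets. It uses the `Charset` overload of the `String` constructor rather than the charset-name overload, which avoids the checked `UnsupportedEncodingException`. The class name `Latin1Decode` and the `decode` helper are made up for illustration, and the sketch assumes the JDK in use ships `windows-1252`.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1Decode {
    // There is no StandardCharsets constant for windows-1252, so look it up by name.
    static final Charset WINDOWS_1252 = Charset.forName("windows-1252");

    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] raw = {0x50, 0x6f, 0x6b, (byte) 0xe9, 0x6d, 0x6f, 0x6e};
        // Both charsets map e9 to U+00E9 (é).
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1).equals("Pok\u00e9mon")); // true
        System.out.println(decode(raw, WINDOWS_1252).equals("Pok\u00e9mon")); // true

        // They disagree in 0x80-0x9F: byte 80 is a control code in one, the euro sign in the other.
        byte[] b80 = {(byte) 0x80};
        System.out.println((int) decode(b80, StandardCharsets.ISO_8859_1).charAt(0)); // 128
        System.out.println((int) decode(b80, WINDOWS_1252).charAt(0)); // 8364
    }
}
```

Both single-byte charsets accept every byte sequence without error, so a successful decode on its own does not prove that the guess was right.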
