I'm doing some work on the Common Crawl dataset (a large web crawl) and I keep seeing a strange encoding scheme that I just can't work out how to deal with.
The pattern I'm seeing again and again is something like the sequence of bytes
50 6f 6b e9 6d 6f 6e
which I'm guessing is meant to represent "Pokémon".
Now, encoding schemes aren't my strongest point, but I don't know of any encoding where it's valid to represent the é as the single byte e9.
It's a bit like UTF-16, which would be
fe ff 00 50 00 6f 00 6b 00 e9 00 6d 00 6f 00 6e
And it's definitely not UTF-8, which would be
50 6f 6b c3 a9 6d 6f 6e
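To make the comparison concrete, here is a minimal sketch that dumps the bytes a given charset produces for a string. It assumes the intended text is "Pokémon"; the class and method names (`EncodingComparison`, `hex`, `encode`) are made up for illustration. ISO-8859-1 (Latin-1) is included as one candidate worth checking, since it and windows-1252 both encode é as the single byte e9.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
class EncodingComparison {

    // Renders bytes as space-separated lowercase hex, e.g. "50 6f 6b".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to avoid sign extension of negative byte values.
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // Encodes the text with the given charset and returns the hex dump.
    static String encode(String text, Charset charset) {
        return hex(text.getBytes(charset));
    }
}
```

Calling `EncodingComparison.encode("Pok\u00e9mon", StandardCharsets.ISO_8859_1)` gives `50 6f 6b e9 6d 6f 6e`, which matches the bytes seen in the crawl, while `StandardCharsets.UTF_8` and `StandardCharsets.UTF_16` give the two sequences shown above.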
So I'm just after a way in Java to decode these bytes into a string; a library would be ideal.
Decoding them as UTF-8 justifiably doesn't work and rightly converts the e9 byte
to the replacement character
ef bf bd
(aka the dreaded �)
Any ideas on how to handle these?
I've ended up using the character set encoding detector provided in Apache Tika. It works well.
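For anyone who can't pull in Tika, a much cruder alternative using only the JDK is sketched below. This is not what Tika does (it runs a statistical detector); it simply tries a strict UTF-8 decode and, if the bytes are malformed, falls back to a charset you choose. The class name `FallbackDecoder` and the choice of fallback are assumptions for illustration, and the approach will guess wrong on text that is neither UTF-8 nor the fallback.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Tika or any other library.
class FallbackDecoder {

    // Tries strict UTF-8 first; on malformed input, decodes with the fallback charset.
    static String decode(byte[] bytes, Charset fallback) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of emitting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so assume the single-byte fallback encoding.
            return new String(bytes, fallback);
        }
    }
}
```

With the bytes from the question, `FallbackDecoder.decode(bytes, StandardCharsets.ISO_8859_1)` returns "Pokémon" because the lone e9 byte is rejected as UTF-8 and then decoded as Latin-1; valid UTF-8 input is returned unchanged by the first branch.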