ChriX ChriX - 4 months ago 26
Groovy Question

RandomAccessFile and UTF-8 line

I use a RandomAccessFile object to read a UTF-8 French file. I use the readLine method.

My Groovy code below:

while ((line = randomAccess.readLine())) {
    def utfLine = new String(line.getBytes('UTF-8'), 'UTF-8')
    long nextRecordPos = randomAccess.getFilePointer()

    compareNextRecords(utfLine, randomAccess)
}

My problem is that line and utfLine are the same: the accented characters stay like Ã© instead of é. No conversion is done.


First of all, this line of code does absolutely nothing. It encodes the string to UTF-8 bytes and immediately decodes those same bytes back as UTF-8, so the data is the same. Remove it:

def utfLine = new String(line.getBytes('UTF-8'), 'UTF-8')
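To see why that line is a no-op, here is a minimal Java sketch of the same round trip. The class and method names (RoundTripDemo, roundTrip) are made up for illustration and are not part of the question's code.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encodes a string to UTF-8 bytes and decodes those same bytes as UTF-8,
    // which is what the removed Groovy line does.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u00c3\u00a9" is the garbled pair that shows up in place of e-acute.
        String garbled = "\u00c3\u00a9";
        System.out.println(roundTrip(garbled).equals(garbled)); // prints "true"
    }
}
```

The garbled string goes in and the identical garbled string comes out, because the encode and decode steps use the same charset and cancel each other.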

According to the Javadoc, RandomAccessFile.readLine() is not aware of character encodings. It reads bytes until it encounters "\r" or "\n" or "\r\n" (or the end of the file). ASCII byte values are put into the returned string in the normal way. But byte values between 128 and 255 are put into the string literally, one char per byte with the high eight bits set to zero, without interpreting them as being in a character encoding. In effect the bytes are decoded as ISO-8859-1 (Latin-1), which is why each two-byte UTF-8 sequence for an accented letter shows up as two garbage characters.
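A small experiment makes this visible. The sketch below is only an illustration: ReadLineDemo and rawFirstLine are hypothetical names, and the sample word is written with Unicode escapes so the source file's own encoding does not matter.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class ReadLineDemo {
    // Writes the text to a temporary file as UTF-8 and returns the first line
    // exactly as RandomAccessFile.readLine() reports it.
    static String rawFirstLine(String text) {
        try {
            File tmp = File.createTempFile("readline-demo", ".txt");
            try {
                Files.write(tmp.toPath(), text.getBytes(StandardCharsets.UTF_8));
                try (RandomAccessFile raf = new RandomAccessFile(tmp, "r")) {
                    return raf.readLine();
                }
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "\u00e9" is e-acute; UTF-8 stores it as the two bytes 0xC3 0xA9.
        String raw = rawFirstLine("\u00e9t\u00e9\n");
        System.out.println(raw.length()); // prints 5: one char per byte, not 3
        for (char c : raw.toCharArray()) {
            System.out.printf("U+%04X ", (int) c); // U+00C3 U+00A9 U+0074 U+00C3 U+00A9
        }
        System.out.println();
    }
}
```

The three-letter word comes back as five chars, and U+00C3 U+00A9 is exactly the "Ã©" pair from the question.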

There is no method or constructor to set the character encoding in a RandomAccessFile. But it's still valuable to use readLine() because it takes care of parsing for a newline sequence and allocating memory.

The easiest solution in your situation is to manually convert the fake "line" into bytes by reversing what readLine() did, then decode the bytes into a real string with awareness of character encoding. I don't know how to write code in Groovy, so I'll give the answer in Java:

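// readLine() returns null at end of file; check for that before converting.
String fakeLine = randomAccess.readLine();
// Each char holds one raw byte (0-255), so the cast recovers the original byte.
byte[] bytes = new byte[fakeLine.length()];
for (int i = 0; i < fakeLine.length(); i++)
    bytes[i] = (byte)fakeLine.charAt(i);
// Decode the recovered bytes as UTF-8 to get the real text.
String realLine = new String(bytes, "UTF-8");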
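If it helps, the conversion above can be wrapped in a small helper. This is just a sketch under a few assumptions: the names Utf8Lines, decodeUtf8 and readUtf8Line are invented here, and the byte loop is swapped for getBytes(StandardCharsets.ISO_8859_1), which should give the same bytes because every char returned by readLine() is in the range 0 to 255.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

class Utf8Lines {
    // Reverses readLine()'s byte-to-char widening, then decodes the bytes as UTF-8.
    static String decodeUtf8(String fakeLine) {
        byte[] bytes = fakeLine.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Reads the next line as UTF-8 text, or returns null at end of file.
    static String readUtf8Line(RandomAccessFile raf) throws IOException {
        String fakeLine = raf.readLine();
        return fakeLine == null ? null : decodeUtf8(fakeLine);
    }

    public static void main(String[] args) throws IOException {
        File tmp = File.createTempFile("utf8-lines", ".txt");
        try {
            // Sample data: two French words containing accented letters.
            String sample = "\u00e9t\u00e9\nno\u00ebl\n";
            Files.write(tmp.toPath(), sample.getBytes(StandardCharsets.UTF_8));
            try (RandomAccessFile raf = new RandomAccessFile(tmp, "r")) {
                String line;
                while ((line = readUtf8Line(raf)) != null) {
                    // getFilePointer() still counts bytes, so it stays usable for seek().
                    System.out.println(line.length() + " chars, next record at byte "
                            + raf.getFilePointer());
                }
            }
        } finally {
            tmp.delete();
        }
    }
}
```

Since Groovy can call Java classes directly, something like Utf8Lines.readUtf8Line(randomAccess) could probably replace the readLine() call in the question's loop, though I haven't tried it from Groovy.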