ds77 - 3 months ago
Swift Question

Swift UTF-16 data stream: issue dividing into chunks

Can I ask for help with splitting a UTF-16 data stream into chunks?

Unfortunately, I am struggling to find the character boundaries.

Any help is appreciated. I have already spent several evenings on this and would love to understand the issue.

The Java version works just fine (is there some auto-correction? Even when only the first two bytes are split off, part2 still decodes to the correct string):

public class Main {
    public static void main(String[] args) throws Exception {
        String encoding = "UTF-16";
        byte[] data = "ČŘŠŤĎŽŇčřšťďňě".getBytes(encoding);

        System.out.println("Data size: " + data.length);

        // Split at every even byte offset in the first half of the data.
        for (int index = 2; index <= data.length / 2; index += 2) {
            byte[] part1 = java.util.Arrays.copyOfRange(data, 0, index);
            byte[] part2 = java.util.Arrays.copyOfRange(data, index, data.length);

            assert part1.length + part2.length == data.length;

            System.out.println("--------------------- " + index);

            System.out.println(new String(part1, encoding));
            System.out.println(new String(part2, encoding));
        }
    }
}


Java output:

Data size: 30
--------------------- 2

ČŘŠŤĎŽŇčřšťďňě
--------------------- 4
Č
ŘŠŤĎŽŇčřšťďňě
--------------------- 6
ČŘ
ŠŤĎŽŇčřšťďňě
--------------------- 8
....


Swift (Xcode 8 beta 6, Swift 3) playground code:

import Foundation

let encoding = String.Encoding.utf16
let data = "ČŘŠŤĎŽŇčřšťďňě".data(using: encoding)!

print("Data size: \(data.count)")

for index in stride(from: 2, to: data.count / 2, by: 2) {
    let part1 = data.subdata(in: 0..<index)
    let part2 = data.subdata(in: index..<data.count)

    assert(part1.count + part2.count == data.count)

    print("--------------------- \(index)")
    print(String(data: part1, encoding: encoding))
    print(String(data: part2, encoding: encoding))
}


Swift output:

Data size: 30
--------------------- 2
Optional("")
Optional("ఁ堁态搁ก紁䜁ഁ夁愁攁༁䠁ᬁ")
--------------------- 4
Optional("Č")
Optional("堁态搁ก紁䜁ഁ夁愁攁༁䠁ᬁ")
--------------------- 6
Optional("ČŘ")
Optional("态搁ก紁䜁ഁ夁愁攁༁䠁ᬁ")
--------------------- 8
Optional("ČŘŠ")
Optional("搁ก紁䜁ഁ夁愁攁༁䠁ᬁ")
--------------------- 10
Optional("ČŘŠŤ")
Optional("ก紁䜁ഁ夁愁攁༁䠁ᬁ")
--------------------- 12
Optional("ČŘŠŤĎ")
Optional("紁䜁ഁ夁愁攁༁䠁ᬁ")


If I change the Swift encoding to String.Encoding.utf8, the output is as expected, but with utf16 and utf32 I do not understand what is going on.

Thanks.

Answer

Short answer: Use the utf16LittleEndian or utf16BigEndian encoding (for both encoding and decoding) to get the expected results:

Data size: 28
--------------------- 2
Optional("Č")
Optional("ŘŠŤĎŽŇčřšťďňě")
--------------------- 4
Optional("ČŘ")
Optional("ŠŤĎŽŇčřšťďňě")
--------------------- 6
Optional("ČŘŠ")
Optional("ŤĎŽŇčřšťďňě")
...
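
The output above can be reproduced with a small variation of the code from the question. The following is only a sketch: `splitPairs` is a hypothetical helper name, not part of Foundation, and it simply collects the decoded halves instead of printing optionals. Because the explicit-endianness encodings do not write a byte-order mark, every chunk that starts at an even offset decodes on its own.

```swift
import Foundation

// Hypothetical helper (not a Foundation API): encodes `text` with the given
// encoding, splits the bytes at every even offset in the first half of the
// data, and decodes both halves with the same encoding.
func splitPairs(of text: String,
                using encoding: String.Encoding) -> [(head: String, tail: String)] {
    guard let data = text.data(using: encoding) else { return [] }
    var result: [(head: String, tail: String)] = []
    for index in stride(from: 2, to: data.count / 2, by: 2) {
        let part1 = data.subdata(in: 0..<index)
        let part2 = data.subdata(in: index..<data.count)
        // Keep only the splits where both halves decode successfully.
        if let head = String(data: part1, encoding: encoding),
           let tail = String(data: part2, encoding: encoding) {
            result.append((head: head, tail: tail))
        }
    }
    return result
}

let sample = "ČŘŠŤĎŽŇčřšťďňě"
let pairs = splitPairs(of: sample, using: .utf16LittleEndian)
for pair in pairs {
    print(pair.head, pair.tail)
}
```

Passing `.utf16BigEndian` instead should behave the same way, as long as the same encoding is used for both directions.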

Longer answer: The utf16 encoding converts the string to little-endian UTF-16 data, prefixed with a byte-order mark (BOM):

let data = "abc".data(using: .utf16)!
print(data as NSData) // <fffe6100 62006300>

When the data is split into two parts, the second part no longer has a leading byte-order mark:

let part1 = data.subdata(in: 0..<4)
let part2 = data.subdata(in: 4..<8)
print(part1 as NSData, part2 as NSData) // <fffe6100> <62006300>

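The part without a byte-order mark is decoded incorrectly: apparently big-endian byte order is assumed when no BOM is present. (Java's "UTF-16" charset makes the same assumption, but it also encodes as big-endian, which is why the Java code in the question works.)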

print(String(data: part1, encoding: .utf16)) // Optional("a")
print(String(data: part2, encoding: .utf16)) // Optional("戀挀")
print(String(data: part2, encoding: .utf16LittleEndian)) // Optional("bc")
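
If the stream already arrives with a BOM and you cannot choose the encoding on the producing side, one possible approach is to read the BOM once, skip it, and decode every chunk with the matching explicit encoding. The sketch below makes several assumptions: `utf16Layout` and `decodeChunks` are hypothetical helper names, data without a BOM is treated as big-endian (matching the behaviour observed above), and the input is known to be UTF-16. It also moves a chunk boundary back by one code unit when it would otherwise fall between the two halves of a surrogate pair, since characters outside the Basic Multilingual Plane occupy two 16-bit code units.

```swift
import Foundation

// Hypothetical helper: looks at the first two bytes and returns the explicit
// encoding to use plus the number of BOM bytes to skip.
// Assumption: data without a BOM is treated as big-endian.
func utf16Layout(of data: Data) -> (encoding: String.Encoding, bomLength: Int) {
    if data.starts(with: [0xFF, 0xFE] as [UInt8]) { return (.utf16LittleEndian, 2) }
    if data.starts(with: [0xFE, 0xFF] as [UInt8]) { return (.utf16BigEndian, 2) }
    return (.utf16BigEndian, 0)
}

// Hypothetical helper: decodes a UTF-16 stream in chunks of at most
// `chunkSize` bytes. Returns nil if any chunk fails to decode.
func decodeChunks(of data: Data, chunkSize: Int) -> [String]? {
    precondition(chunkSize >= 4 && chunkSize % 2 == 0,
                 "chunkSize must be an even number of at least 4 bytes")
    let layout = utf16Layout(of: data)
    let bytes = [UInt8](data)          // zero-based copy of the bytes
    var chunks: [String] = []
    var start = layout.bomLength       // skip the BOM, if there is one
    while start < bytes.count {
        var end = min(start + chunkSize, bytes.count)
        if end < bytes.count {
            // High byte of the last code unit in this chunk.
            let high = layout.encoding == .utf16LittleEndian ? bytes[end - 1] : bytes[end - 2]
            // 0xD800...0xDBFF is a high surrogate: leave it for the next chunk.
            if (0xD8...0xDB).contains(high) { end -= 2 }
        }
        guard let text = String(data: Data(bytes[start..<end]),
                                encoding: layout.encoding) else { return nil }
        chunks.append(text)
        start = end
    }
    return chunks
}

// Build a stream like the one shown above: little-endian BOM plus data.
var stream = Data([0xFF, 0xFE])
stream.append("abc😀def".data(using: .utf16LittleEndian)!)
let pieces = decodeChunks(of: stream, chunkSize: 8)
print(pieces ?? [])   // ["abc", "😀de", "f"]
```

The first chunk comes out two bytes short because the boundary at eight bytes would have cut the emoji's surrogate pair in half. This only protects code-unit pairs; combining sequences that span several code points can still be split across chunks.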