Dmytro Plekhotkin - 3 months ago
Swift Question

What does it mean that string and character comparisons in Swift are not locale-sensitive?

I have started learning Swift and I am curious: what does it mean that string and character comparisons in Swift are not locale-sensitive? Does it mean that all characters are stored in Swift as UTF-8 characters?

Answer

Comparing Swift strings with < performs a lexicographical comparison based on the so-called "Unicode Normalization Form D" (which can be computed with decomposedStringWithCanonicalMapping).

For example, the decomposition of

"ä" = U+00E4 = LATIN SMALL LETTER A WITH DIAERESIS

is the sequence of two Unicode code points

U+0061,U+0308 = LATIN SMALL LETTER A + COMBINING DIAERESIS
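The decomposition can be checked directly. This is a minimal sketch in current Swift syntax, assuming Foundation is available; scalarDump is a hypothetical helper introduced only for this illustration.

```swift
import Foundation

// Hypothetical helper (not part of any library): the Unicode scalars of a
// string as zero-padded, uppercase hexadecimal strings.
func scalarDump(_ s: String) -> [String] {
    return s.unicodeScalars.map { scalar in
        let hex = String(scalar.value, radix: 16, uppercase: true)
        return String(repeating: "0", count: max(0, 4 - hex.count)) + hex
    }
}

let precomposed = "\u{00E4}"  // "ä" as a single code point
let decomposed = precomposed.decomposedStringWithCanonicalMapping  // Normalization Form D

print(scalarDump(precomposed))  // ["00E4"]
print(scalarDump(decomposed))   // ["0061", "0308"]

// Both forms are canonically equivalent, so == treats them as equal.
print(precomposed == decomposed)  // true
```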

For demonstration purposes, I have written a small String extension which dumps the contents of the String as an array of Unicode code points:

extension String {
    var unicodeData : [String] {
        return map(Array(self.unicodeScalars)) {
            NSString(format: "%04X", $0.value)
        }
    }
}

Now let's take some strings and sort them with <:

var someStrings = ["ǟ", "ä", "ã", "a", "ă", "b"]
sort(&someStrings, <)
println(someStrings)
// [a, ã, ă, ä, ǟ, b]

and dump the Unicode code points of each string (in original and decomposed form) in the sorted array:

for str in someStrings {
    println("\(str)  \(str.unicodeData)  \(str.decomposedStringWithCanonicalMapping.unicodeData)")
}

The output

a  [0061]  [0061]
ã  [00E3]  [0061, 0303]
ă  [0103]  [0061, 0306]
ä  [00E4]  [0061, 0308]
ǟ  [01DF]  [0061, 0308, 0304]
b  [0062]  [0062]

nicely shows that the comparison is done by a lexicographic ordering of the Unicode code points in the decomposed form.
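The rule stated above can also be spelled out explicitly. The following is a minimal sketch in current Swift syntax (the code in this answer uses an older Swift version); nfdScalars and nfdPrecedes are hypothetical helper names, not standard library functions. It compares the decomposed code points directly, so it does not depend on how the built-in < is implemented.

```swift
import Foundation

// Hypothetical helper: the code points of the string in Normalization Form D.
func nfdScalars(_ s: String) -> [UInt32] {
    return s.decomposedStringWithCanonicalMapping.unicodeScalars.map { $0.value }
}

// Hypothetical comparison: lexicographic ordering of the NFD code points.
func nfdPrecedes(_ lhs: String, _ rhs: String) -> Bool {
    return nfdScalars(lhs).lexicographicallyPrecedes(nfdScalars(rhs))
}

let sortedByNFD = ["ǟ", "ä", "ã", "a", "ă", "b"].sorted(by: nfdPrecedes)
print(sortedByNFD)  // ["a", "ã", "ă", "ä", "ǟ", "b"]
```

The result matches the order listed in the output above, because [0061] precedes every longer sequence that starts with 0061, and 0303 < 0306 < 0308.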

This is also true for strings of more than one character, as the following example shows. With

var someStrings = ["ǟψ", "äψ", "ǟx", "äx"]
sort(&someStrings, <)

the output of the above loop is

äx  [00E4, 0078]  [0061, 0308, 0078]
ǟx  [01DF, 0078]  [0061, 0308, 0304, 0078]
ǟψ  [01DF, 03C8]  [0061, 0308, 0304, 03C8]
äψ  [00E4, 03C8]  [0061, 0308, 03C8]

which means that

"äx" < "ǟx", but "äψ" > "ǟψ"

(which was at least unexpected for me).
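The reason becomes visible when looking at the first position where the decomposed sequences differ: the combining macron U+0304 is compared against the character that follows "ä", and x (U+0078) is smaller than U+0304 while ψ (U+03C8) is larger. A minimal sketch in current Swift syntax, where nfdScalars and firstDifference are hypothetical helpers written only for this illustration:

```swift
import Foundation

// Hypothetical helper: the code points of the string in Normalization Form D.
func nfdScalars(_ s: String) -> [UInt32] {
    return s.decomposedStringWithCanonicalMapping.unicodeScalars.map { $0.value }
}

// Hypothetical helper: the first pair of NFD code points at which the two
// strings differ, or nil if one sequence is a prefix of the other.
func firstDifference(_ lhs: String, _ rhs: String) -> (UInt32, UInt32)? {
    for (l, r) in zip(nfdScalars(lhs), nfdScalars(rhs)) where l != r {
        return (l, r)
    }
    return nil
}

if let d = firstDifference("äx", "ǟx") {
    // x (0x78) is compared with the combining macron (0x304)
    print(String(d.0, radix: 16, uppercase: true), String(d.1, radix: 16, uppercase: true))  // 78 304
}
if let d = firstDifference("äψ", "ǟψ") {
    // ψ (0x3C8) is compared with the combining macron (0x304)
    print(String(d.0, radix: 16, uppercase: true), String(d.1, radix: 16, uppercase: true))  // 3C8 304
}
```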

Finally, let's compare this with a locale-sensitive ordering, for example Swedish:

var someStrings = ["ǟ", "ä", "ã", "a", "ă", "b"]
let locale = NSLocale(localeIdentifier: "se") // note: "se" is the ISO 639-1 code for Northern Sami; Swedish is "sv"
sort(&someStrings) {
    (o1, o2) in
    let s1 = o1 as NSString
    let s2 = o2 as NSString
    return s1.compare(s2, options: nil, range: NSMakeRange(0, s1.length), locale: locale)
        == NSComparisonResult.OrderedAscending
}
println(someStrings)
// [a, ă, ä, ǟ, ã, b]

As you see, the result is different from the Swift < sorting.
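For reference, a locale-sensitive sort might look as follows in current Swift syntax. This is only a sketch under assumptions: it uses "sv", the ISO 639-1 code for Swedish, as the locale identifier, and the resulting order depends on the collation data available on the platform, so no expected output is given.

```swift
import Foundation

// Assumption: "sv" is used here as the identifier for a Swedish locale.
let swedish = Locale(identifier: "sv")

let words = ["ǟ", "ä", "ã", "a", "ă", "b"]

// Sort using Foundation's locale-aware comparison instead of <.
let localeSorted = words.sorted { lhs, rhs in
    lhs.compare(rhs, options: [], range: nil, locale: swedish) == .orderedAscending
}

// The order depends on the platform's collation rules for the locale.
print(localeSorted)
```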