wujohn1990 - 7 months ago
Python Question

Different results when sorting character vectors in R

I am wondering how R's sorting algorithm works when sorting a character vector:

> a=c("aa(150)","aa(1)S")
> sort(a)
[1] "aa(150)" "aa(1)S"
> a=c("aa(150)","aa(1)")
> sort(a)
[1] "aa(1)" "aa(150)"


Doesn't R compare the integer values of the characters one by one, from left to right? Why can adding a character change the result?
I thought the order would be determined by the "5" and ")" characters, with any characters after them ignored.

For comparison, here is the same sort in Python:

In [1]: a=["aa(150)","aa(1)"]
In [2]: sorted(a)
Out[2]: ['aa(1)', 'aa(150)']
In [3]: a=["aa(150)","aa(1)S"]
In [4]: sorted(a)
Out[4]: ['aa(1)S', 'aa(150)']

Answer

Set the collation locale to "C", which turns off locale-specific sorting in most cases:

Sys.setlocale("LC_COLLATE", "C")
a=c("aa(150)","aa(1)S")
sort(a)
#[1] "aa(1)S"  "aa(150)"
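
To see why the C locale gives this order, it helps to look at the raw character codes. Below is a minimal Python sketch of that comparison. Python's sorted() compares strings by code point, which for ASCII input is the same as byte-wise comparison. The helper first_difference is a hypothetical name written only for this illustration.

```python
def first_difference(s, t):
    """Return (index, char_from_s, char_from_t) at the first position where
    the two strings differ, or None if one is a prefix of the other."""
    for i, (cs, ct) in enumerate(zip(s, t)):
        if cs != ct:
            return i, cs, ct
    return None


a = ["aa(150)", "aa(1)S"]
idx, c1, c2 = first_difference(*a)

# The strings first differ at index 4: "5" (code 53) versus ")" (code 41).
print(idx, repr(c1), ord(c1), repr(c2), ord(c2))  # 4 '5' 53 ')' 41

# ")" has the smaller code, so "aa(1)S" sorts first; the trailing "S" is never reached.
print(sorted(a))  # ['aa(1)S', 'aa(150)']
```

In a code-point comparison, the decision is made at the first differing character, so appending "S" cannot change the order. A locale-aware collation does not have to work that way.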

String collation has to be locale-specific because languages order their characters differently. From the help for ?sort:

The sort order for character vectors will depend on the collating sequence of the locale in use: see Comparison.

We can then go to ?Comparison for:

Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see locales. The collating sequence of locales such as en_US is normally different from C (which should use ASCII) and can be surprising. Beware of making any assumptions about the collation order: e.g. in Estonian Z comes between S and T, and collation is not necessarily character-by-character – in Danish aa sorts as a single letter, after z. In Welsh ng may or may not be a single sorting unit: if it is it follows g.

As mentioned, because each language uses letters in different ways, the locale matters for sorting.
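
The same locale dependence can be reproduced on the Python side of the question with the standard locale module. This is only a sketch: sort_with_locale is a hypothetical helper, and the locale name "en_US.UTF-8" is an assumption that may not be installed on a given system. The result for a non-C locale depends on the platform's collation tables, so no output is claimed for it here.

```python
import locale


def sort_with_locale(strings, locale_name):
    """Sort strings using the collation rules of locale_name.

    Raises locale.Error if the locale is not available on this system.
    The previous LC_COLLATE setting is restored afterwards.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm transforms each string so that plain comparison of the
        # results follows the collation order of the active locale.
        return sorted(strings, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


a = ["aa(150)", "aa(1)S"]

# The C locale compares character codes, like plain sorted().
print(sort_with_locale(a, "C"))  # ['aa(1)S', 'aa(150)']

# Assumed locale name; the order printed here is platform-dependent.
try:
    print(sort_with_locale(a, "en_US.UTF-8"))
except locale.Error:
    print("en_US.UTF-8 is not installed on this system")
```

If the second call prints a different order from the first, that is the same effect as in the R session above: the collation rules of the locale, not the raw character codes, decide the result.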
