R sorts character vectors in an order that I would describe as alphabetic rather than ASCII.
For example:
sort(c("dog", "Cat", "Dog", "cat"))
[1] "cat" "Cat" "dog" "Dog"
Three questions:
- What is the technically correct terminology to describe this sort order?
- I cannot find any reference to this in the manuals on CRAN. Where can I find a description of the sorting rules in R?
- Is this any different from the sorting behaviour in other languages like C, Java, Perl or PHP?
The Details section of help(sort) states:
The sort order for character vectors will depend on the collating sequence of the locale in use: see ‘Comparison’. The sort order for factors is the order of their levels (which is particularly appropriate for ordered factors).
and help(Comparison) then shows:
Comparison of strings in character vectors is lexicographic within the strings using the collating sequence of the locale in use: see ‘locales’. The collating sequence of locales such as ‘en_US’ is normally different from ‘C’ (which should use ASCII) and can be surprising. Beware of making _any_ assumptions about the collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’, and collation is not necessarily character-by-character - in Danish ‘aa’ sorts as a single letter, after ‘z’. In Welsh ‘ng’ may or may not be a single sorting unit: if it is it follows ‘g’. Some platforms may not respect the locale and always sort in numerical order of the bytes in an 8-bit locale, or in Unicode point order for a UTF-8 locale (and may not sort in the same order for the same language in different character sets). Collation of non-letters (spaces, punctuation signs, hyphens, fractions and so on) is even more problematic.
So the sort order depends on your locale setting.
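To see the difference on your own machine, something like the following sketch may help. It assumes R 3.3.0 or later, where sort() accepts method = "radix"; that method is documented as not respecting the locale for character vectors, so it gives C-locale order whatever the session's collation is. The result of the plain sort(x) call depends on your locale, so no output is shown for it.

```r
x <- c("dog", "Cat", "Dog", "cat")

# Collating locale currently in use (the value varies by system)
Sys.getlocale("LC_COLLATE")

# Default method for character vectors: follows the locale's collating sequence
sort(x)

# method = "radix" ignores the locale and orders as in the C locale
sort(x, method = "radix")
#[1] "Cat" "Dog" "cat" "dog"
```

In the C locale all upper-case ASCII letters come before all lower-case ones, which is why "Cat" and "Dog" lead the second result.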
Sorting depends on locale. My solution for that is the following...
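I create a ~/.Renviron file containing the line LC_ALL=C (the leading # below marks output, as in the R example; it is not part of the file):
cat ~/.Renviron
#LC_ALL=C
R then sorts in the C locale: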
x=c("A", "B", "d", "F", "g", "H")
sort(x)
#[1] "A" "B" "F" "H" "d" "g"
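If you would rather not change the locale for every session, a possible alternative is to switch only the collation category temporarily. This is a minimal sketch, not part of the answer above: sort_c_locale is a hypothetical helper name, and it relies on Sys.getlocale() and Sys.setlocale() with the "LC_COLLATE" category.

```r
# Hypothetical helper: sort a character vector with C-locale collation,
# then restore whatever collation the session had before.
sort_c_locale <- function(x) {
  old <- Sys.getlocale("LC_COLLATE")
  # Restore the previous collation when the function exits, even on error
  on.exit(Sys.setlocale("LC_COLLATE", old), add = TRUE)
  Sys.setlocale("LC_COLLATE", "C")
  sort(x)
}

x <- c("A", "B", "d", "F", "g", "H")
sort_c_locale(x)
#[1] "A" "B" "F" "H" "d" "g"
```

Only LC_COLLATE is touched here, so other locale-dependent behaviour (character classification, date formats, messages) stays as it was.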