What are the R sorting rules of character vectors?

Developer https://www.devze.com 2023-03-31 05:45 Source: Internet
R sorts character vectors in a sequence which I describe as alphabetic, not ASCII.

For example:

sort(c("dog", "Cat", "Dog", "cat"))
[1] "cat" "Cat" "dog" "Dog"

Three questions:

  1. What is the technically correct terminology to describe this sort order?
  2. I cannot find any reference to this in the manuals on CRAN. Where can I find a description of the sorting rules in R?
  3. Is this any different from the sorting behaviour in other languages like C, Java, Perl or PHP?


The Details section of help(sort) states:

 The sort order for character vectors will depend on the collating
 sequence of the locale in use: see ‘Comparison’.  The sort order
 for factors is the order of their levels (which is particularly
 appropriate for ordered factors).

and help(Comparison) then shows:

 Comparison of strings in character vectors is lexicographic within
 the strings using the collating sequence of the locale in use: see
 ‘locales’.  The collating sequence of locales such as ‘en_US’ is
 normally different from ‘C’ (which should use ASCII) and can be
 surprising.  Beware of making _any_ assumptions about the 
 collation order: e.g. in Estonian ‘Z’ comes between ‘S’ and ‘T’,
 and collation is not necessarily character-by-character - in
 Danish ‘aa’ sorts as a single letter, after ‘z’.  In Welsh ‘ng’
 may or may not be a single sorting unit: if it is it follows ‘g’.
 Some platforms may not respect the locale and always sort in
 numerical order of the bytes in an 8-bit locale, or in Unicode
 point order for a UTF-8 locale (and may not sort in the same order
 for the same language in different character sets).  Collation of
 non-letters (spaces, punctuation signs, hyphens, fractions and so
 on) is even more problematic.

So the sort order depends on your locale setting.
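
To check which collating sequence a session is using, and to see what the C order looks like for the example in the question, a minimal sketch along these lines may help. It assumes only base R. The variable names are illustrative. As far as the documentation for sort() goes, method = "radix" collates character vectors as in the C locale regardless of the session locale, so its result should not depend on your settings, whereas the default sort() result does.

```r
x <- c("dog", "Cat", "Dog", "cat")

# Collating sequence currently in use (for example "en_US.UTF-8" or "C")
Sys.getlocale("LC_COLLATE")

# Locale-dependent order: the result varies with LC_COLLATE
locale_sorted <- sort(x)

# Radix sort collates as in the C locale (by byte value),
# so uppercase letters come before lowercase ones
c_sorted <- sort(x, method = "radix")
print(c_sorted)
# [1] "Cat" "Dog" "cat" "dog"
```

If locale_sorted and c_sorted differ, the session is not collating in the C locale.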


Sorting depends on locale. My solution for that is the following...

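I create a ~/.Renviron file containing the line LC_ALL=C. The leading # below appears to mark the output of cat, as it does for the R output further down; in the file itself a leading # would comment the setting out.

cat ~/.Renviron 
#LC_ALL=C

Then sorting in R uses the C locale: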

x <- c("A", "B", "d", "F", "g", "H")
sort(x)
#[1] "A" "B" "F" "H" "d" "g"
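
As an alternative to editing the startup file, the collation category can also be switched for the running session only. The following is a minimal sketch using base R's Sys.getlocale() and Sys.setlocale(); the variable names are illustrative, and it assumes the previous locale string can be passed back to Sys.setlocale() to restore it.

```r
x <- c("A", "B", "d", "F", "g", "H")

# Remember the current collation setting so it can be restored afterwards
old_collate <- Sys.getlocale("LC_COLLATE")

# Switch only the collation category to C for this session
invisible(Sys.setlocale("LC_COLLATE", "C"))
c_order <- sort(x)
print(c_order)
# [1] "A" "B" "F" "H" "d" "g"

# Restore the previous setting
invisible(Sys.setlocale("LC_COLLATE", old_collate))
```

Unlike the ~/.Renviron approach, which is read at startup, this change lasts only until the setting is restored or the session ends.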