Consistent implementation of tr?

I have a ksh script that generates a long, random string using /dev/urandom and tr:

STRING="$(cat /dev/urandom|tr -dc 'a-zA-Z0-9-_'|fold -w 64 |head -1)"

On the Linux and AIX servers where I used this it resulted in 64 characters of upper and lower case alpha chars, digits, dash and underscore characters. Example:

W-uch3_4fbnk34u2nc08w_nj23n089023ncNjxz979823n23-n88h30pmLCxkMKj

When I used the script on Solaris the ranges were interpreted as literals and it resulted in strings from the set aAzZ09-_. Example:

AA0z9_aZ-a-z00aZ9_azAZa0zZza9-Az0-_za-9aa0az_a0z-0a0z000-A9Z_0a

Oddly, on this Solaris server the man page for tr indicates that the syntax used should have produced the desired result.

The idea is to use /dev/urandom to produce a pseudo-random string from which we extract characters so that the result a) does not contain spaces and b) does not contain shell special characters. The string will be used on the command line as an argument later on in the script. We don't want to use classes like :alnum: because locale can convert these into multi-byte values that don't work on the command line. This ksh one-liner did the trick perfectly on a great many installations until we got to Solaris.

We have temporarily converted this to a somewhat nasty Perl regex. Is there a syntax for tr or some other utility or ksh built-in that will perform this task consistently across UNIX variants and is universally installed? Doesn't have to be a one-liner but simplicity is appreciated.

Update: We tried the Locale settings with no luck. Waiting on results of using xpg6 version.

$ uname -a
SunOS hostname 5.10 Generic_142900-04 sun4u sparc SUNW,SPARC-Enterprise
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
0-a9-z9a_zzZAa_a_0az-9_z0a_90Z_9az09aZzZAa-9aa_-__za0ZA9_ZzzZazA
$ set | grep '^L[AC]'
LANG=C
LC_ALL=C
LC_COLLATE=en_US
LC_CTYPE=en_US
LC_MESSAGES=en_US
LC_MONETARY=en_US
LC_NUMERIC=en_US
LC_TIME=en_US
$ export LC_CTYPE="$LC_ALL" LC_MESSAGES="$LC_ALL"
$ set | grep '^L[AC]'
LANG=C
LC_ALL=C
LC_COLLATE=en_US
LC_CTYPE=C
LC_MESSAGES=C
LC_MONETARY=en_US
LC_NUMERIC=en_US
LC_TIME=en_US
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
0900z9az99_a0za09__0zA0_Z--Z_-Aa-AaA9zAZz-Aa90A00z__ZzA9A-Z0aA_-
$ unset LC_ALL; export LC_COLLATE=C LC_NUMERIC=C LC_TIME=C
$ set | grep '^L[AC]'
LANG=C
LC_COLLATE=C
LC_CTYPE=C
LC_MESSAGES=C
LC_MONETARY=en_US
LC_NUMERIC=C
LC_TIME=C
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
_AA9aA_Za-A0-AZa_A-0ZA--a_za-a9zZZz__a0az_-0A-9-0aA-0za00A-__9-0
$ unset LANG LC_COLLATE LC_NUMERIC LC_TIME
$ set | grep '^L[AC]'
LC_CTYPE=C
LC_MESSAGES=C
LC_MONETARY=en_US
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
_-_9zz9Z-Z-Z-Z_0_a9zzzZZaAa--9_zAZaaAZz-ZaAZ09Z-_z-zz09ZZAzAz0Z0
$ unset LC_CTYPE LC_MESSAGES LC_MONETARY
$ set | grep '^L[AC]'
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
_0aAa9_Z_a_Z--_Az-aa0ZA0ZzZ-9Aa9-Z0--0A_Z0Zaz-AA_Zz0z---Z_99z_a9
$ export LANG=C LC_ALL=C LC_COLLATE=C LC_CTYPE=C LC_MESSAGES=C LC_MONETARY=C LC_NUMERIC=C LC_TIME=C
$ set | grep '^L[AC]'
LANG=C
LC_ALL=C
LC_COLLATE=C
LC_CTYPE=C
LC_MESSAGES=C
LC_MONETARY=C
LC_NUMERIC=C
LC_TIME=C
$ cat /dev/urandom | tr -dc "a-zA-Z0-9-_" | fold -w 64 | head -1 | sed 's/^-/_/'
Za_000z9aa--aA00zAAZza0AA90090--z0a00_zZ9ZA0_---aZZ09a0ZA0_0zZaa
$ cat /dev/urandom | tr -dc "[a-z][A-Z][0-9]-_" | fold -w 64 | head -1 | sed 's/^-/_/'
x7dni9gIXVF6AHQc3B-H6hjnBVHChJ9zM-z5EQ5UEruATI_NNFaCoVLOqM6gVaT5
$

Of course, on Linux that last version spits out square brackets.

If you put /usr/xpg6/bin at the front of your PATH, then it'll work as expected. The locale seems to have no effect here. A cross-platform hack is:

tr -dc '[a-z][A-Z][0-9]_-' < /dev/urandom | tr -d '][' | fold -w64 | head -n1
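Another way to sidestep the range question entirely is to list every allowed character literally, so that tr never has to interpret a-z at all. The following is a minimal sketch under that assumption, not something tested on the Solaris box in question. The function name random_token is made up for illustration, and the set relies on the common tr behaviour that a trailing '-' is taken literally.

```shell
# Hypothetical helper: print one 64-character token made only of
# letters, digits, underscore and dash.
random_token() {
  # Every allowed character is spelled out, so no range expression is used.
  # The '-' comes last in the set so that it is read as a literal dash.
  # LC_ALL=C is applied to this tr invocation only, so the random input
  # is handled as plain bytes.
  LC_ALL=C tr -dc 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789_-' < /dev/urandom |
    fold -w 64 |
    head -n 1
}

STRING=$(random_token)
printf '%s\n' "$STRING"
```

The token can still begin with a dash, so the sed 's/^-/_/' step from the transcript can be appended to the pipeline if the value will be passed as a command-line argument.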

What you've observed is not a difference between operating systems, but different machines having different locale settings. Your Solaris machine has LC_COLLATE set to a non-default value, which is a sure recipe for the kind of problems you have.

Locale settings are set from the environment as follows:

If the environment variable LC_ALL is set, its value is used for all categories.
Otherwise, if LC_FOO is set, its value is used for category LC_FOO.
Otherwise, if LANG is set, its value is used for categories that weren't explicitly set.
The default locale is called C. On Unix systems, POSIX is a synonym for C.
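The precedence rules above can be sketched as a small shell function. This only illustrates the lookup order as described in this answer. The function name effective_locale is hypothetical, and on a real system the locale command reports the settings actually in effect.

```shell
# Hypothetical helper: print the value that applies to one locale
# category (e.g. LC_COLLATE), following the order LC_ALL, LC_FOO, LANG, C.
effective_locale() {
  category=$1
  # 1. LC_ALL, when set and non-empty, overrides every category.
  if [ -n "${LC_ALL:-}" ]; then
    printf '%s\n' "$LC_ALL"
    return
  fi
  # 2. Otherwise the category's own variable is used.
  #    eval performs the indirect lookup; pass only trusted names.
  eval "value=\${$category:-}"
  if [ -n "$value" ]; then
    printf '%s\n' "$value"
  # 3. Otherwise LANG supplies the value.
  elif [ -n "${LANG:-}" ]; then
    printf '%s\n' "$LANG"
  # 4. With nothing set, the default locale is C.
  else
    printf 'C\n'
  fi
}

# LC_ALL wins even though LC_COLLATE is set to something else.
( LC_ALL=C; LC_COLLATE=en_US; effective_locale LC_COLLATE )   # prints C
```

Note that this matches the first state shown in the question's transcript: with LC_ALL=C set, the LC_COLLATE=en_US line reported by set is not the value in effect.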

The main locale categories are:

LC_CTYPE indicates the character set and encoding used for file names, file contents and terminal I/O. You should carefully preserve this setting unless you know it's inaccurate (e.g. because a particular file format specifies a particular encoding).
LC_MESSAGES is the language of the messages that the user sees. You should preserve this setting. If you really need to parse an error message, set LC_MESSAGES=C.
LC_COLLATE indicates the sorting order of characters. It's nearly always undesirable in scripts. Most values other than C cause trouble, such as A-Z matching lowercase letters.
Occasionally LC_NUMERIC may cause trouble because numbers may be printed with different punctuation, and LC_TIME influences the way some commands show a date and time. The other categories hardly ever matter in scripts.
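A quick way to see what LC_COLLATE controls is sort, whose output order is governed by collation. The sketch below forces the C locale for a single command by prefixing it with LC_ALL=C, which leaves the rest of the script's environment alone. The variable name sorted is just for illustration.

```shell
# Sort four letters with the C locale applied to this one command only.
# In the C locale, comparison is by byte value, so 'A' and 'B' (65, 66)
# come before 'a' and 'b' (97, 98).
sorted=$(printf '%s\n' b A a B | LC_ALL=C sort)
printf '%s\n' "$sorted"
# prints A, B, a, b on separate lines
```

Running the same pipeline without the LC_ALL=C prefix under a locale such as en_US will often interleave upper and lower case instead, which is the same mechanism that makes ranges like A-Z unreliable.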

Here's a reasonable strategy for scripts (warning, typed directly into the browser):

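unset LANGUAGE  # a GNU-specific setting
if [ -n "$LC_ALL" ]; then
  # LC_ALL overrides everything: keep its value for encoding and messages,
  # then drop it so the per-category settings below can take effect
  export LC_CTYPE="$LC_ALL" LC_MESSAGES="$LC_ALL"
  unset LC_ALL
  # without this, LC_COLLATE etc. would fall back to their old values or LANG
  export LC_COLLATE=C LC_NUMERIC=C LC_TIME=C
elif [ -n "$LANG" ]; then
  # LANG would otherwise supply these categories, so force them to C
  export LC_COLLATE=C LC_NUMERIC=C LC_TIME=C
else
  # nothing to inherit from: unset means the default C locale
  unset LC_COLLATE LC_NUMERIC LC_TIME
fi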

Standard shell utilities obey the locale settings. Perl doesn't unless you tell it to.

Try: