What is the best 32bit hash function f开发者_如何学Pythonor relatively short strings?
Strings are tag names that consist of English letters, numbers, spaces and some additional characters (#
, $
, .
, ...). For example: Unit testing
, C# 2.0
.
I am looking for 'best' as in 'minimal collisions', performance is not important for my goals.
I'm not sure if it's the best choice, but here is a hash function for strings:
The Practice of Programming (HASH TABLES, pg. 57)
/* hash: compute hash value of string */
unsigned int hash(char *str)
{
unsigned int h;
unsigned char *p;
h = 0;
for (p = (unsigned char*)str; *p != '\0'; p++)
h = MULTIPLIER * h + *p;
return h; // or, h % ARRAY_SIZE;
}
Empirically, the values 31 and 37 have proven to be good choices for the multiplier in a hash function for ASCII strings.
If performance isn't important, simply take a secure hash such as MD5 or SHA1, and truncate its output to 32 bits. This will give you a distribution of hash codes that's indistinguishable from random.
I'm sorry for the very late reply on this. Earlier this year I composed a page titled Hashing Short Strings which might be helpful in this discussion. In summary, I found that CRC-32 and FNV-1a are superior for hashing short strings. They are efficient and produced widely distributed and collision free hashes in my tests. I was surprised to find that MD5, SHA-1 and SHA-3 produced small numbers of collisions when the output was folded down to 32-bits.
That depends on your hardware.
On modern hardware, i.e. Intel/AMD with SSE4.2 or arm7 you should use the internal _mm_crc32_uxx
intrinsics, as they are optimal for short strings. (For long keys also, but then better use Adler's threaded version, as in zlib)
On old or unknown hardware, either run-time probe for the SSE4.2 or CRC32 feature or just use one if the simple good hash functions. E.g. Murmur2 or City
An overview of quality and performance is here: https://github.com/rurban/smhasher#smhasher
There are also all the implementations. Favored are https://github.com/rurban/smhasher/blob/master/crc32_hw.c and https://github.com/rurban/smhasher/blob/master/MurmurHash2.cpp
If you know the keys in advance, use a perfect hash, not a hash function. E.g. gperf or my phash: https://github.com/rurban/Perfect-Hash#name
Nowadays perfect hash generation via a c compiler is so fast, you can even create them on the fly, and dynaload it.
Use MaPrime2c
hash function:
static const unsigned char sTable[256] =
{
0xa3,0xd7,0x09,0x83,0xf8,0x48,0xf6,0xf4,0xb3,0x21,0x15,0x78,0x99,0xb1,0xaf,0xf9,
0xe7,0x2d,0x4d,0x8a,0xce,0x4c,0xca,0x2e,0x52,0x95,0xd9,0x1e,0x4e,0x38,0x44,0x28,
0x0a,0xdf,0x02,0xa0,0x17,0xf1,0x60,0x68,0x12,0xb7,0x7a,0xc3,0xe9,0xfa,0x3d,0x53,
0x96,0x84,0x6b,0xba,0xf2,0x63,0x9a,0x19,0x7c,0xae,0xe5,0xf5,0xf7,0x16,0x6a,0xa2,
0x39,0xb6,0x7b,0x0f,0xc1,0x93,0x81,0x1b,0xee,0xb4,0x1a,0xea,0xd0,0x91,0x2f,0xb8,
0x55,0xb9,0xda,0x85,0x3f,0x41,0xbf,0xe0,0x5a,0x58,0x80,0x5f,0x66,0x0b,0xd8,0x90,
0x35,0xd5,0xc0,0xa7,0x33,0x06,0x65,0x69,0x45,0x00,0x94,0x56,0x6d,0x98,0x9b,0x76,
0x97,0xfc,0xb2,0xc2,0xb0,0xfe,0xdb,0x20,0xe1,0xeb,0xd6,0xe4,0xdd,0x47,0x4a,0x1d,
0x42,0xed,0x9e,0x6e,0x49,0x3c,0xcd,0x43,0x27,0xd2,0x07,0xd4,0xde,0xc7,0x67,0x18,
0x89,0xcb,0x30,0x1f,0x8d,0xc6,0x8f,0xaa,0xc8,0x74,0xdc,0xc9,0x5d,0x5c,0x31,0xa4,
0x70,0x88,0x61,0x2c,0x9f,0x0d,0x2b,0x87,0x50,0x82,0x54,0x64,0x26,0x7d,0x03,0x40,
0x34,0x4b,0x1c,0x73,0xd1,0xc4,0xfd,0x3b,0xcc,0xfb,0x7f,0xab,0xe6,0x3e,0x5b,0xa5,
0xad,0x04,0x23,0x9c,0x14,0x51,0x22,0xf0,0x29,0x79,0x71,0x7e,0xff,0x8c,0x0e,0xe2,
0x0c,0xef,0xbc,0x72,0x75,0x6f,0x37,0xa1,0xec,0xd3,0x8e,0x62,0x8b,0x86,0x10,0xe8,
0x08,0x77,0x11,0xbe,0x92,0x4f,0x24,0xc5,0x32,0x36,0x9d,0xcf,0xf3,0xa6,0xbb,0xac,
0x5e,0x6c,0xa9,0x13,0x57,0x25,0xb5,0xe3,0xbd,0xa8,0x3a,0x01,0x05,0x59,0x2a,0x46
};
#define PRIME_MULT 1717
unsigned int
maPrime2cHash (unsigned char *str, unsigned int len)
{
unsigned int hash = len, i;
for (i = 0; i != len; i++, str++)
{
hash ^= sTable[( *str + i) & 255];
hash = hash * PRIME_MULT;
}
return hash;
}
and look at www.amsoftware.narod.ru/algo2.html for MaFastPrime, MaRushPrime, etc tests.
You might check out murmurhash2. It is fast, also for small strings, and has a good mixing final step so it is even good mixed for very small strings.
If it's rare that users add new tags, then you can use a perfect hash (http://en.wikipedia.org/wiki/Perfect_hash_function) that's recomputed each time a new tag is added. Of course, without knowing the problem you are really trying to solve, it's guesswork to figure out what you might do.
If your program needs communicate with other system, it is better to use a algorithm which is well known. The quick & dirty way is using first Several characters of md5 hash. You don't need spend hours or days to invent wheels in your project.
The disadvantage is get much much high chance to collisions. However, if your hash is for a time-stamped session, or short life circule task. There is no problem to use that.
精彩评论