I have a static set of ~35000 开发者_开发问答unique ASCII text strings from 20 to 60 bytes each. I want to introduce a unique index in them. Simply numbering would be undesirable for various reasons.
Crypto-grade functions like MD5 work fine, but I feel those are an overkill. This is ultimately for a mobile project, so I'm kinda greedy on both storage and CPU cycles. On the other hand, I've tried 32-bit Adler32 and got collisions.
Can anyone think of a good hash function that produces a 64-bit value?
Because the set of strings that you have is fixed, you should try looking for a perfect hash function, a hash function specifically designed over a set of data to guarantee no collisions occur. There are many tools for creating hash functions like these, one of which, gperf
(not to be confused with gprof
) I know is freely available. I would strongly suggest this.
If you later end up needing to change the set of strings and want a lightweight, simple hash function, you may want to consider using the Rabin-Karp rolling hash function. It can be computed for a string of length n using O(n) additions, multiplications, and moduli, and ensures that each two strings have pairwise independent hash values. Moreover, you could probably code this up in about half an hour to test whether or not it performs better than the Adler checksum.
That said, using a well-known hash function like MD5 is still probably a good idea if you aren't trying to achieve cryptographic security. Even a simple CRC32 might be sufficient in that case.
Given the fact that the likelihood of collisions decreases so much by going from 64 bit to 128 bit, I would strongly consider going with MD5128.
Max entries before X chance of collision
Bits 10e−18 10e−15 10e−12 10e−9 10e−6 0.1% 1% 25% 50% 75%
----------------------------------------------------------------------------------------------
16 2 2 2 2 2 11 36 1.9e2 3.0e2 4.3e2
32 2 2 2 2.9 93 2.9e3 9.3e3 5.0e4 7.7e4 1.1e5
64 6.1 1.9e2 6.1e3 1.9e5 6.1e6 1.9e8 6.1e8 3.3e9 5.1e9 7.2e9
128 2.6e10 8.2e11 2.6e13 8.2e14 2.6e16 8.3e17 2.6e18 1.4e19 2.2e19 3.1e19
256 4.8e29 1.5e31 4.8e32 1.5e34 4.8e35 1.5e37 4.8e37 2.6e38 4.0e38 5.7e38
384 8.9e48 2.8e50 8.9e51 2.8e53 8.9e54 2.8e56 8.9e56 4.8e57 7.4e57 1.0e58
512 1.6e68 5.2e69 1.6e71 5.2e72 1.6e74 5.2e75 1.6e76 8.8e76 1.4e77 1.9e77
So with 35000 (3.5e4) string, with a 64 bit hash, this gives you something between a 10e^-12 and 10e^-9 chance to have a collision. This might not seem very high, but when it comes to hashing, 1 in a billion is pretty easy to hit.
By increasing to 128 bit, you go down to considerably less than 1 in a (billion * billion).
I think you could concatenate the values of two different 32-bit hash functions to get a 64-bit hash.
To get four different hash functions I would use a pre-processing step that alters the input to the hash function in some way that does not commute with the values in the hash function. One way would be to use a 256-byte lookup table to renumber the bytes. Another might be to multiply each byte by X mod 257, replacing anything that yields 256 = -1 mod 257 by -X mod 257, because that won't otherwise occur. Note that (a*256 + b) mod 257 is a + b mod 257.
FWIW there is a non-secure hash function with quite a good guarantee. As an example, pick a prime number and do all your calculations modulo that number, which gives you a mathematical field. Chop your data up into a sequence of numbers modulo that prime, and treat them as the coefficients of a polynomial. As well as picking the modulus for your hash function you pick a number x mod the prime, and then evaluate the polynomial at that x. In theory x is picked at random.
Two messages map to the same value if the difference of their polynomials is zero, which means that the chosen x is a root of that polynomial. A polynomial of degree N has at most N roots, so in your case - if you have quite short strings and pick a large modulus - that's not a bad guarantee. I think I saw this suggested as a quicker way to get a secure hash function if you encrypt the result of this calculation. I think it was supposed to be faster than MD5 because even though doing arithmetic modulo 128-bit primes is expensive, somebody reckoned it was cheaper than doing MD5.
Settled on 64-bit MurmurHash64B. Extra points for the purry sounding name.
精彩评论