开发者

uniform 16bit hash function for strings

开发者 https://www.devze.com 2023-03-29 16:50 出处:网络
I have about 50,000 words that I want to map each of them to a 16bit number and I\'m seeking for a hash function to run on j2me.

I have about 50,000 words that I want to map each of them to a 16bit number and I'm seeking for a hash function to run on j2me. To be more specific I'm looking for a hash function with below criteria:

  1. few (or no) collisions
  2. light CPU load
  3. I have all of the words now
  4. Avalanche effect is not important, since it's not about security. It' just a look-up table.

I've tested java St开发者_StackOverflowrign.hashCode(), murmur hash, jenkins one at a time and a few simple hand-made ones but all of them have at least 30% collisions.

The minimal perfect hashing seems to have heavy CPU load for a small mobile phone too.

Can anybody help me with this?

note: As you know murmur algorithm needs a seed and different seeds have different uniformity. How can I find the seed with minimum collisions?

Thank you in advance


You could look into an old-fashioned CRC. They are very fast and reasonably collision-free. Just not quite at 16 bits, as this experiment seems to indicate. But nevertheless, you might give it a try, maybe it's good enough for your purposes.


This answer might be late, but for reference, MurmurHash 3 is sufficiently fast to satisfy your speed criteria. However due to the constraint you have imposed, collisions will be quite common, since 16 bits can represent a range of 65536 values, your 50000 words would create some collisions.

Solutions:

  • use 20+ bits for the key (with 32 bits, there is one collision in a few million samples)
  • write a test program to find a fit seed for 16 bits, here are some useful tools: http://code.google.com/p/smhasher/


Here is the function I use in C# to map file name to 16 bit number. In my tests it performed better than Pearson hashing.

    public static unsafe int Get16BitHash(string str)
    {
        int hash = 0;
        int len = str.Length;

        fixed (char* ch = str)
        {
            for (int i = 0; i < len; i++)
            {
                hash = hash + ((hash) << 5) + *(ch + i) + ((*(ch + i)) << 7);
            }
        }

        return ((hash) ^ (hash >> 16)) & 0xffff;
    }
0

精彩评论

暂无评论...
验证码 换一张
取 消