开发者

What is a good 64bit hash function in Java for textual strings?

开发者 https://www.devze.com 2022-12-10 11:41 出处:网络
I\'m looking for a hash function that: Hashes textual strings well (e.g. few collisions) Is written in Java, and widely used

I'm looking for a hash function that:

  1. Hashes textual strings well (e.g. few collisions)
  2. Is written in Java, and widely used
  3. Bo开发者_高级运维nus: works on several fields (instead of me concatenating them and applying the hash on the concatenated string)
  4. Bonus: Has a 128-bit variant.
  5. Bonus: Not CPU intensive.


Why don't you use a long variant of the default String.hashCode() (where some really smart guys certainly put effort into making it efficient - not mentioning the thousands of developer eyes that already looked at this code)?

// adapted from String.hashCode()
public static long hash(String string) {
  long h = 1125899906842597L; // prime
  int len = string.length();

  for (int i = 0; i < len; i++) {
    h = 31*h + string.charAt(i);
  }
  return h;
}

If you're looking for even more bits, you could probably use a BigInteger Edit:

As I mentioned in a comment to the answer of @brianegge, there are not much usecases for hashes with more than 32 bits and most likely not a single one for hashes with more than 64 bits:

I could imagine a huge hashtable distributed across dozens of servers, maybe storing tens of billions of mappings. For such a scenario, @brianegge still has a valid point here: 32 bit allow for 2^32 (ca. 4.3 billion) different hash keys. Assuming a strong algorithm, you should still have quite few collisions. With 64 bit (18,446,744,073 billion different keys) your certainly save, regardless of whatever crazy scenario you need it for. Thinking of usecases for 128 bit keys (340,282,366,920,938,463,463,374,607,431 billion possible keys) is pretty much impossible though.

To combine the hash for several fields, simply do an XOR multiply one with a prime and add them:

long hash = MyHash.hash(string1) * 31 + MyHash.hash(string2);

The small prime is in there to avoid equal hash code for switched values, i.e. {'foo','bar'} and {'bar','foo'} aren't equal and should have a different hash code. XOR is bad as it returns 0 if both values are equal. Therefore, {'foo','foo'} and {'bar','bar'} would have the same hash code.


An answer for today (2018). SipHash.

It will be much faster than most of the answers here, and significantly higher quality than all of them.

The Guava library has one: https://google.github.io/guava/releases/23.0/api/docs/com/google/common/hash/Hashing.html#sipHash24--


Create an SHA-1 hash and then mask out the lowest 64bits.


long hash = string.hashCode();

Yes, the top 32 bits will be 0, but you will probably run out of hardware resources before you run into problems with hash collisions. The hashCode in String is quite efficient and well tested.

Update I think the above satisfies the simplest thing which could possibly work, however, I agree with @sfussenegger idea of extending the existing String hashCode.

In addition to having a good hashCode for your String, you may want to consider rehashing the hash code in your implementation. If your storage is used by other developers, or used with other types, this can help distributed your keys. For example, Java's HashMap is based on power-of-two length hash tables, so it adds this function to ensure the lower bits are sufficiently distributed.

    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);


Why not use a CRC64 polynomial. These are reasonably efficient and optimized to make sure all bits are counted and spread over the result space.

There are plenty of implementations available on the net if you google "CRC64 Java"


Do something like this:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.math.BigInteger;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Test {

    public static void main(String[] args) throws NoSuchAlgorithmException,
            IOException {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        DataOutputStream dos = new DataOutputStream(baos);

        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            SomeObject testObject = new SomeObject();

            dos.writeInt(testObject.count);
            dos.writeLong(testObject.product);
            dos.writeDouble(testObject.stdDev);
            dos.writeUTF(testObject.name);
            dos.writeChar(testObject.delimiter);
            dos.flush();

            byte[] hashBytes = md.digest(baos.toByteArray());
            BigInteger testObjectHash = new BigInteger(hashBytes);

            System.out.println("Hash " + testObjectHash);
        } finally {
            dos.close();
        }
    }

    private static class SomeObject {
        private int count = 200;
        private long product = 1235134123l;
        private double stdDev = 12343521.456d;
        private String name = "Test Name";
        private char delimiter = '\n';
    }
}

DataOutputStream lets you write primitives and Strings and have them output as bytes. Wrapping a ByteArrayOutputStream in it will let you write to a byte array, which integrates nicely with MessageDigest. You can pick from any algorithm listed here.

Finally BigInteger will let you turn the output bytes into an easier-to-use number. The MD5 and SHA1 algorithms both produce 128-bit hashes, so if you need 64 you can just truncate.

SHA1 should hash almost anything well, and with infrequent collisions (it's 128-bit). This works from Java, but I'm not sure how it's implemented. It may actually be fairly fast. It works on several fields in my implementation: just push them all onto the DataOutputStream and you're good to go. You could even do it with reflection and annotations (maybe @HashComponent(order=1) to show which fields go into a hash and in what order). It's got a 128-bit variant and I think you'll find it doesn't use as much CPU as you think it will.

I've used code like this to get hashes for huge data sets (by now probably billions of objects) to be able to shard them across many backend stores. It should work for whatever you need it for. Note that I think you may want to only call MessageDigest.getInstance() once and then clone() from then on: IIRC the cloning is a lot faster.


Reverse the string to get another 32-bit hashcode and then combine the two:

String s = "astring";
long upper = ( (long) s.hashCode() ) << 32;
long lower = ( (long) s.reverse().hashCode() ) - ( (long) Integer.MIN_VALUE );
long hash64 = upper + lower;

This is pseudocode; the String.reverse() method doesn't exist and will need to be implemented some other way.


Do you look at Apache commons lang?

But for 64 bit (and 128) you need some tricks: the rules laid out in the book Effective Java by Joshua Bloch help you create 64 bit hash easy (just use long instead of int). For 128 bit you need additional hacks...


DISCLAIMER: This solution is applicable if you wish to efficiently hash individual natural language words. It is inefficient for hashing longer text, or text containing non-alphabetic characters.

I'm not aware of a function but here's an idea that might help:

  • Dedicate 52 of the 64 bits to representing which letters are present in the String. For example, if 'a' were present you'd set bit[0], for 'b' set bit 1, for 'A' set bit[26]. That way, only text containing exactly the same set of letters would have the same "signature".

You could then use the remaining 12 bits to encode the string length (or a modulo value of it) to further reduce collisions, or generate a 12 bit hashCode using a traditional hashing function.

Assuming your input is text-only I can imagine this would result in very few collisions and would be inexpensive to compute (O(n)). Unlike other solutions so far this approach takes the problem domain into account to reduce collisions - It is based off the Anagram Detector described in Programming Pearls (see here).

0

精彩评论

暂无评论...
验证码 换一张
取 消