
hashCode, implementation, and relation to HashMap

So I asked another related question here: java string hash function with avalanche effect, but I have a different, related question now.


What I established in that question was that the hashCode() function for String does not have an avalanche effect. This means, for example, that if I have strings "k1", "k2", "k3", and I call hashCode() on each, the values returned will be contiguous.
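For example, a quick check shows exactly how regular the values are (the class name is just for illustration):

public class ContiguousHashes {
    public static void main(String[] args) {
        // String.hashCode() is s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1],
        // so keys that differ only in the last character get consecutive values.
        for (String key : new String[] {"k1", "k2", "k3"}) {
            System.out.println(key + " -> " + key.hashCode()); // prints 3366, 3367, 3368
        }
    }
}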

Now, based on my recollection of data structures 101, I was under the impression that this is a bad thing, because HashMap presumably chooses buckets with an algorithm something like:

class HashMap {
    private int capacity;
    private int chooseBucket(String key) {
        return key.hashCode() % capacity;
    }
}

That would mean similar keys are stored in contiguous buckets, leading to a higher rate of collisions and degrading big-O lookup time from O(1) to...who knows how bad...maybe worse than O(log n).
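For instance, with an assumed capacity of 16 (a number I picked purely for illustration), those keys would land like this:

public class NaiveBuckets {
    public static void main(String[] args) {
        int capacity = 16; // assumed table size, just for illustration
        for (String key : new String[] {"k1", "k2", "k3"}) {
            int bucket = (key.hashCode() & 0x7fffffff) % capacity; // mask so negative hashCodes don't break %
            System.out.println(key + " -> bucket " + bucket);      // buckets 6, 7, 8: right next to each other
        }
    }
}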

The types of answers I got to my first question were along the lines of 'avalanche effect isn't needed here', 'it's only for cryptographic hash functions', and 'the hashCode implementation for Strings is fast and works well for small hash maps'.

Which confuses me. All data structures are fast when they're small. Wouldn't Sun provide a default hashCode function that will work well for large data sets? That's when the performance of HashMap really matters anyway, isn't it?

Or, am I missing something? Please enlighten me.


Storing keys in contiguous buckets does not cause performance degradation; storing many keys in the same bucket (e.g., in a chain) does. When chaining is used to resolve hash collisions:

  • Worst-case scenario: every hash value is the same, so all elements end up in the same bucket and you get O(n) lookups (assuming the chains are linked lists); the sketch after this list shows why.
  • Best-case scenario: every hash value is different, so each element ends up in its own bucket and you get the expected O(1) lookups.
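
To make that worst case concrete, here is a deliberately minimal chained table, a toy sketch rather than the real java.util.HashMap implementation (class and method names are mine):

import java.util.LinkedList;

// A toy chained hash table, only to show why same-bucket pile-ups hurt.
// This is NOT how java.util.HashMap is implemented.
class ToyChainedMap<K, V> {
    private static final class Entry<K, V> {
        final K key;
        V value;
        Entry(K key, V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Entry<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    ToyChainedMap(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    private int bucketOf(K key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    void put(K key, V value) {
        LinkedList<Entry<K, V>> chain = buckets[bucketOf(key)];
        for (Entry<K, V> e : chain) {
            if (e.key.equals(key)) { e.value = value; return; }
        }
        chain.add(new Entry<>(key, value));
    }

    V get(K key) {
        // If every key hashed to the same bucket, this loop degrades to O(n).
        for (Entry<K, V> e : buckets[bucketOf(key)]) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }
}

As long as keys spread across buckets, each chain stays short and get() is effectively O(1); the chain walk only becomes expensive when many keys collide into one bucket.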

Hash codes for use in hash tables (and the like) do not need an avalanche effect.


You asked "Or, am I missing something? Please enlighten me."

Yes, you are missing something.

The HashMap implementation itself protects against poor hash functions:

/**
 * Applies a supplemental hash function to a given hashCode, which
 * defends against poor quality hash functions.  This is critical
 * because HashMap uses power-of-two length hash tables, that
 * otherwise encounter collisions for hashCodes that do not differ
 * in lower bits. Note: Null keys always map to hash 0, thus index 0.
 */
static int hash(int h) {
    // This function ensures that hashCodes that differ only by
    // constant multiples at each bit position have a bounded
    // number of collisions (approximately 8 at default load factor).
    h ^= (h >>> 20) ^ (h >>> 12);
    return h ^ (h >>> 7) ^ (h >>> 4);
}

So your resulting hashcodes in your example are:

k1 - Before: 3366 After: 3566
k2 - Before: 3367 After: 3567
k3 - Before: 3368 After: 3552
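
Those numbers are easy to reproduce; the following sketch (class name is mine) simply re-applies the supplemental hash shown above to String.hashCode():

public class RehashDemo {
    // Copy of the supplemental hash shown above (JDK 6 era HashMap).
    static int hash(int h) {
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        for (String key : new String[] {"k1", "k2", "k3"}) {
            int before = key.hashCode();
            System.out.println(key + " - Before: " + before + " After: " + hash(before));
        }
    }
}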

So even in your small sample size of 3 elements, one of them got rehashed. Now, this doesn't guard against aggressively evil hashCodes (return randomInt(); or return 4; simply can't be protected against) but it does guard against poorly written hashcodes.

I should also point out that you can change things a great deal by using non-trivial inputs. Consider, for example, the following strings.

k1longer - Before: 1237990607 After: 1304548342
k2longer - Before: 2125494288 After: 2040627866
k3longer - Before: -1281969327 After: -1178377711

Notice how different the lower bits are: the low bits are the only ones the hash table ever looks at, because the size of the backing array is always a power of two. It's actually documented that way in the code:

/**
 * The table, resized as necessary. Length MUST Always be a power of two.
 */
transient Entry[] table;
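
And since the table length is a power of two, choosing a bucket only ever looks at the low bits; the index computation in that vintage of the source boils down to a mask, roughly like this sketch:

// Sketch of the bucket selection for a power-of-two table length:
// masking with (length - 1) keeps only the low bits of the (already rehashed) value.
static int indexFor(int h, int length) {
    return h & (length - 1);
}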

The rehashing does a fairly decent job of making sure that the upper bits (which are typically ignored by the hash table) still have an impact on the lower bits. Here's the mapping of original hashcode bit positions and the bits they affect:

00: 00000000000000000000000000000001
01: 00000000000000000000000000000010
02: 00000000000000000000000000000100
03: 00000000000000000000000000001000
04: 00000000000000000000000000010001
05: 00000000000000000000000000100010
06: 00000000000000000000000001000100
07: 00000000000000000000000010001001
08: 00000000000000000000000100010010
09: 00000000000000000000001000100100
10: 00000000000000000000010001001000
11: 00000000000000000000100010010000
12: 00000000000000000001000100100001
13: 00000000000000000010001001000010
14: 00000000000000000100010010000100
15: 00000000000000001000100100001000
16: 00000000000000010001001000010001
17: 00000000000000100010010000100010
18: 00000000000001000100100001000100
19: 00000000000010001001000010001001
20: 00000000000100010010000100010011
21: 00000000001000100100001000100110
22: 00000000010001001000010001001100
23: 00000000100010010000100010011000 # means a 1 in the 23rd bit position will  
24: 00000001000100100001000100110001  # cause positions 4, 5, 8, 12, and 20 to 
25: 00000010001001000010001001100010  # also be altered
26: 00000100010010000100010011000100
27: 00001000100100001000100110001001
28: 00010001001000010001001100010010
29: 00100010010000100010011000100100
30: 01000100100001000100110001001000
31: 10001001000010001001100010010000
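
You can regenerate that table yourself by pushing each single-bit input through the same hash(int) shown above (a small sketch; the class name is mine):

public class BitSpread {
    static int hash(int h) { // same supplemental hash as above
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

    public static void main(String[] args) {
        for (int bit = 0; bit < 32; bit++) {
            // A single set input bit, run through hash(), shows which output bits it influences.
            String bits = Integer.toBinaryString(hash(1 << bit));
            String padded = String.format("%32s", bits).replace(' ', '0');
            System.out.printf("%02d: %s%n", bit, padded);
        }
    }
}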

So, your worries about "degrading big-O lookup time from O(1) to be...who knows how bad...maybe worse than O(log n)" and "Wouldn't Sun provide a default hashCode function that will work well for large data sets?" may be put to rest - they have safeguards in place to prevent that from happening.

If it gives you any peace of mind, here are the author tags for this class. They're literally all stars in the Java world. (Comments with # are mine.)

 * @author  Doug Lea          # Formerly a Java Community Process Executive Committee member
 * @author  Josh Bloch        # Chief Java architect at Google, amongst other things
 * @author  Arthur van Hoff   # Done too many hardcore Java things to list...
 * @author  Neal Gafter       # Now a lead on the C# team at Microsoft, used to be team lead on javac 


I read a blog entry from Eric Lippert the other day titled Guidelines and rules for GetHashCode. Although the code examples are relevant to C#, most of the general principles apply equally well to Java. This article is well worth a read if you want to understand more about what hash codes are used for and how they should be generated.

In particular, the following bit seems particularly relevant to your question:

Guideline: the distribution of hash codes must be "random"

By a "random distribution" I mean that if there are commonalities in the objects being hashed, there should not be similar commonalities in the hash codes produced.


A hash function for something like a HashMap needs to be reasonably unique over its key set, but the relationship between the keys (i.e., how alike two keys are) need not be random. What we really want to avoid is a bunch of objects landing in a single bucket, which would make searching that bucket expensive.

In the case of HashMap and String keys, the map has to turn those hash codes into an offset into a randomly accessible container such as an array. There are a number of ways to do that, but even if two keys are "close", they will still generally be placed in different buckets, which is all we really care about.

For very large Map containers (think billions of keys) we probably do want to be a little more clever but that seems beyond what Java's HashMap was designed for.

One final note: you don't need an avalanche effect to produce well-distributed hash codes for Strings. You want to choose a function that is random enough while being as fast as possible.


If you have a look at the source code of HashMap, you'll see that a hash function is applied to the key.hashCode() value, meaning the map does its own supplemental hashing. One point to be sure about is to obey the equals and hashCode contract. If you are looking for a performance improvement, I would suggest looking into the source code to understand the number of buckets available and how to use them optimally.
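
For clarity, obeying the equals and hashCode contract means that any two objects that are equal() must return the same hashCode(); a minimal sketch of a well-behaved key class (class and field names are purely illustrative):

import java.util.Objects;

// Minimal illustration of the equals/hashCode contract:
// keys that are equal() must produce the same hashCode().
final class AccountKey {
    private final String region;
    private final long id;

    AccountKey(String region, long id) {
        this.region = region;
        this.id = id;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof AccountKey)) return false;
        AccountKey other = (AccountKey) o;
        return id == other.id && Objects.equals(region, other.region);
    }

    @Override
    public int hashCode() {
        return Objects.hash(region, id); // same fields as equals(), so the contract holds
    }
}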

