Acceptable types to use as keys in a HashTable

Developer https://www.devze.com 2022-12-09 23:41 Source: Internet
I must admit to having only a rudimentary understanding of how HashTables work, although from what little I do know it seems fairly straightforward. My question is just this: it seems that the conventional wisdom is to use simple, basic value types such as integers for the keys in a HashTable. However, strings are also often used, even though in many languages they are implemented as reference types. What I feel is generally not encouraged is using complex reference types; I'm guessing this is because doing so would necessitate a slower hash function? But then why are strings so commonly used? After all, isn't a string internally a char[] array (again, in most languages)?

In the end, what value types are generally regarded as the "best" (or even simply "acceptable") choices to use as keys in a HashTable? And are there any commonly used choices that are actually regarded as "bad" (like strings, possibly)?


It's not a matter of strings versus integers, or value versus reference, but of mutable keys versus immutable keys. As long as the keys are immutable (and thus their hash value never changes), they are OK to index a hash table. For instance, strings in Java are immutable and thus perfectly suited as hashtable keys.

By the way, if a data type is simple enough to always be passed by value (like scalars), then it will of course be OK.

But now imagine that you use a mutable type: if you give me a reference to one of these objects as a key, I will compute its hash value and then put it in one of my hashtable buckets. But when you later modify the object, I will have no way of being notified, and the object may now reside in the wrong bucket (if its hash value is different).
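To make the wrong-bucket problem concrete, here is a minimal Python sketch. The Point class and the demo_mutated_key function are hypothetical names invented for this illustration; the only assumption is a key type whose hash is derived from fields that can still be changed after insertion.

```python
class Point:
    """Hypothetical mutable key: its hash depends on fields that can change."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __hash__(self):
        return hash((self.x, self.y))


def demo_mutated_key():
    key = Point(1, 2)
    table = {key: "payload"}              # the hash is computed once, at insertion

    found_before = Point(1, 2) in table   # an equal key finds the entry

    key.x = 99                            # mutate the key; the table is not notified

    # The old value hashes to the right bucket, but the stored key no longer
    # compares equal to it.
    found_old_value = Point(1, 2) in table
    # The new value compares equal to the stored key, but it hashes differently,
    # so the lookup never reaches the stored entry.
    found_new_value = Point(99, 2) in table

    return found_before, found_old_value, found_new_value, len(table)
```

Calling demo_mutated_key() returns (True, False, False, 1): the entry is still in the table, but neither the old nor the new value of the key can reach it.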

Hope this helps.


In most managed environments strings appear as reference types, but they are typically implemented as immutable types.

A hash function maps a very large number of states onto a smaller number of states.

That is why string hashing is good for testing string equality. You can map the value to an index of an array, and look up some information about that value very quickly. You don't need to compare every character with every other character in every other string. And you can say just about the same thing about anything. It's all about reducing, or fingerprinting, an arbitrary number of bytes in some manner that is useful.

This is where the discussion about the type of key you use in a hash table becomes moot, because it's the mapping of that value into a smaller state space, and how that's utilized internally, which makes it useful. An integer is typically hardware friendly, but 32 bits isn't really a large space, and collisions are likely within that space for arbitrary inputs.

In the end, when you do use a hash table, the cost of calculating the hash value is irrelevant compared to the time it would take to compare every value with every other value in every other possible position (assuming that your hash table contains hundreds of items).
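A rough way to see this is to count equality comparisons instead of measuring time. The sketch below uses a hypothetical CountingKey class (not from any library) that records every call to its equality check, then looks up the same key once by linear search and once through Python's built-in dict.

```python
class CountingKey:
    """Hypothetical key type that counts how often equality is evaluated."""

    comparisons = 0

    def __init__(self, value):
        self.value = value

    def __hash__(self):
        return hash(self.value)

    def __eq__(self, other):
        CountingKey.comparisons += 1
        return isinstance(other, CountingKey) and self.value == other.value


def count_comparisons(n=1000):
    keys = [CountingKey(i) for i in range(n)]
    table = {key: key.value for key in keys}
    probe = CountingKey(n - 1)        # equal to the last key, but a distinct object

    CountingKey.comparisons = 0
    for key in keys:                  # linear search: compare with each stored key
        if key == probe:
            break
    linear = CountingKey.comparisons

    CountingKey.comparisons = 0
    _ = table[probe]                  # hashed lookup: go to the bucket, then confirm
    hashed = CountingKey.comparisons

    return linear, hashed
```

With the default n=1000, the linear search evaluates equality 1000 times to reach the last key, while the dict lookup pays for one hash computation and then a single equality check to confirm the match.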


As long as a suitable hash function is provided, all types will do as keys. Remember, after all, that a hash table is just a linear array. The hash function takes a key of a certain type and computes an index into the hash table array; the slot at that index (called a bucket) is where the value gets stored (there are some issues with collisions, though).
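The "just an array" view can be sketched in a few lines of Python. ChainedHashTable is a hypothetical toy class written for this illustration; it assumes separate chaining (each bucket holds a small list of entries) as its way of dealing with collisions, which is only one of several common strategies.

```python
class ChainedHashTable:
    """Toy hash table: a fixed-size list of buckets, collisions handled by chaining."""

    def __init__(self, bucket_count=8):
        self.buckets = [[] for _ in range(bucket_count)]

    def _index(self, key):
        # Reduce the (potentially huge) hash value to a valid bucket index.
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)    # overwrite an existing entry
                return
        bucket.append((key, value))         # new key, or a collision: chain it

    def get(self, key, default=None):
        # Only the entries in one bucket are compared, not the whole table.
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default
```

Any key type works here as long as it supplies a hash and an equality check. For example, with four buckets the integer keys 1 and 5 land in the same bucket and are told apart by the equality comparison.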

So the really tricky part is finding a hash function. Of course it should have certain properties: it should be simple to compute, chaotic (nearly identical keys should be mapped to completely different hash table buckets), deterministic (the same key always maps to the same hash table bucket), uniform (all possible keys are mapped evenly to the buckets), and surjective (all buckets of the hash table should be used).
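As one illustration of such a function, here is a sketch of 32-bit FNV-1a, a well-known simple string hash. It is not mentioned in this thread and is used here only as an example. The bucket_counts helper is a hypothetical name for checking how evenly a set of keys spreads over the buckets.

```python
FNV_OFFSET_BASIS_32 = 2166136261
FNV_PRIME_32 = 16777619


def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: cheap to compute and deterministic for a given input."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h ^= byte                               # mix in one byte
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF     # multiply, keep the low 32 bits
    return h


def bucket_counts(keys, bucket_count):
    """Count how many of the given string keys fall into each bucket."""
    counts = [0] * bucket_count
    for key in keys:
        counts[fnv1a_32(key.encode("utf-8")) % bucket_count] += 1
    return counts
```

Running bucket_counts over a list of nearly identical keys such as "key0", "key1", "key2", and so on gives a quick impression of how chaotic and uniform the function is for that particular data set; it is not a proof of either property.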

It seems it is easier to define such a function for simple types like integers.


The best hash keys are those that

  1. Have good (as in low collisions) hashes (see Object.GetHashCode for .NET, Object.hashCode for Java)
  2. Have quick comparisons (for when there are hash collisions).

All that said, I think Strings are good hash keys in most cases, since there are many excellent hash implementations for Strings.
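The two criteria above also apply to composite keys. As a sketch in Python, a frozen dataclass gets a field-based hash and equality generated for it and rejects later modification; UserKey and try_to_mutate are hypothetical names used only for this example.

```python
from dataclasses import dataclass, FrozenInstanceError


@dataclass(frozen=True)
class UserKey:
    """Hypothetical composite key: hash and equality are derived from the fields."""

    tenant: str
    user_id: int


def try_to_mutate(key: UserKey) -> bool:
    """Return True if the key rejected the modification."""
    try:
        key.user_id = key.user_id + 1   # not allowed on a frozen dataclass
    except FrozenInstanceError:
        return True
    return False
```

A dict keyed by UserKey("acme", 7) can then be queried with any equal instance, and because the fields cannot be reassigned, the stored hash stays valid for the lifetime of the entry.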


If you were to use a complex type as a key then:

  • It would be hard for the hash table implementation to group items into buckets for fast retrieval; how would it decide how to group a range of hashes into a bucket?
  • The hash table may need to have intimate knowledge of the type in order to pick a bucket.
  • There is the risk of the properties of the object changing, resulting in items ending up in the wrong buckets. The hashes must be immutable.

Integers are commonly used because they are easy to split into ranges that correspond to the buckets, they are value types and therefore immutable, and they are fairly easy to generate.
