I don't understand this explanation, which says that if n is the number of elements in the hash table and m is the total number of buckets, then hash tables have constant access time on average only if n is proportional to theta(n). Why does it have to be proportional?
Well, actually m should be proportional to n. Otherwise you could, for example, have just 1 bucket, and it would be just like an unsorted list that you have to search linearly.
To be more precise, if m is proportional to n, i.e. m = c * n, then the expected number of items in each bucket (assuming the hash function spreads keys evenly) will be n/m = 1/c, which is a constant. Going to any bucket is an O(1) operation (just compute the hash code), and then the search through the bucket takes constant time (you could just do a linear search through the items in the bucket, whose number is bounded by a constant on average).
Thus the order of the algorithm is O(1), if m = c * n.
To take a converse example, suppose we had a fixed size table of size tableSize. Then the expected number of items in each bucket is n/tableSize which is a linear function of n. Any kind of search through the bucket is at best O(log(n)) for a tree (I'm assuming you don't stick another hash table inside the bucket or we then have the same argument over that hash table), so it would not be O(1) in this case.
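The two cases above can be checked with a small simulation. This is only a sketch under some assumptions: it uses separate chaining with plain Python lists as buckets, random integer keys, and a hypothetical helper named `average_comparisons` that counts how many keys are compared during a successful lookup.

```python
import random

def average_comparisons(n, m, seed=0):
    """Average number of key comparisons per successful lookup in a
    chained hash table holding n random keys in m buckets."""
    rng = random.Random(seed)
    keys = rng.sample(range(10**9), n)      # n distinct random keys
    buckets = [[] for _ in range(m)]
    for k in keys:
        buckets[hash(k) % m].append(k)      # O(1) to find the bucket
    total = 0
    for k in keys:
        # Linear search inside the bucket: position of k, counted from 1.
        total += buckets[hash(k) % m].index(k) + 1
    return total / n

for n in (1000, 2000, 4000, 8000):
    proportional = average_comparisons(n, m=n)   # m = c * n with c = 1
    fixed = average_comparisons(n, m=64)         # fixed tableSize = 64
    print(n, round(proportional, 2), round(fixed, 2))
```

With m growing alongside n, the first figure should stay near a small constant, while with the fixed table it should grow roughly in proportion to n.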
Strictly speaking, the average-case time complexity of hash table access is actually in Ω(n^(1/3)). Information can't travel faster than the speed of light, which is a constant. Since space has three dimensions, storing n bits of data requires that some data be located at a distance on the order of n^(1/3) from the CPU.
More detail in my blog.
With fewer buckets relative to the number of items, the chance of collisions is higher, and thus you more often have to scan through the list of items with the same hash key.
Access time is constant because access is based on a calculation of a hash value and then a constant lookup to find the appropriate bucket. Assuming the hash function evenly distributes items amongst buckets, then the time it takes to access any individual item will be equal to the time to access other items, regardless of n.
Constant doesn't necessarily mean constantly low, though. The average access time depends on how evenly the hash function distributes items and on the number of buckets. If you have thousands of items evenly distributed amongst a small number of buckets, you find the bucket fast but then loop through a lot of items in the bucket. If you have a good proportion of buckets to items but a bad hash function that puts many more items in some buckets than in others, access to the items in the larger buckets will be slower than access to the others.
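To illustrate the effect of a bad hash function, here is a rough sketch that puts the same keys into the same number of buckets using two different bucket-selection functions. The names `bad_bucket`, `good_bucket` and `longest_bucket` are made up for this example, and the keys (multiples of 16) are chosen deliberately to trip up the naive function.

```python
import hashlib
from collections import Counter

M = 64                                   # number of buckets
keys = [16 * i for i in range(1000)]     # keys that share a common factor with M

def bad_bucket(key):
    # Naive choice: key modulo table size. Every key here is a multiple of 16,
    # so only buckets 0, 16, 32 and 48 are ever used.
    return key % M

def good_bucket(key):
    # Mix the key's bits with a cryptographic hash before reducing modulo M.
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % M

def longest_bucket(bucket_of):
    """Size of the fullest bucket when all keys are placed with bucket_of."""
    return max(Counter(bucket_of(k) for k in keys).values())

print("bad hash, longest bucket: ", longest_bucket(bad_bucket))
print("good hash, longest bucket:", longest_bucket(good_bucket))
```

Both tables have the same ratio of items to buckets (about 15.6 items per bucket), but with the naive function a lookup has to scan one of four very long buckets.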
A reasonably-sized hash table, where there are enough slots for every element you store and plenty of extra space, will have the hashing function doing most of the work of choosing slots, with very few collisions where different elements have the same hash. A very crowded hash table would have lots of collisions and would degrade to basically a linear search: almost every slot you look at first will hold a wrong item that had the same hash, and you'll have to keep searching for the right one. (A hash table lookup still has to check the key once it picks the first slot, because the key it's looking for might have had a collision when it was stored.)
What determines the hit-collision ratio is exactly the ratio of number-of-items to size-of-hash (i.e., the percentage chance that a randomly chosen slot will be filled).
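The ratio of items to slots is usually called the load factor. As a sketch of how it drives lookup cost, the following fills a fixed-size table to different load factors and counts the slots inspected per successful lookup. It assumes open addressing with linear probing, which is one possible collision strategy and not necessarily the one the answer had in mind; `average_probes` is a hypothetical helper.

```python
import random

def average_probes(load_factor, size=4096, seed=1):
    """Average slots inspected per successful lookup in a linear-probing
    table of `size` slots filled to `load_factor` (must be below 1)."""
    rng = random.Random(seed)
    keys = rng.sample(range(10**9), int(load_factor * size))
    table = [None] * size
    for k in keys:
        i = k % size
        while table[i] is not None:      # collision: try the next slot
            i = (i + 1) % size
        table[i] = k
    probes = 0
    for k in keys:
        i = k % size
        probes += 1                      # the first slot is always checked
        while table[i] != k:             # wrong item: keep searching
            i = (i + 1) % size
            probes += 1
    return probes / len(keys)

for alpha in (0.25, 0.5, 0.75, 0.9):
    print(alpha, round(average_probes(alpha), 2))
```

The table size never changes here; only the fraction of filled slots does, and the number of probes should climb steeply as the table gets crowded.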