Why should the "prime-based" hashcode implementation be used instead of the "naive" one?

开发者 (Developer) https://www.devze.com 2022-12-23 11:52 Source: web

I have seen a prime-number implementation of the GetHashCode function being recommended, for example here. However, using the following code (in VB, sorry), it seems that this implementation gives the same hash density as a "naive" xor implementation. If the density is the same, I would expect both implementations to have the same probability of collision. Am I missing something about why the prime approach is preferred?

I am assuming that using a byte-sized hash code loses no generality compared with the integer case.

Sub Main()
    Dim XorHashes(255) As Integer
    Dim PrimeHashes(255) As Integer

    For i = 0 To 255
        For j = 0 To 255
            For k = 0 To 255
                XorHashes(GetXorHash(i, j, k)) += 1
                PrimeHashes(GetPrimeHash(i, j, k)) += 1
            Next
        Next
    Next

    For i = 0 To 255
        Console.WriteLine("{0}: {1}, {2}", i, XorHashes(i), PrimeHashes(i))
    Next
    Console.ReadKey()
End Sub

Public Function GetXorHash(ByVal valueOne As Integer, ByVal valueTwo As Integer, ByVal valueThree As Integer) As Byte
    Return CByte((valueOne Xor valueTwo Xor valueThree) Mod 256)
End Function

Public Function GetPrimeHash(ByVal valueOne As Integer, ByVal valueTwo As Integer, ByVal valueThree As Integer) As Byte
    Dim TempHash = 17
    TempHash = 31 * TempHash + valueOne
    TempHash = 31 * TempHash + valueTwo
    TempHash = 31 * TempHash + valueThree

    Return CByte(TempHash Mod 256)
End Function

The probability of collisions also depends on the expected distribution of the input data. In your example you assume input data that is uniformly distributed over the entire range. This is the ideal situation and it's no surprise that both algorithms perform well.

However, if you assume that the input data generally is similar in the high bits and differs mostly only in the low bits (note: a lot of real data is like this), the prime number method will spread this variation out over the whole hash whereas the XOR method will not - small changes in the low bits of two or more values can easily cancel each other out when XOR'ed. So the prime number method is less likely to collide in this case.

Also you should use 32-bit values for GetHashCode, not 8-bit values.
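The cancellation effect described above can be checked with a small experiment. The sketch below assumes three fields that each vary only in their low 4 bits (values 0 to 15) and counts how many distinct full 32-bit hashes each method produces. The module and function names are made up for illustration.

```vb
Imports System
Imports System.Collections.Generic

Module ClusteredInputDemo
    ' Naive combination: variation in the low bits of one value can cancel another's.
    Function XorHash(a As Integer, b As Integer, c As Integer) As Integer
        Return a Xor b Xor c
    End Function

    ' Prime-based combination from the question, kept as a full Integer.
    Function PrimeHash(a As Integer, b As Integer, c As Integer) As Integer
        Dim h As Integer = 17
        h = 31 * h + a
        h = 31 * h + b
        h = 31 * h + c
        Return h
    End Function

    Sub Main()
        Dim xorSeen As New HashSet(Of Integer)
        Dim primeSeen As New HashSet(Of Integer)

        ' 16 * 16 * 16 = 4096 inputs that differ only in their low 4 bits.
        For a As Integer = 0 To 15
            For b As Integer = 0 To 15
                For c As Integer = 0 To 15
                    xorSeen.Add(XorHash(a, b, c))
                    primeSeen.Add(PrimeHash(a, b, c))
                Next
            Next
        Next

        Console.WriteLine("Distinct Xor hashes:   {0}", xorSeen.Count)
        Console.WriteLine("Distinct prime hashes: {0}", primeSeen.Count)
    End Sub
End Module
```

For these 4096 inputs the Xor hash can only take the 16 values 0 to 15, while the prime hash gives every input its own value (4096 distinct hashes). The inputs are small enough that the Integer arithmetic cannot overflow here.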

Truncating the hash is your problem here. The Xor method can only ever produce 256 distinct values. For these inputs the Prime method produces values between 506,447 and 759,662, about 253,000 distinct hashes, but you throw all but 256 of them away by using only the 8 low bits. It can thus never do a better job than Xor.
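A quick way to see what truncation costs is to count distinct hash values before and after taking the low 8 bits. The sketch below does this for the question's inputs (three values in 0 to 255 each). The module name and the `CountTrue` helper are made up for illustration.

```vb
Imports System

Module TruncationDemo
    Sub Main()
        ' Largest possible hash: 17*31^3 + 961*255 + 31*255 + 255 = 759,662.
        Dim seenFull(759662) As Boolean
        Dim seenByte(255) As Boolean

        For a As Integer = 0 To 255
            For b As Integer = 0 To 255
                For c As Integer = 0 To 255
                    Dim h As Integer = 17
                    h = 31 * h + a
                    h = 31 * h + b
                    h = 31 * h + c
                    seenFull(h) = True          ' full-width hash
                    seenByte(h Mod 256) = True  ' hash truncated to 8 bits
                Next
            Next
        Next

        Console.WriteLine("Distinct full hashes:  {0}", CountTrue(seenFull))
        Console.WriteLine("Distinct 8-bit hashes: {0}", CountTrue(seenByte))
    End Sub

    ' Counts how many flags are set.
    Function CountTrue(flags As Boolean()) As Integer
        Dim n As Integer = 0
        For Each f As Boolean In flags
            If f Then n += 1
        Next
        Return n
    End Function
End Module
```

For these inputs the full hashes fill the range 506,447 to 759,662 without gaps, so this should report 253,216 distinct full hashes and 256 distinct 8-bit hashes.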

In your specific case, you can do much better. There are enough bits in an Integer to generate a unique hash with 16 million distinct values:

  Public Shared Function GetGoodHash(ByVal valueOne As Integer, ByVal valueTwo As Integer, ByVal valueThree As Integer) As Integer
    ' Parentheses are required: in VB, + binds tighter than <<, which binds tighter than And.
    Return (valueOne And 255) Or ((valueTwo And 255) << 8) Or ((valueThree And 255) << 16)
  End Function
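One way to see why a packed hash of this kind cannot collide for byte-sized inputs is that the three inputs can be recovered from it. The sketch below uses a hypothetical helper, `PackBytes`, with every term parenthesized, because in VB `+` binds tighter than `<<`, which in turn binds tighter than `And`.

```vb
Imports System

Module PackedHashDemo
    ' Hypothetical helper: stores the low byte of each value in bits 0-7, 8-15 and 16-23.
    Function PackBytes(a As Integer, b As Integer, c As Integer) As Integer
        Return (a And 255) Or ((b And 255) << 8) Or ((c And 255) << 16)
    End Function

    Sub Main()
        Dim h As Integer = PackBytes(1, 2, 3)
        Console.WriteLine(h)                  ' 1 + 2*256 + 3*65536 = 197121
        Console.WriteLine(h And 255)          ' 1
        Console.WriteLine((h >> 8) And 255)   ' 2
        Console.WriteLine((h >> 16) And 255)  ' 3
    End Sub
End Module
```

Since each input occupies its own 8 bits, two different triples of bytes always produce different hashes.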

The Xor method is okay when the input values are well distributed. A problem with the prime method is that it is easy to trigger an OverflowException. That's difficult to deal with in VB.NET code because it has no equivalent of the C# unchecked keyword. You have to turn overflow checking off globally: Project + Properties, Compile tab, Advanced Compile Options, tick "Remove integer overflow checks". You can avoid that by computing the hash as an Int64, which makes it a bit more expensive.
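A minimal sketch of the Int64 approach, assuming the goal is the same wrap-around result that unchecked 32-bit arithmetic would give. The names `Combine` and `GetPrimeHash32` are made up for illustration.

```vb
Imports System

Module OverflowSafeHashDemo
    ' Hypothetical helper: folds one value into the running hash using 64-bit
    ' arithmetic and keeps only the low 32 bits, so the Long itself never overflows.
    Function Combine(hash As Long, value As Integer) As Long
        Return (31L * hash + value) And &HFFFFFFFFL
    End Function

    Function GetPrimeHash32(a As Integer, b As Integer, c As Integer) As Integer
        Dim h As Long = 17
        h = Combine(h, a)
        h = Combine(h, b)
        h = Combine(h, c)
        ' Reinterpret the unsigned 32-bit result as a signed Integer.
        If h >= &H80000000L Then h -= &H100000000L
        Return CInt(h)
    End Function

    Sub Main()
        ' With plain Integer arithmetic and overflow checks enabled,
        ' these inputs would throw an OverflowException.
        Console.WriteLine(GetPrimeHash32(Integer.MaxValue, Integer.MaxValue, Integer.MaxValue))
    End Sub
End Module
```

Masking after every step keeps the intermediate value below 2^32, so multiplying by 31 and adding an Integer always fits in a Long. The cost is the extra 64-bit multiply and mask per field that the answer mentions.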