开发者

Hash quality and stability of String.GetHashCode() in .NET?

开发者 https://www.devze.com 2022-12-16 19:24 出处:网络
I am wondering about the hash quality and the hash stability produced by the String.GetHashCode() implementation in .NET?

I am wondering about the hash quality and the hash stability produced by the String.GetHashCode() implementation in .NET?

Concerning the quality, I am focusing on algorithmic aspects (hence, the quality of the hash as it impacts large hash-tables, not for security concerns).

Then, concerning the stabil开发者_运维问答ity, I wondering about the potential versionning issues that might arise from one .NET version to the next.

Some lights on those two aspects would be very appreciated.


I can't give you any details about the quality (though I would assume it is pretty good given that string is one of the framework's core classes that is likely to be used as a hash key).

However, regarding the stability, the hash code produced on different versions of the framework is not guaranteed to be the same, and it has changed in the past, so you absolutely must not rely on the hash code being stable between versions (see here for a reference that it changed between 1.1 and 2.0). In fact, it even differs between the 32-bit and 64-bit versions of the same framework version; from the docs:

The value returned by GetHashCode is platform-dependent. For a specific string value, it differs on the 32-bit and 64-bit versions of the .NET Framework.


This is an old question, but I'd like to contribute by mentionning this microsoft bug about hash quality.

Summary: On 64b, hash quality is very low when your string contains '\0' bytes. Basically, only the start of the string will be hashed.

If like me, you have to use .Net strings to represent binary data as key for high-performance dictionaries, you need to be aware of this bug.

Too bad, it's a WONTFIX... As a sidenote, I don't understand how they could say that modifying the hashcode being a breaking change, when the code includes

// We want to ensure we can change our hash function daily.
// This is perfectly fine as long as you don't persist the
// value from GetHashCode to disk or count on String A
// hashing before string B. Those are bugs in your code.
hash1 ^= ThisAssembly.DailyBuildNumber;

and the hashcode is already different in x86/64b anyway.


I just came across a related problem to this. On one of my computers (a 64 bit one) I had a problem that I tracked down to 2 different objects being identical except for the (stored) hashcode. That hashcode was created from a string....the same string!

m_storedhash = astring.GetHashCode();

I dont know how these two objects ended up with different hash codes given they were from the same string however I suspect what happened is that within the same .NET exe, one of the class library projects I depend upon has been set to x86 and another to ANYCPU and one of these objects was created in a method inside the x86 class lib and the other object (same input data, same everything) was created in a method inside the ANYCPU class library.

So, does this sound plausible: Within the same executable in memory (not between processes) some of the code could be running with the x86 Framework's string.GetHashCode() and other code x64 Framework's string.GetHashCode() ?


I know that this isn't really included the meanings of quality and stability that you specified, but it's worth being aware that hashing extremely large strings can produce an OutOfMemoryException.

https://connect.microsoft.com/VisualStudio/feedback/details/517457/stringcomparers-gethashcode-string-throws-outofmemoryexception-with-plenty-of-ram-available


The quality of the hash codes are good enough for their intended purpose, i.e. they doesn't cause too many collisions when you use strings as key in a dictionary. I suspect that it will only use the entire string for calculating the hash code if the string length is reasonably short, for huge strings it will brobably only use the first part.

There is no guarantee for stability across versions. The documentation clearly says that the hashing algorithm may change from one version to the next, so that the hash codes are for short term use.

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号