To index 100 billion URLs, what hash function will work perfectly, with no collisions? Since URLs are unique strings, I suppose any string hash function like MD5 would be good, but I'd like input from experts.
Also, we want to search the URL set (currently a DB table) via hash, so obviously a shorter hash will be more efficient in both search time and space.
Can I specify a fixed hash length?
We are using C# .NET 4.0
Are you sure your DB table isn't the way to go? That is a lot of requirements for a hash function. Most hash functions won't let you set the length of the hash, and requiring the hash to be perfect narrows it further. Do you need all of these requirements? More than likely, a much simpler solution will work just as well.
Are you reading this off disk? (100 billion URLs, assuming a URL length of 4 for a domain, + 4 for ".com" + "/" + 3 more = 12 bytes per URL = 1.09 TiB - and that is a very conservative estimate.) You might want to look into disk-friendlier structures, such as B-trees (and their derivatives, such as B+-trees) - these data structures offer efficient lookup, removal, and insertion (log(n) theoretically, but they can beat hash tables in some common cases, largely because they make far fewer random disk seeks). Databases typically use these for indexes over hashes, which should give a hint as to their performance. (And which brings me back to my initial question: are you sure your DB table isn't the way to go?)
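A quick back-of-envelope check of that storage figure (a Python sketch; the 12-byte average is just the conservative assumption above):

```python
# Rough storage estimate for 100 billion URLs at ~12 bytes each,
# matching the conservative figure quoted above.
n_urls = 100 * 10**9
avg_bytes = 12  # 4 (domain) + 4 (".com") + 1 ("/") + 3 (path)
total_bytes = n_urls * avg_bytes
tib = total_bytes / 2**40  # 1 TiB = 2**40 bytes
print("%.2f TiB" % tib)  # → 1.09 TiB
```

And that is before any per-row overhead the database adds.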
If you use a hash, even one with collisions will work. Something like SHA256, while being relatively expensive to calculate, will have a collision rate acceptably low. (I believe it's so low, you're more likely to be struck by lightning. Multiple times. People use UUIDs with no fear of collision, which have less than half as many bits as a SHA256 hash.) The CPU cost of a SHA256 might not matter, if you're going to follow it up with disk access.
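To make the collision odds concrete, here is a rough birthday-bound estimate in Python (a sketch; the 100-billion count and the truncation lengths are illustrative). It also shows one way to get a fixed hash length - truncating SHA-256's digest - and why truncating all the way down to 64 bits is too short at this scale:

```python
import hashlib

def url_hash(url, hex_len=32):
    # Fixed-length hash: truncate SHA-256's hex digest.
    # hex_len hex chars = hex_len * 4 bits (32 chars = 128 bits).
    return hashlib.sha256(url.encode("utf-8")).hexdigest()[:hex_len]

n = 100 * 10**9  # 100 billion URLs

def collision_probability(bits):
    # Birthday-bound approximation: p ~= n^2 / 2^(bits + 1)
    return n * n / 2.0 ** (bits + 1)

# Keeping 128 bits: the chance of even one collision is vanishingly small.
p128 = collision_probability(128)
# Truncating to 64 bits: the estimate exceeds 1, i.e. collisions are certain.
p64 = collision_probability(64)
```

So a truncated-but-still-wide hash (128 bits or so) can give you both a fixed length and a collision rate you can safely ignore; going much shorter than that at 100 billion items cannot.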
(Also: is your DB table of URLs properly indexed to allow fast searching on that field?)
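For illustration, a minimal SQLite sketch in Python (the real DB engine and schema are unknown; the table and index names here are made up) showing that an index on the URL column turns the lookup into an index search rather than a full table scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE urls (url TEXT)")
conn.execute("CREATE INDEX idx_urls_url ON urls(url)")
conn.executemany(
    "INSERT INTO urls VALUES (?)",
    [("http://example.com/page/%d" % i,) for i in range(1000)],
)

# EXPLAIN QUERY PLAN reveals whether the lookup uses the index.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT 1 FROM urls WHERE url = ?",
    ("http://example.com/page/42",),
).fetchall()
detail = " ".join(str(col) for row in plan for col in row)
# `detail` should mention idx_urls_url - an index search, not a scan
```

Whatever database you actually use, the equivalent of EXPLAIN on your lookup query will tell you whether the URL column is being searched via an index.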