I'm indexing some URLs by their hash code and using that hash to retrieve them. I have two questions about this:
- Do you think this is a good approach? I mean sometimes two different URLs can produce the same hash but I don't seem to have any other choice since URLs can be very long and I need to produce a file name for them.
- [More important] Sometimes two different URLs actually refer to the same page (e.g. http://www.stackoverflow.com and http://stackoverflow.com, and sometimes URLs with % characters), but I need to produce the same hash code for these URLs. What do you suggest?
Thanks.
Definitely don't use the .NET String hash code - there's no guarantee it'll do the same thing between versions (and it did actually change between .NET 1.1 and .NET 2.0). It's also quite likely to have collisions, as it is very short (only 32 bits).
If you really have to use a hash, use a cryptographic hash as that's much less likely to result in collisions - you could use SHA-256, for example. Note that crypto hashes generally work in terms of binary data, so you'll need to convert the URL to a byte array first, e.g. with Encoding.UTF8.GetBytes(text). This isn't foolproof, but it's at least "very unlikely" to produce collisions. Of course, as the hash is rather bigger, your output filename will be bigger too. (You'll need to convert from a byte[] to a string as well, I assume - I suggest you use Convert.ToBase64String.)
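A minimal sketch of that hashing step, assuming C#. The class and method names (UrlFileNames, FromUrl) are made up for illustration. It uses hex rather than Convert.ToBase64String for the final string, because standard Base64 output can contain '/' and '+', which are awkward in file names; that substitution is a choice made for this sketch, not part of the suggestion above.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// Hypothetical helper: turns a URL into a fixed-length, file-name-safe string.
public static class UrlFileNames
{
    // Returns 64 lowercase hex characters: the SHA-256 of the URL's UTF-8 bytes.
    public static string FromUrl(string url)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(url);
        using (SHA256 sha = SHA256.Create())
        {
            byte[] hash = sha.ComputeHash(bytes);
            var sb = new StringBuilder(hash.Length * 2);
            foreach (byte b in hash)
            {
                sb.Append(b.ToString("x2")); // two hex digits per byte
            }
            return sb.ToString();
        }
    }
}
```

The result is always 64 characters long regardless of the URL's length. Collisions are still possible in principle, just extremely unlikely.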
Does your filename really have to be derived from the URL though? Couldn't you just generate random filenames (or increment a counter) and then store the mapping between URL and filename somewhere? That's a much more sensible approach IMO - and it's reversible (so you can tell which URL generated a particular file).
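The counter-plus-mapping idea could look roughly like this. It is only an in-memory sketch: the class name UrlFileMap, the ".dat" extension and the eight-digit padding are all assumptions, and a real version would have to persist the mapping (a database table or an index file, say).

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical in-memory mapping between URLs and counter-based file names.
public sealed class UrlFileMap
{
    private readonly Dictionary<string, string> fileByUrl =
        new Dictionary<string, string>(StringComparer.Ordinal);
    private readonly Dictionary<string, string> urlByFile =
        new Dictionary<string, string>(StringComparer.Ordinal);
    private int counter;

    // Returns the existing file name for a URL, or assigns the next one.
    public string GetOrAddFileName(string url)
    {
        string fileName;
        if (fileByUrl.TryGetValue(url, out fileName))
        {
            return fileName;
        }

        counter++;
        fileName = counter.ToString("D8", CultureInfo.InvariantCulture) + ".dat";
        fileByUrl[url] = fileName;
        urlByFile[fileName] = url;
        return fileName;
    }

    // Reverse lookup: which URL produced this file?
    public bool TryGetUrl(string fileName, out string url)
    {
        return urlByFile.TryGetValue(fileName, out url);
    }
}
```

Because the name comes from a counter, two URLs can never collide, and the reverse lookup gives the reversibility mentioned above.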
As for your second question - basically you'll need to find some way of deriving a canonical URL from any given URL, so that all "equivalent" URLs are converted to the same canonical one... and that's what you hash or store.
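One possible shape for such a canonicalisation step, built on System.Uri. The rules chosen here (lowercase host, drop the default port, drop the fragment, optionally strip a leading "www.") are assumptions for illustration, and the names UrlCanonicalizer and Canonicalize are hypothetical. Stripping "www." is not safe in general, since the two hosts may serve different content, so it is left as an opt-in flag. User info in the URL is ignored by this sketch.

```csharp
using System;

// Hypothetical canonicaliser; the normalisation rules are illustrative only.
public static class UrlCanonicalizer
{
    public static string Canonicalize(string url, bool stripWww)
    {
        var uri = new Uri(url, UriKind.Absolute);

        // Host names are case-insensitive, so compare them in lowercase.
        string host = uri.Host.ToLowerInvariant();
        if (stripWww && host.StartsWith("www.", StringComparison.Ordinal))
        {
            host = host.Substring(4);
        }

        // Keep the port only when it is not the scheme's default.
        string authority = uri.IsDefaultPort ? host : host + ":" + uri.Port;

        // PathAndQuery excludes the fragment (the part after '#').
        return uri.Scheme.ToLowerInvariant() + "://" + authority + uri.PathAndQuery;
    }
}
```

With stripWww set to true, http://www.stackoverflow.com and http://stackoverflow.com/ canonicalise to the same string, which is then what you would hash or store.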
Indexing based on hash codes is a path to bugs. Hash codes are not unique and do have collisions. If you index on a hash code it will lead to a situation where two non-equal values end up retrieving the same mapped value from your data table.
After lots of discussion and thinking, since no answer completely covers my questions, I'm going to answer my own question. One important note: the comment posted by Morten Mertner is the closest thing to my answer, but I cannot select a comment as the answer.
- There is no other way for me except using a hash algorithm. But to reduce the risk of collisions, I should use a stronger algorithm, such as one from the SHA-2 family.
- As Morten Mertner said, in some cases the mentioned URLs are NOT actually the same, and I cannot assume that the website is configured correctly. The only thing I can do is remove the bookmarks (the part after #) and consistently use either the encoded or the decoded version of the URL (i.e. the version with or without % characters).
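For what it's worth, the two steps in the last point might be sketched as follows, picking the decoded form. UrlNormalizer and Normalize are hypothetical names. Be aware that decoding the whole URL also decodes reserved characters such as %2F or %3F, which can make two genuinely different URLs look the same, so treat this as an approximation rather than a strict rule.

```csharp
using System;

// Hypothetical normaliser: drops the bookmark and uses the decoded form.
public static class UrlNormalizer
{
    public static string Normalize(string url)
    {
        // Remove the bookmark, i.e. everything from the first '#'.
        int hash = url.IndexOf('#');
        if (hash >= 0)
        {
            url = url.Substring(0, hash);
        }

        // Convert %XX sequences to their characters so that the encoded
        // and decoded spellings of a URL end up identical.
        return Uri.UnescapeDataString(url);
    }
}
```

The normalised string is then what gets passed to the hash function.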
Thanks for all of the help guys.