Low collision email hashing algorithm?_问答_开发者

Background:

We are building a mailing tool and have currently separated emailaddresses into a separate table, so that a single emailaddress is only stored once and is instead referenced by its id. We find this to be a good idea as the number of recipients per email can be huge and it's likely that most emailaddresses will receive well in excess of 100 emails.

However, when a user is importing emailaddresses into a list or similar operations, we first need to bulk-insert to make sure all emailaddresses have ids, we just ignore collisions, that works. However, when we then want to insert them into a list we must fetch the emailaddresses one-by-one or with a huge IN-query with emailaddresses (as a list references an emailaddress by id), not very tempting!

EDIT: Users may be importing 100.000+ emailaddresses, for 1000 or so emailaddresse开发者_开发技巧s it's not a real issue to query one-by-one of course.

Question:

So one idea is to hash each emailaddress and use that as an id instead. Meaning, we can predict the id for all emailaddresses without having to query for them. But are there any good algorithms out there as storing 16bytes/128bit+ kind of defeats the purpose... 64bit should be enough no? Which would be appreciated given that this should all be indexed too.

Any recommendations? What if we were to simply take the first 8bytes from the MD5? Is 8bytes from SHA1 better? Perhaps there are more specialized algorithms? I'm not all read up on collision probability, but I'm curious as to how well existing algorithms works when trimmed down and as emails are lower-case and mostly letters or numbers. (Note the dataset will potentially be huge)

PS. We are using PHP so it somewhat limits our ability to implement special algorithms.

Not sure i understand your use-case, but put a unique key constraint on the email address column...

There is a huge list of issues with your current approach and its limitations.

Most of which are simply solved by maintaining a table of email addresses with the ID as primary key (auto-increment on MySQL or SQLite, sequence elsewhere) and a unique index on email address.

Why your "users" are manipulating large lists of email addresses is far from clear. It sounds like a large proportion of your data (i.e. recipients on a specific list) is not maintained within your database. You should never "fetch the emailaddresses one-by-one or with a huge IN-query with emailaddresses".

Cutting down the output of md5 or sha undermines the uniqueness of the hash and makes collisions much more likely.

Before you do anything drastic, check your query plan (how you do that depends on the particular database server you use, check its documentation).

See if you can't get an index to work on the email addresses. This should speed things up a little bit, though the planner might skip them because you are inserting huge amounts of data.

When (and only when) you have tried that, you can look into the hashing issue.

I don't know any algorithms specifically designed for hashing email addresses, and while you could use MD5 it is designed to be used when the probability for collisions must be so small it basically never happens (I don't think anybody has spotted an MD5 collision in the wild). This can be done, but it is computationally expensive. This is even worse if you use SHA.

In your case I would suggest something simpler: first since we can assume all emails are in the form

<someName>@<someServer>

what I would do would be to split the email into the two parts, strip all non letter, nonnumeric chars from each.

Next we can compute the numeric value for each of the two parts, which we get by summing up the ascii value of each individual letter (you stripped everything else, so there isn't going to be a problem with multibyte chars).

At this point all that is left to do is to combine the two sums, and since we can expect there to be far fewer possible senders we can spend only two bytes to hold the name of the server.

In pseudo code:

function emailHash(namePart, serverPart){
  $someName = asciiStrip(namePart)
  $someServer = asciiStrip(serverPart)
  $someNameSum = 0 
  $someServerSum = 0 
  foreach($letter in $someName){
    $someNameSum += asciiValue($letter)
  }
  foreach($letter in $someServer){
    $someServerSum += asciiValue($letter)
  }
  return ($someNameSum % 2^6)*2^2 + $someServerSum % 2^2
}

Edit based on comments

You are right, that one is really poor. However there is another interesting thing you could do, although it would be a bit more difficult to implement.

After we strip away the foreign chars there are only 36 possible chars so we need only 6 bits to store each value. With 48 bits of storage for the username part it is possible to store 8 chars from the email address. How low would the collision get for that?

The could be improved by somehow quashing the numbers (say storing them after dividing them by two) so that in total we are dealing with only 32 numbers. Then it is possible to store each digit in only 5 bits for a total of 9 chars.

If this doesn't give a low enough collision rate, you might have to use MD5 which should (assuming the algorithm gives a perfect distribution) only give you 1 in several billion billion chance of collision.