Generate unique id from unique string input_问答_开发者

I have a table with a column of unique string values. The max length of the string value is 255 char. I want to generate a unique id with the string value as input. In other words I am looking for a compact representation for a string. The unique id generated can be alpha-numeric. A useful feature to have would be to be able to regenerate the string value from the unique id.

Is there an efficient function to generate such an unique id. Some ways could be using checksum or hash functions. I want to know if there is a standard way to do this.

I am using MySql database and java.

Thanks!

--edit: I am looking for a more compact representation rather than just using the string itsel开发者_如何学Cf.

How unique is "unique"? Using any good hashing function (MD5 is decent for most uses, and easily implemented via java.security.MessageDigest.getInstance("MD5") can get you to a 128-bit number that's very very likely to be unique. Using a subset of the hash gets you a smaller ID, with a higher chance of collision.

Using an auto_increment field in the DB, if it fits your design, might be easier to implement, will truly guarantee uniqueness, and will use smaller IDs than the 16 bytes of MD5. You can also then meet your requirement of finding the string by the key, which you can't do for a hash.

If you want to be able to go "Backwards" (The indexes are guaranteed unique and can be reversed to the original strings) what you are doing is compression, anything that isn't compression (checksum) is not reversable.

To do the compression, the simplest way would be to bit-pack and get each character down to the bare minimum number of bits.

A-Z is 26 chars which is less than 32 (5 bits)

add a-z and it's 6 bits (with somewhere around 12 bit-patterns left over to represent other characters).

Let's say that is enough for you. So you have 6x255 bits which is 1530 bits to store your string. (191 bytes)

Going with only caps would reduce that a little (to 159 bytes)

You can optimize it more, but then you have to go into a compression algorithm that expects a specific language or patterns in the Strings and optimizes those patterns.

Unless you can further specify the contents of the strings, you're just not going to get what you want. Sorry. (If you can tell more about the contents of the strings, do so. One of us may see patterns that will allow much better "Compression")

This lack of ability to do what you want is why hashtables are so cool. They get a "Mostly Unique" number and then have a second level of resolution to test cases where two strings hashed to the same number.

If your database requires that the column contain unique values, then why not use the string itself? Anything else is just another step to encode/decode it.

You have much much more possibilities for a 255 long string than a 64 (or whatever) bit long number. It is impossible. Add an auto_increment field.

Since you're using MySQL, take a look at CRC32

http://www.bitbybit.dk/carsten/blog/?p=191

public String getUniqueId(String uniqueString) {
    return uniqueString;
}

Unless the ID has any other constraints on it than "be unique".

If you have a limited number of strings that occur frequently, creating a reference table with a numeric (auto-increment) ID, and a FK to that reference table in your main table could be an option.

If not, you could run your strings through GZIP or any other compression algorithm if you need to retrieve the original.

If you don't need to retrieve the original, a hash function such as MD5 is what you're looking for.

Choosing the proper key shouldnt be taken easy.

You need to consider:

Replication: Is sharing of keys between different servers needed? If so, you most probably need some sort of unique hash or guid.
Size of the table/number of inserts: You should consider that most rdbms store the data physically on the hard drive by the order of their (clustered) primary key. Now imagine what happens, if you insert a hash value starting with 'a' on a table with a reasonable size. Yes, theres index padding, but eventually its full and the single line insert can cause the move of a couple of GB on the harddrive.
Need replication AND have big tables? Use both. Use a primary clustered auto increment (long)integer key and define a unique index on your hash column.