开发者

PHP - What is a good way to produce a short alphanumeric string from a long md5 hash?

开发者 https://www.devze.com 2023-01-08 17:44 出处:网络
This is for the purpose of having a nice sh开发者_C百科ort URL which refers to an md5 hash in a database. I would like to convert something like this:

This is for the purpose of having a nice sh开发者_C百科ort URL which refers to an md5 hash in a database. I would like to convert something like this:

a7d2cd9e0e09bebb6a520af48205ced1

into something like this:

hW9lM5f27

Those both contain about the same amount of information. The method doesn't have to be direct and reversible but that would be nice (more flexible). At the least I would want a randomly generated string with the hex hash as the seed so it is reproducible. I'm sure there are many possible answers, I am curious to see how people would do it in an elegant way.

Oh, this doesn't have to have perfect 1:1 correspondence with the original hash but that would be a bonus (I guess I already implied that with the reversibility criteria). And I would like to avoid collisions if possible.

EDIT I realized my initial calculations were totally wrong (thanks to the people answering here but it took me awhile to clue in) and you can't really reduce the string length very much by throwing in all the lower case and uppercase letters into the mix. So I guess I will want something that doesn't directly convert from hex to base 62.


Here's a little function for consideration:

/** Return 22-char compressed version of 32-char hex string (eg from PHP md5). */
function compress_md5($md5_hash_str) {
    // (we start with 32-char $md5_hash_str eg "a7d2cd9e0e09bebb6a520af48205ced1")
    $md5_bin_str = "";
    foreach (str_split($md5_hash_str, 2) as $byte_str) { // ("a7", "d2", ...)
        $md5_bin_str .= chr(hexdec($byte_str));
    }
    // ($md5_bin_str is now a 16-byte string equivalent to $md5_hash_str)
    $md5_b64_str = base64_encode($md5_bin_str);
    // (now it's a 24-char string version of $md5_hash_str eg "VUDNng4JvrtqUgr0QwXOIg==")
    $md5_b64_str = substr($md5_b64_str, 0, 22);
    // (but we know the last two chars will be ==, so drop them eg "VUDNng4JvrtqUgr0QwXOIg")
    $url_safe_str = str_replace(array("+", "/"), array("-", "_"), $md5_b64_str);
    // (Base64 includes two non-URL safe chars, so we replace them with safe ones)
    return $url_safe_str;
}

Basically you have 16-bytes of data in the MD5 hash string. It's 32 chars long because each byte is encoded as 2 hex digits (i.e. 00-FF). So we break them up into bytes and build up a 16-byte string of it. But because this is no longer human-readable or valid ASCII, we base-64 encode it back to readable chars. But since base-64 results in ~4/3 expansion (we only output 6 bits per 8 bits of input, thus requiring 32 bits to encode 24 bits), the 16-bytes becomes 22 bytes. But because base-64 encoding typically pads to lengths multiples of 4, we can take only the first 22 chars of the 24 character output (the last 2 of which are padding). Then we replace non-URL-safe characters used by base-64 encoding with URL-safe equivalents.

This is fully reversible, but that is left as an exercise to the reader.

I think this is the best you can do, unless you don't care about human-readable/ASCII, in which case you can just use $md5_bin_str directly.

And also you can use a prefix or other subset of the result from this function if you don't need to preserve all the bits. Throwing out data is obviously the simplest way to shorten things! (But then it's not reversible)

P.S. for your input of "a7d2cd9e0e09bebb6a520af48205ced1" (32 chars), this function will return "VUDNng4JvrtqUgr0QwXO0Q" (22 chars).


Here are two conversion functions for Base-16 to Base-64 conversion and the inverse Base-64 to Base-16 for arbitrary input lengths:

function base16_to_base64($base16) {
    return base64_encode(pack('H*', $base16));
}
function base64_to_base16($base64) {
    return implode('', unpack('H*', base64_decode($base64)));
}

If you need Base-64 encoding with the URL and filename safe alphabet , you can use these functions:

function base64_to_base64safe($base64) {
    return strtr($base64, '+/', '-_');
}
function base64safe_to_base64($base64safe) {
    return strtr($base64safe, '-_', '+/');
}

If you now want a function to compress your hexadecimal MD5 values using URL safe characters, you can use this:

function compress_hash($hash) {
    return base64_to_base64safe(rtrim(base16_to_base64($hash), '='));
}

And the inverse function:

function uncompress_hash($hash) {
    return base64_to_base16(base64safe_to_base64($hash));
}


You could just do plain old base conversion. The hash is expressed in hexadecimal, and you can then create an alphabet of the size you want to express the hash. Base64 works well for this purpose, although you will probably want to write your own function so you end up encoding the value, not the string.

Note, however, that standard Base64 contains characters you wouldn't want to put in a URL; +, / and the padding character =. You can replace those characters with something else when converting back and forth to get a URL-safe Base64 encoding (or use a safe set of characters to begin with if you write your own function).


I would advise against a 1-1 correspondence:

With base-64 encoding you will only be able to decrease the input to (4/8)/(6/8) -> 4/6 ~ 66% in size (and this is assuming you deal with the "ugly" base64 characters without adding anything new).

I would likely consider a (secondary) look-up method to get truly "pretty" values. Once you have this alternate method established, choosing how to generate values in that range -- e.g. random numbers -- can be free of the source hash value (because correspondence is lost anyway) and an arbitrary "pretty" target-set can be used, perhaps [a-z][A-Z][0-9].

You can convert to the base (62 above) by simply following the divide-and-carry method and a look-up into an array. It should be fun little exercise.

Note: If you choose the random number from [0, 62^5) then you will get a value that will completely pack the encoded output (and fit within 32-bit integer values). You can then perform this process multiple times in a row to get a nice multiple of-5 result value, such as xxxxxyyyyyzzzzzz (where x,y,z are different groups and the total value is in the range (62^5)^3 -> 62^15 -> "a huge value")

Edit, for comment:

Because without the 1-1 correspondence you can make truly short pretty things -- perhaps as "small" as 8 characters long -- with base62, 8 characters can store up to 218340105584896 values, which is likely more than you will ever need. Or even 6 characters which "only" allows storage of 56800235584 different values! (And you still can't store that number in a plain 32-bit integer :-) If you drop to 5 characters, you once again reduce the space (to just under one billion: 916,132,832), but now you have something that can fit in a signed 32-bit integer (albeit it is somewhat wasteful).

The DB should ensure no duplicates, albeit an index on this value will be "fast-fragmenting" with a random source (but you could use counters or whatnot). A well-distributed PRNG should have minimal conflicts (read: retries) in a large enough range (assuming you keep the seed rolling and don't reset it, or reset it appropriately) -- Super 7 can even guarantee NO duplicates during a cycle (of only ~32k), but as you can see above, the target space is still big. See the math at the top of what maintaining a 1-1 relationship requires in terms of minimum encoded size.

The divide-and-carry method just explains how to get your source number into a different base -- perhaps base62. The same general method can be applied to go from the "natural" base (base10 in PHP) to any base.


Of course if I want a function to satisfy my needs perfectly I better make it myself. Here is what I came up with.

//takes a string input, int length and optionally a string charset
//returns a hash 'length' digits long made up of characters a-z,A-Z,0-9 or those specified by charset
function custom_hash($input, $length, $charset = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUFWXIZ0123456789'){
    $output = '';
    $input = md5($input); //this gives us a nice random hex string regardless of input 

    do{
        foreach (str_split($input,8) as $chunk){
            srand(hexdec($chunk));
            $output .= substr($charset, rand(0,strlen($charset)), 1);
        }
        $input = md5($input);

    } while(strlen($output) < $length);

    return substr($output,0,$length);
}

This is a very general purpose random string generator, however it is not just any old random string generator because the result is determined by the input string and any slight change to that input will produce a totally different result. You can do all sort of things with this:

custom_hash('1d34ecc818c4d50e788f0e7a9fd33662', 16); // 9FezqfFBIjbEWOdR
custom_hash('Bilbo Baggins', 5, '0123456789bcdfghjklmnpqrstvwxyz'); // lv4hb
custom_hash('', 100, '01'); 
// 1101011010110001100011111110100100101011001011010000101010010011000110000001010100111000100010101101

Anyone see any problems with it or any room for improvement?


It depends on what a7d2cd9e0e09bebb6a520af48205ced1 is. Assuming you're talking about a hex number since it's coming from md5, you could just run a base64_encode. If you have the hex in string form, you'd want to run hexdec. Be careful you don't run into maxint issues though.

0

精彩评论

暂无评论...
验证码 换一张
取 消