开发者

Creating unique non guessable base 36 id

开发者 https://www.devze.com 2023-02-02 19:26 出处:网络
For an application similar to URL shortener services, I want to create non guessable id\'s, which you\'re all familiar with I presume. Here\'s an example of such an id:

For an application similar to URL shortener services, I want to create non guessable id's, which you're all familiar with I presume. Here's an example of such an id:

http://example.com/sd23t9

What is a good and efficient technique to generate these, with minimal (or no) risk of collision when inserting these as a primary key in a database开发者_开发问答 table?

EDIT:

Piskvor makes an excellent point of course. I should have mentioned that I meant minimal collision risk before the 36^6 limit is met.

EDIT 2

Eh, scrap that, his point was illustrating much more than that of course. Hmmm. Pregenerating a table with id's then, perhaps (like I've read elsewhere already)? Would that be the most efficient technique when i'm bound to the 36^6 and none-consecutive constraints perhaps?


Set ID length. // e.g. 6
do {
  Generate a short random ID of the given length
  Collision?
   - No:
      DONE!
   - Yes:
      increase ID length
 } while true

For any finite ID length, there's always a risk of collision: assuming from your example that you'd have [a-z0-9]{6} IDs, as soon as you have 2,176,782,336 IDs, a collision is 100% guaranteed (no more available keys). Due to birthday problem, you'll get collisions much, much sooner. With such a small keyspace, there isn't a way to avoid collisions - you'll need to have some sort of collision recovery instead.

You could generate an ID until it doesn't collide - but this will get progressively slower as the keyspace is being filled: imagine a keyspace of [a-z], with [a-n] and [p-z] already taken - now every new random ID is more likely to collide than not; and when you completely fill the keyspace, the loop will not terminate at all.

The algorithm I propose is maybe overcautious in this regard: if it finds a collision, it will try progressively longer IDs (as it assumes "collision => not feasible to check the shorter keyspace"). Although it's not efficient, it's likely to find a non-colliding ID within a few iterations.


A bit of a weird idea. Why not to use permutations? e.g. you have a set of values [0-9a-z] when you generate first id. you take a first permutation in lexicographical order. then second and so on. to make it appear less guessable you can change the rules of lexicographic order. say 'a' goes after 't' or something like that. you can use tuple instead of a full permutations as well. This will ensure there is no collisions.

What this idea is actually about is making some sort of a two way hash function. basically if you are able to encode number "1" in some way to get something like "q8d3dw" and be able to decode "q8d3dw" back to "1" you can be certain that this function will give you unique strings for all the values from 1 to 36^6.

The problem is actually selecting this function. The easy way would be to associate "1" to "000000", "2" to "000001", "12" to "00000b". Basically arrange all available strings in lexicographical order and select the string on a position which is equal to the id. This however is really easy to guess. So what you could do is to artificially change rules of lexicographic order. Say instead of having a normal order(0,1,2,3...a,b,c...x,y,z) you could shuffle it a bit and get something like (a,5,t,3...). This will produce a bit more obfuscated results. However it will still be quite guessable because the first element would be "aaaaaa", second "aaaaa5", then "aaaaat". So you can change rules of lexicographic order even further, making them depended on the position of the character. Say order for the first char id (a,5,t,3...) for the second one (y,7,3,r...), etc.

Now, i will not post any pseudo code as it will be quite a long one. And I'm not advising you to go with that route unless you are interested in creating this sort of algorithms for fun :). However if you will go with this route it could be a very efficient way of generating this id's with no chance of collision. And I'll advise you to read volume 4 of "Art of computer programming" by Dr Donald Knuth. There are lots of suggestions in implementation of such algorithms.


@ivan: here is an implementation.

First you need the 6 strings, here's code to generate them:

$letters = range('a', 'z');
$numbers = range(0, 9);
$char_list = array_merge_recursive($letters, $numbers);
for ($i = 0; $i <= 5; $i++) {
    shuffle($char_list);
    echo '$mysterykey[' . $i . "] = '" . implode($char_list) . "';\n";
}

Then you can use the output from that in the function:

function int_to_x36($value) {
    $mysterykey[0] = 'awkbs81t3jyg20v4qhoxzcuenr65dfimlp97';
    $mysterykey[1] = 'ut17qclz96n3msky8jwp4r25gdvhxo0biaef';
    $mysterykey[2] = 'cewszx142nph05mi9ulafbdvy36o8gq7trjk';
    $mysterykey[3] = '37hp28wjdqe5vnlzxofrsybu4im90k16agtc';
    $mysterykey[4] = 'n9a3jksl5ct0eqfzo7pvgyd4m2hiu18xrb6w';
    $mysterykey[5] = 'mq0nbk3zs529gu4tecli8j176vardxoypfwh';

    $x36 = array();
    for ($i = 5; $i >= 0; $i--) {
        $x36[$i] = 0;  
        $y = pow(36, $i);

        if ($value >= $y) {
            $z = floor($value/$y);
            $value = $value - ($z * $y);
            $x36[$i] = $z;
        }   
    }      

    $encoded = '';
    for ($i = 0; $i <= 5; $i++) {
        $encoded .= $mysterykey[$i][$x36[$i]];
    }

    return $encoded;
}

function x36_to_int($value) {
    $mysterykey[0] = 'awkbs81t3jyg20v4qhoxzcuenr65dfimlp97';
    $mysterykey[1] = 'ut17qclz96n3msky8jwp4r25gdvhxo0biaef';
    $mysterykey[2] = 'cewszx142nph05mi9ulafbdvy36o8gq7trjk';
    $mysterykey[3] = '37hp28wjdqe5vnlzxofrsybu4im90k16agtc';
    $mysterykey[4] = 'n9a3jksl5ct0eqfzo7pvgyd4m2hiu18xrb6w';
    $mysterykey[5] = 'mq0nbk3zs529gu4tecli8j176vardxoypfwh';

    $value36 = str_split($value);

    $x36 = array();
    for ($i = 0; $i <= 5; $i++) {
        $x36[$i] = strpos($mysterykey[$i], $value36[$i]);
    }

    $x = 0;
    for ($i = 5; $i >= 0; $i--) {
        $x += $x36[$i] * pow(36, $i);
    }      

    return $x;
}

I'm sure I've overlooked something but it seems to be working.


Big random number and SHA-256 Hash for example? You can shorten it later to fit your needs.


If the site is sufficiently large, and I mean large - as in "we need hundreds of new IDs assigned per second" (which will have other problems, e.g. exhaust the 36^6 keyspace under a year), you could pre-generate the keys and assign them.

The following is a pseudocode example - in such a large-traffic site, you'd probably need to optimize the algorithms used.

Pre-generating is trivial - just start at 000000 and go all the way to zzzzzz, insert each row into a table and mark them unused:

 ID     | used
==============
 000000   0   
 000001   0   
 ...
 zzzzzz   0   

When you get a request for new ID, select a random one and mark it used (danger! concurrency issues!):

SELECT ID FROM ids WHERE RAND() LIMIT 1
UPDATE ids SET used = 1, url=what_you_want_shortened WHERE ID = whatever_you_got_from_previous_query

You'll run into efficiency issues (e.g. with WHERE RAND(), table locking, et al.), but it is doable.


If you're not averse to bringing in a .NET dll, I've created a project to do exactly this. Source is on GitHub here, and binaries are in the IdGenerator NuGet Package.

It gives you ordered sequences and/or random values of user-specified length, all in base-36. Ids are universally unique, even with multiple id generator instances or in a distributed environment.

var generator = new Base36IdGenerator(
                numTimestampCharacters: 11, 
                numServerCharacters: 4, 
                numRandomCharacters: 5, 
                reservedValue: "", 
                delimiter: "-", 
                delimiterPositions: new[] {15, 10, 5});

// This instance would give you a 20-character Id, with an
// 11-character timestamp, 4-character servername hash, 
// and 5 character random sequence. This gives you 1.6 million
// hash combinations for the server component, and 60 million
// possible combinations for the random sequence. Timestamp is
// microseconds since epoch:
Console.WriteLine(generator.NewId());
// 040VZC6SL01003BZ00R2

// Argument name specified for readability only:
Console.WriteLine(generator.NewId(delimited: true));
// 040VZ-C6SL0-1003B-Z00R2

Of course if you're more interested in the string not being guessable than in having an ordered sequence, you could just use the GetRandomString method:

// 6-characters give you a little over 2 billion possible 
// combinations (36 ^ 6). 7 characters is about 78 billion.
Console.WriteLine(generator.GetRandomString(length: 6));

The basic principle behind it is:

  • Get microseconds since epoch (64-bit number)
  • Get random number (64-bit) between 0 and (36 ^ desired length) (13 max)
  • Get a servername hash (simple Sha1)
  • Convert each component to base-36 string
  • Pad left with 0 to desired length

The Base36 encoder itself is from http://www.stum.de/2008/10/20/base36-encoderdecoder-in-c/.

0

精彩评论

暂无评论...
验证码 换一张
取 消