Can I make uuid's more random?_问答_开发者_运维开发者技术经验分享

I have a program that dispatches messages to separate processes. I need to balance the load, but not in very precise way, almost the same number is ok. Since every message has an uuid field, I want to do it by uuid value. Af开发者_JAVA技巧ter I tested the uuid randomness I found it to not be as random as I expexted. I have the last one and the first one about 80% difference. This is unacceptable, so I want to know if there is an algorithm that can make it more random.

Here is my test code.

import uuid
from collections import Counter

COUNT = 3000

def b(length):
    holder = []
    for i in xrange(COUNT):
        holder.append(str(uuid.uuid4())[:length])
    return Counter(holder)

def num(part_count):
    sep = 0xffffffffffffffffffffffffffffffff / part_count
    parts = []
    for i in xrange(COUNT):
#        str_hex = str(uuid.uuid4())[:4]
        num = int(uuid.uuid4().hex,16)
        divide = num/sep
        if divide == part_count:
            divide = part_count - 1
        parts.append(divide)
    return Counter(parts)

if __name__ == "__main__":
    print num(200)

and I get the output like this:

Counter({127L: 29, 198L: 26, 55L: 25, 178L: 24, 184L: 24, 56L: 23, 132L: 23, 143L: 23, 148L: 23, 195L: 23, 16L: 21, 30L: 21, 44L: 21, 53L: 21, 97L: 21, 158L: 21, 185L: 21, 13L: 20, 146L: 20, 149L: 20, 196L: 20, 2L: 19, 11L: 19, 15L: 19, 19L: 19, 46L: 19, 58L: 19, 64L: 19, 68L: 19, 70L: 19, 89L: 19, 112L: 19, 118L: 19, 128L: 19, 144L: 19, 156L: 19, 192L: 19, 27L: 18, 41L: 18, 42L: 18, 51L: 18, 54L: 18, 85L: 18, 87L: 18, 88L: 18, 93L: 18, 94L: 18, 104L: 18, 106L: 18, 115L: 18, 4L: 17, 22L: 17, 45L: 17, 59L: 17, 79L: 17, 81L: 17, 105L: 17, 125L: 17, 138L: 17, 150L: 17, 159L: 17, 167L: 17, 194L: 17, 3L: 16, 18L: 16, 28L: 16, 31L: 16, 33L: 16, 62L: 16, 65L: 16, 83L: 16, 111L: 16, 123L: 16, 126L: 16, 133L: 16, 145L: 16, 147L: 16, 163L: 16, 166L: 16, 183L: 16, 188L: 16, 190L: 16, 5L: 15, 6L: 15, 9L: 15, 23L: 15, 26L: 15, 34L: 15, 35L: 15, 38L: 15, 69L: 15, 73L: 15, 74L: 15, 77L: 15, 82L: 15, 86L: 15, 107L: 15, 108L: 15, 109L: 15, 110L: 15, 114L: 15, 136L: 15, 141L: 15, 142L: 15, 153L: 15, 160L: 15, 169L: 15, 176L: 15, 180L: 15, 186L: 15, 0L: 14, 1L: 14, 36L: 14, 39L: 14, 43L: 14, 60L: 14, 71L: 14, 72L: 14, 76L: 14, 92L: 14, 113L: 14, 131L: 14, 135L: 14, 157L: 14, 171L: 14, 172L: 14, 181L: 14, 189L: 14, 7L: 13, 17L: 13, 20L: 13, 24L: 13, 25L: 13, 32L: 13, 47L: 13, 49L: 13, 101L: 13, 102L: 13, 117L: 13, 121L: 13, 122L: 13, 124L: 13, 130L: 13, 151L: 13, 152L: 13, 165L: 13, 179L: 13, 14L: 12, 21L: 12, 29L: 12, 50L: 12, 63L: 12, 67L: 12, 80L: 12, 84L: 12, 90L: 12, 91L: 12, 96L: 12, 120L: 12, 129L: 12, 139L: 12, 140L: 12, 182L: 12, 193L: 12, 197L: 12, 52L: 11, 75L: 11, 78L: 11, 103L: 11, 116L: 11, 119L: 11, 134L: 11, 137L: 11, 161L: 11, 173L: 11, 12L: 10, 37L: 10, 66L: 10, 98L: 10, 100L: 10, 162L: 10, 170L: 10, 175L: 10, 177L: 10, 187L: 10, 191L: 10, 199L: 10, 48L: 9, 155L: 9, 164L: 9, 174L: 9, 10L: 8, 95L: 8, 99L: 8, 168L: 8, 8L: 7, 40L: 7, 57L: 7, 61L: 7, 154L: 6})

the last one is 6 the first one is 29, nearly 5 times difference

UUIDs are not meant to be random, just unique. If your balancer needs to be keyed off of them, it should run them through a hash function first to get the randomness you want:

import hashlib
actually_random = hashlib.sha1(uuid).digest()

Your testing methodology doesn't make any sense (see below). But first, this is the implementation of uuid4:

def uuid4():
    """Generate a random UUID."""

    # When the system provides a version-4 UUID generator, use it.
    if _uuid_generate_random:
        _buffer = ctypes.create_string_buffer(16)
        _uuid_generate_random(_buffer)
        return UUID(bytes=_buffer.raw)

    # Otherwise, get randomness from urandom or the 'random' module.
    try:
        import os
        return UUID(bytes=os.urandom(16), version=4)
    except:
        import random
        bytes = [chr(random.randrange(256)) for i in range(16)]
        return UUID(bytes=bytes, version=4)

And the randomness returned by libuuid (the ctypes call), os.urandom and random.randrange should be good enough for most non-crypto stuff.

Edit: Ok, my guess as to why your testing methodology is broken: the number you're counting (divide) is biased in two ways: first, it's the result of dividing by a number which isn't a power of two (in this case, 200), which introduces modulo bias. Second, the if divide == part_count: divide = part_count - 1 introduces more bias.

Additionally, you'll need to figure out what the confidence interval is for any random number generator test before you can interpret the results. My stats-foo isn't great here, though, so I can't really help you with that…

Well, UUID is not supposed to be random, it's supposed to be unique : usually, it's based on computer name/ip, date, stuff like that : the goal is not to make it random, the goal is to make sure that two successive calls will provide two different values and that Id from different computers won't collide. If you want more details, you can look at official spec (RFC 4122)

Now, if your load balancer want to use that as a criteria for balancing, I think your design is flawed. If you want a better randomness out of it, you can hash it (like sha-256), thus diluting the little randomness amongst all the bits (that's what a hash is doing)

Only because something doesn't look random, doesn't mean it isn't.

Maybe to the human eye (and mind) some sequences look less random than others, they are not. When you roll a dice 10 times, the probability to roll 2-5-1-3-5-1-3-5-2-6 is as high as rolling 1-1-1-1-1-1-1-1-1-1 or 1-2-3-4-5-6-1-2-3-4. Although the two latter examples seem to be less random, they are not.

Do not try to improve random generators as most probably you will only worsen the output.

For instance: You want to generate a random sequence and it doesn't look random enough to you that one byte appears more frequently than another. Hence you dismiss all sequences with repeated bytes ( or bytes repeated more than n times) in order to assure more randomness. Actually, you are making your sequences less random.