I'm using hash_ring
package for distributing objects among servers. I've assumed that distribution would be uniform, as it's based on开发者_如何学JAVA MD5 hashes. Unfortunately it's not the case.
I'm using random keys which are generated using uuid.uuid4()
. I've verified, that MD5 itself in fact does give uniform distribution. However, when I'm distributing using hash_ring.HashRing
there are differences of 20-30% between most and least populated buckets.
- Can uniformity of distribution be improved in
hash_ring
by tweaking some settings? - Are there other good alternatives to do consistent hashing in Python?
The code I've used to test uniformity of distribution:
ring = hash_ring.HashRing(range(8))
for _ in range(10):
counters = [0]*8
for _ in range(100000):
counters[ring.get_node(str(uuid.uuid4()))] += 1
print counters
Which printed out:
[11115, 11853, 14033, 11505, 13640, 12292, 12851, 12711]
[11164, 11833, 14024, 11562, 13365, 12302, 13002, 12748]
[11354, 11756, 14017, 11583, 13201, 12231, 13135, 12723]
[11182, 11672, 13936, 11441, 13563, 12240, 13129, 12837]
[11069, 11866, 14045, 11541, 13443, 12249, 12894, 12893]
[11196, 11791, 14158, 11533, 13517, 12319, 13039, 12447]
[11297, 11944, 14114, 11536, 13154, 12289, 12990, 12676]
[11250, 11770, 14145, 11657, 13340, 11960, 13161, 12717]
[11027, 11891, 14099, 11615, 13320, 12336, 12891, 12821]
[11148, 11677, 13965, 11557, 13503, 12309, 13123, 12718]
For comparison I've did non-consistent hashing directly using MD5:
for _ in range(10):
counters = [0]*8
for _ in range(100000):
counters[int(hashlib.md5(str(uuid.uuid4())).hexdigest(),16)%8] += 1
print counters
with much better results:
[12450, 12501, 12380, 12643, 12446, 12444, 12506, 12630]
[12579, 12667, 12523, 12385, 12386, 12445, 12702, 12313]
[12624, 12449, 12451, 12401, 12580, 12449, 12562, 12484]
[12359, 12543, 12371, 12659, 12508, 12416, 12619, 12525]
[12425, 12526, 12565, 12732, 12381, 12481, 12335, 12555]
[12514, 12576, 12528, 12294, 12658, 12319, 12518, 12593]
[12500, 12471, 12460, 12502, 12637, 12393, 12442, 12595]
[12583, 12418, 12428, 12311, 12581, 12780, 12372, 12527]
[12395, 12569, 12544, 12319, 12607, 12488, 12424, 12654]
[12480, 12423, 12492, 12433, 12427, 12502, 12635, 12608]
the hash ring sacrifices the "eveness" of your md5 test code to maintain mappings when the number of entries changes. see http://www.lexemetech.com/2007/11/consistent-hashing.html. so the differences you see are not because of uuid4, or because of an error, but because the library uses a different algorithm from your test.
if you changed the number of bins in your md5 code you'd need to change the modular division (the % 8
) and suddenly (almost) all mappings would change. consistent hashing avoids this. that is why it can't use the same "obviously uniform" approach you do.
the downside of the consistency approach is that it's not perfectly uniform (it depends on the hashes of the the bins, rather than on the hashes of the objects you put in the bins, so you don't get the "evening out" you would expect as you add more objects). but you can increase uniformity by increasing the number of points on the ring - by increasing the number of "replicas" used in the code.
so assuming that the final api matches that shown at http://amix.dk/blog/viewEntry/19367 just set a larger value for replicas
(try doubling it, or even just adding 1 - you're already pretty flat).
update: i looked around a bit more and this blog post http://amix.dk/blog/post/19369 describes the changes made to the latest code. it looks like that does use more replicas than 3 (i don't completely understand the code, sorry) and it also seems that it's now based on a well-known "standard" implementation.
精彩评论