开发者

is there a data structure equivalent to a non-unique set in python?

开发者 https://www.devze.com 2023-02-25 19:02 出处:网络
I have a very large list of lists of integers which I would like to \"hash()\" to improve search speed.The resulting hashed value for each of the nested lists needs to be independent of the order of t

I have a very large list of lists of integers which I would like to "hash()" to improve search speed. The resulting hashed value for each of the nested lists needs to be independent of the order of the integers and only on the values in the list. This suggested a (frozen) set as a suitable data structure to hash. However, I need to keep every integer val开发者_高级运维ue, whether or not duplicated, which is a show-stopper for sets.

So, that left me ordering the list, converting to a tuple and hashing which is quite slow and I imagine there are better strategies.

I'd appreciate any suggestions on how to do this more efficiently.


Dictionaries are hashes.

>>> def bag_random(d, n):
...     x = random.randint(0, n)
...     if x in d:
...             d[x] += 1
...     else:
...             d[x] = 1
... 
>>> a = {}
>>> for i in xrange(10**6):
...     bag_random(a, 100)
... 
>>> a
{0: 9856, 1: 9787, 2: 9944, 3: 9822, 4: 9978, 5: 9915, 6: 9915, 7: 9860, 8: 9926, 9: 9756, 10: 9914, 11: 10030, 12: 10009, 13: 9803, 14: 9918, 15: 10136, 16: 9818, 17: 9868, 18: 9893, 19: 9971, 20: 9998, 21: 9982, 22: 9884, 23: 9806, 24: 9998, 25: 9926, 26: 9977, 27: 10011, 28: 10030, 29: 9899, 30: 9808, 31: 9825, 32: 9980, 33: 9812, 34: 9928, 35: 9827, 36: 9934, 37: 9883, 38: 9913, 39: 9893, 40: 9822, 41: 9714, 42: 9871, 43: 9954, 44: 9989, 45: 9694, 46: 9878, 47: 9984, 48: 9893, 49: 9928, 50: 10093, 51: 9881, 52: 9828, 53: 9660, 54: 9884, 55: 9745, 56: 10048, 57: 9845, 58: 9916, 59: 9933, 60: 9944, 61: 9979, 62: 9992, 63: 9635, 64: 9811, 65: 9900, 66: 9950, 67: 9744, 68: 9829, 69: 10037, 70: 9929, 71: 9811, 72: 9830, 73: 10056, 74: 9957, 75: 9992, 76: 9777, 77: 9942, 78: 9809, 79: 9734, 80: 9855, 81: 10021, 82: 9914, 83: 10009, 84: 10018, 85: 9961, 86: 10036, 87: 9849, 88: 9951, 89: 9770, 90: 9795, 91: 9899, 92: 9975, 93: 9935, 94: 10037, 95: 9992, 96: 9868, 97: 10014, 98: 9689, 99: 9883, 100: 9878}

Took a second or so on a not wildly fast desktop.


Your data structure is fine, what you need is a way to compute a hash over it that meets your requirements: order of integers doesn't matter, but duplicates need to be respected, and it needs to be fast.

How about computing the product of the numbers? The resulting number will work well as a hash. You can keep it within a 32-bit integer if you want to avoid slipping into longs that will slow you down. The only problem is zeros, but you can skip them, it won't break the hash, only make it less discriminating.

LIMIT = 999999937 # largest 9-digit prime
def list_hash(l):
    h = 1
    for i in l:
        if i:
            h *= i
            h %= LIMIT
    return h


Create a dictionary with the counts. (If your version of python is new enough you can use the Counter class instead. Construct a set out of the items list and hash that

counter = collections.defaultdict(int)
for item in items:
     counter[item] += 1
return hash(frozenset(counter.items()))

But I don't know that it will be any more efficient that you've already done.

Since this is a hash, it doesn't need to represent the fact that some of your numbers are duplicated. So you could just use:

return hash(frozenset(items))


Using Counters as Winston Ewert suggested has other perks. Not only can you naturally have a data structure that you described (Counter inherits from dict and is therefore hashed), but you can also do some neat stuff, like arithmetic that would probably be useful for your case.

from collections import Counter
set1 = Counter("hello")
set2 = Counter("hell")

print(set1)
>> Counter({'l': 2, 'h': 1, 'e': 1, 'o': 1})

print(set1)
>> Counter({'l': 2, 'h': 1, 'e': 1})

set1 - set2
>> Counter({'o': 1})
0

精彩评论

暂无评论...
验证码 换一张
取 消