More efficient way to remove items from large data sets

开发者 (https://www.devze.com), 2023-03-28 03:18. Source: the web.

Related topic: python

I have two large lists:

a = [['abcdefghijklmno', 'foo', 'bar'], … ]
b = [['abcdefghij12345', 'foo', 'bar'], … ]

I'm interested in all members of a which don't have a corresponding entry in b, and vice versa, based on comparing a[n][0] and b[n][0] for all n in a and b. I create two sets of these sublist items, which allows me to do set_a.difference(set_b), and vice versa, which is very fast. But creating two lists based on the remaining items in a and b is (perhaps obviously) slower:

def remaining(ls, y, z):
    return [i for i in ls if i[0] in y.difference(z)]

where ls is either a or b, and y and z are the two sets detailed above. Is there any point in rethinking the structure of a and b to speed this up (e.g. using dicts with a[0] and b[0] values as the keys)?
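For reference, the dict-based layout the question asks about might look roughly like the sketch below. It assumes the first element of each sublist is unique within its list (a later duplicate would silently overwrite an earlier one), and the sample rows, including the shared key, are hypothetical stand-ins for the real data.

```python
# Hypothetical sample data standing in for the two large lists.
a = [['abcdefghijklmno', 'foo', 'bar'], ['shared_key_0001', 'baz', 'qux']]
b = [['abcdefghij12345', 'foo', 'bar'], ['shared_key_0001', 'baz', 'qux']]

# Key each row by its first element; the value is the rest of the row.
# Assumption: keys are unique within each list.
dict_a = {row[0]: row[1:] for row in a}
dict_b = {row[0]: row[1:] for row in b}

# In Python 3, dict.keys() is a set-like view, so "-" gives the key difference.
only_in_a = {k: dict_a[k] for k in dict_a.keys() - dict_b.keys()}
only_in_b = {k: dict_b[k] for k in dict_b.keys() - dict_a.keys()}
```

With this layout the separate key sets are no longer needed, at the cost of losing the original list order in the results.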

I suspect that your test in the list comprehension is calling y.difference for each element. Try this:

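def remaining(ls, y, z):
    diff = y.difference(z)  # compute the set difference once, not per element
    # In Python 3, filter() returns a lazy iterator; list() materializes it.
    return list(filter(lambda i: i[0] in diff, ls))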

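At the very least, def remaining(ls, y, z): should be rewritten as def remaining(ls, common_set):, so that the set difference is computed once by the caller and passed in rather than recomputed inside the function.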
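A minimal sketch of that signature, with the difference computed once by the caller. The sample rows and the names set_a, set_b, only_in_a and only_in_b are assumptions made for illustration; only remaining and common_set come from the suggestion above.

```python
def remaining(ls, common_set):
    """Return the sublists of ls whose first element is in common_set."""
    return [item for item in ls if item[0] in common_set]

# Hypothetical sample data standing in for the two large lists.
a = [['abcdefghijklmno', 'foo', 'bar'], ['shared_key_0001', 'baz', 'qux']]
b = [['abcdefghij12345', 'foo', 'bar'], ['shared_key_0001', 'baz', 'qux']]

# Sets of first elements, as described in the question.
set_a = {item[0] for item in a}
set_b = {item[0] for item in b}

# Each difference is built exactly once, then reused for every membership test.
only_in_a = remaining(a, set_a - set_b)
only_in_b = remaining(b, set_b - set_a)
```

Membership tests against a set take roughly constant time on average, so each call is a single pass over ls.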

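Another idea: wrap ['abcdefghijklmno', 'foo', 'bar'] in an object (probably with __slots__) and define its __hash__ and __eq__ using only the 'abcdefghijklmno' value. (Both are needed: a set compares elements with __eq__ once their hashes match.) After that you will be able to do set(a) - set(b), with a and b holding the wrapped objects, and get your task solved.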
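One way the wrapper idea could look is sketched below. The class name Record, its attribute names, and the sample rows are hypothetical. __eq__ is defined alongside __hash__ because set operations rely on equality as well as hashing.

```python
class Record:
    """Wraps one sublist; hashing and equality use only the first element."""
    __slots__ = ('key', 'rest')

    def __init__(self, row):
        self.key = row[0]    # the value used for comparison
        self.rest = row[1:]  # remaining fields, ignored when comparing

    def __hash__(self):
        return hash(self.key)

    def __eq__(self, other):
        if not isinstance(other, Record):
            return NotImplemented
        return self.key == other.key

# Hypothetical sample data standing in for the two large lists.
a = [['abcdefghijklmno', 'foo', 'bar'], ['shared_key_0001', 'baz', 'qux']]
b = [['abcdefghij12345', 'foo', 'bar'], ['shared_key_0001', 'baz', 'qux']]

records_a = {Record(row) for row in a}
records_b = {Record(row) for row in b}

# Plain set difference now compares records by their first element only.
only_in_a = records_a - records_b
only_in_b = records_b - records_a
```

Note that sets discard ordering, and rows sharing the same first element within one list collapse into a single record, so this fits only if neither matters for the data at hand.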