
Filtering two lists by comparing all items to each other with numpy or tabular

Developer https://www.devze.com 2023-03-09 02:02 Source: Internet
I have two lists of tuples, where the tuples in each list are all unique. The lists have the following format:

[('col1', 'col2', 'col3', 'col4'), ...]

I'm using a nested loop to find the members of both lists that have the same values for two given columns, col2 and col3:

temp1 = set([])
temp2 = set([])
for item1 in list1:
    for item2 in list2:
        if item1['col2'] == item2['col2'] and \
            item1['col3'] == item2['col3']:
            temp1.add(item1)
            temp2.add(item2)

This works, but it takes many minutes to complete when the lists contain tens of thousands of items.

Using tabular, I can filter list1 against col2 and col3 of one item from list2, as shown below:

list1 = tb.tabular(records=[...], names=['col1','col2','col3','col4'])
...

for (col1, col2, col3, col4) in list2:
    list1[(list1['col2'] == col2) & (list1['col3'] == col3)]    

This is obviously "doing it wrong", and it is much slower than the first approach.

How can I effectively check the items of one list of tuples against all the items of another using numpy or tabular?

Thanks.


Try this:

temp1 = set([])
temp2 = set([])

dict1 = dict()
dict2 = dict()

for key, value in zip([tuple(l[1:3]) for l in list1], list1):
    dict1.setdefault(key, list()).append(value)

for key, value in zip([tuple(l[1:3]) for l in list2], list2):
    dict2.setdefault(key, list()).append(value)

for key in dict1:
    if key in dict2:
        temp1.update(dict1[key])
        temp2.update(dict2[key])

It's a bit dirty, but it should work.
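The same grouping idea can be written a little more compactly with collections.defaultdict and a key intersection. This is only a sketch on made-up sample data: the helper name group_by_cols is hypothetical, and the rows are assumed to be plain tuples indexed by position rather than tabular records.

```python
from collections import defaultdict

def group_by_cols(rows, cols=(1, 2)):
    """Group rows by the values found at the given column positions."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[c] for c in cols)
        groups[key].append(row)
    return groups

# Made-up sample data in the form (col1, col2, col3, col4).
list1 = [(0, 1, 2, 0), (0, 3, 4, 0), (1, 2, 3, 12)]
list2 = [(0, 1, 1, 0), (0, 3, 9, 9), (42, 2, 3, 12), (7, 1, 2, 5)]

dict1 = group_by_cols(list1)
dict2 = group_by_cols(list2)

# Keys present in both dictionaries identify the matching rows.
shared_keys = dict1.keys() & dict2.keys()

temp1 = {row for key in shared_keys for row in dict1[key]}
temp2 = {row for key in shared_keys for row in dict2[key]}

print(temp1)
print(temp2)
```

Each list is traversed once to build its dictionary, so the work grows with the combined length of the lists instead of their product.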


"how can i effectively check items of a list of tuples against all the items of another using numpy or tabular"

Well, I have no experience with tabular, and very little with numpy, so I can't give you an exact "canned" solution. But I think I can point you in the right direction. If list1 has length X and list2 has length Y, you're making X * Y checks, while you only need to make about X + Y: one dictionary insertion per item of list1 and one dictionary lookup per item of list2.

You should do something like the following (I'm going to pretend these are lists of regular Python tuples - not tabular records - I'm sure you can make the necessary adjustments):

common = {}
for item in list1:
    key = (item[1], item[2])
    if key in common:
        common[key].append(item)
    else:
        common[key] = [item]

first_group = []
second_group = []
for item in list2:
    key = (item[1], item[2])
    if key in common:
        first_group.extend(common[key])
        second_group.append(item)

temp1 = set(first_group)
temp2 = set(second_group)
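
To check that the dictionary version really returns the same sets as the nested loop, and to get a rough feel for the X * Y versus X + Y difference, something like the following sketch could be used. The data is random and made up, the function names are hypothetical, and the rows are again assumed to be plain tuples; the timings will vary by machine.

```python
import random
import time

def nested_loop(list1, list2):
    """Quadratic version: compare every row of list1 with every row of list2."""
    temp1, temp2 = set(), set()
    for item1 in list1:
        for item2 in list2:
            if item1[1] == item2[1] and item1[2] == item2[2]:
                temp1.add(item1)
                temp2.add(item2)
    return temp1, temp2

def dict_lookup(list1, list2):
    """Linear version: index list1 by (col2, col3), then scan list2 once."""
    common = {}
    for item in list1:
        common.setdefault((item[1], item[2]), []).append(item)
    temp1, temp2 = set(), set()
    for item in list2:
        key = (item[1], item[2])
        if key in common:
            temp1.update(common[key])
            temp2.add(item)
    return temp1, temp2

def make_rows(n, seed):
    """Build n unique random 4-tuples (made-up test data)."""
    rng = random.Random(seed)
    rows = set()
    while len(rows) < n:
        rows.add(tuple(rng.randrange(50) for _ in range(4)))
    return list(rows)

list1 = make_rows(1000, seed=1)
list2 = make_rows(1000, seed=2)

start = time.perf_counter()
slow = nested_loop(list1, list2)
middle = time.perf_counter()
fast = dict_lookup(list1, list2)
end = time.perf_counter()

print("same result:", slow == fast)
print("nested loop: %.4f s, dict lookup: %.4f s" % (middle - start, end - middle))
```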


I'd create a subclass of tuple which has special __eq__ and __hash__ methods:

>>> class SpecialTuple(tuple):
...     def __eq__(self, t):
...             return self[1] == t[1] and self[2] == t[2]
...     def __hash__(self):
...             return hash((self[1], self[2]))
... 

It compares col2 and col3 (indices 1 and 2) and treats two tuples as equal when these columns are identical.

Then filtering is just a set intersection on these special tuples:

>>> list1 = [ (0, 1, 2, 0), (0, 3, 4, 0), (1, 2, 3, 12) ]
>>> list2 = [ (0, 1, 1, 0), (0, 3, 9, 9), (42, 2, 3, 12) ]
>>> set(map(SpecialTuple, list1)) & set(map(SpecialTuple, list2))
set([(42, 2, 3, 12)])
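
Note that the intersection keeps only one tuple per matching key, and only from one of the two sets, so on its own it does not produce both temp1 and temp2 from the question. One possible way around that, sketched here under the assumption that the rows are plain tuples, is to use the sets of SpecialTuple purely for membership tests:

```python
class SpecialTuple(tuple):
    """Tuple that compares and hashes on positions 1 and 2 only (col2, col3)."""

    def __eq__(self, other):
        return self[1] == other[1] and self[2] == other[2]

    def __ne__(self, other):
        # Keep != consistent with the customised ==.
        return not self.__eq__(other)

    def __hash__(self):
        return hash((self[1], self[2]))

list1 = [(0, 1, 2, 0), (0, 3, 4, 0), (1, 2, 3, 12)]
list2 = [(0, 1, 1, 0), (0, 3, 9, 9), (42, 2, 3, 12)]

# These sets act as lookup tables keyed on (col2, col3).
keys1 = set(map(SpecialTuple, list1))
keys2 = set(map(SpecialTuple, list2))

# Keep the original rows whose key also occurs in the other list.
temp1 = {row for row in list1 if SpecialTuple(row) in keys2}
temp2 = {row for row in list2 if SpecialTuple(row) in keys1}

print(temp1)
print(temp2)
```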

I don't know how fast it is. Tell me. :)
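
Since the question asks about numpy specifically, the same key-matching idea can also be sketched with arrays. This assumes every column is numeric so the rows fit in a single 2-D array, which the question does not confirm; the variable names are illustrative only. It relies on np.unique with axis=0 and on np.isin.

```python
import numpy as np

list1 = [(0, 1, 2, 0), (0, 3, 4, 0), (1, 2, 3, 12)]
list2 = [(0, 1, 1, 0), (0, 3, 9, 9), (42, 2, 3, 12)]

a1 = np.array(list1)
a2 = np.array(list2)

# Stack the (col2, col3) pairs of both arrays and give every distinct
# pair an integer id, so that pairs can be compared as plain integers.
pairs = np.vstack([a1[:, 1:3], a2[:, 1:3]])
_, ids = np.unique(pairs, axis=0, return_inverse=True)
ids = ids.ravel()  # flatten in case the ids come back with an extra axis
ids1, ids2 = ids[:len(a1)], ids[len(a1):]

# A row matches when its pair id also occurs in the other array.
mask1 = np.isin(ids1, ids2)
mask2 = np.isin(ids2, ids1)

temp1 = set(map(tuple, a1[mask1].tolist()))
temp2 = set(map(tuple, a2[mask2].tolist()))

print(temp1)
print(temp2)
```

Whether this beats the plain dictionary approach depends on the data; it is worth timing both on real input.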
