selecting the earliest entry in a list that also satisfies other criteria_问答_开发者

Say I have a list L1, and the entries in L1 have 4 parts and are formatted like this cat1, cat2, date, ID. The list is sorted, so alphabetically by the cat1 entries, then alphabetically by the cat2 entries, then by the earliest date. I want a subset of this list that contains the entry with earliest date for each cat1, cat2 pair. This is the code that I have that does this already:

L1=[A, X, 2008-06-01, 1858
A, X, 2008-12-05, 1905
B, X, 2001-08-08, 1149
B, Y, 2006-03-05, 1638
B, Y, 2009-06-09, 1950
C, X, 2005-12-01, 1611
C, X, 2006-08-08, 1689
C, X, 2006-11-22, 1712
C, X, 2008-04-22, 1842
C, Y, 2008-12-05, 1816
C, Y, 2008-12-05, 1821
C, Y, 2008-12-05, 1882
C, Z, 2008-12-05, 1905
C, Z, 2009-06-01, 1935
C, Z, 2009-06-09, 1950
D, X, 2009-11-06, 1989
D, Y, 2008-12-05, 1905
D, Z, 2008-12-05, 1905
D, Z, 2008-12-05, 1905
E, X, 2008-12-05, 1905
E, Z, 2008-12-05, 1905
F, Y, 2008-12-05, 1905
G, X, 2008-12-05, 1905
G, Z, 2007-12-01, 1807]

L2=[j.next() for i, j in itertools.groupby(L1, lambda x: x.split(",", 2)[:2])]

L2=[A, X, 2008-06-01, 1858
B, X, 2001-08-08, 1149
B, Y, 2006-03-05, 1638
C, X, 2005-12-01, 1611
C, Y, 2008-12-05, 1816
C, Z, 2008-12-05, 1905
D, X, 2009-11-06, 1989
D, Y, 2008-12-05, 1905
D, Z, 2008-12-05, 1905
E, X, 2008-12-05, 1905
E, Z, 2008-12-05, 1905
F, Y, 2008-12-05, 1905
G, X, 2008-12-05, 1905
G, Z, 2007-12-01, 1807]

The trick now is that I want the earliest entry for each cat1, cat2 pair where the ID is in a list of values in <=3 keys in dict1 AND dict2. In other words, once the earliest entry for the cat1, cat2 pair is found is should be tested in each dict1 and dict2 and if the ID is found to be contained in the value list for 4+ keys of either dictionary, it should go to the next earliest entry for that cat1, cat2 pair, and in order to add the entry to L2 its ID should be in 3 or less keys in both dict1 and dict2. I'm not quite sure how to go about this... maybe use re.search or something?

dict1[key]=[ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID]    
dict2[key]=[ID,ID,ID,ID,ID,ID,ID,ID,ID,ID开发者_如何转开发,ID,ID,ID,ID,ID,ID,ID,ID,ID,ID]

so instead of L2 having just the earliest entry per cat1, cat2 pair, it will contain the earliest entry where the ID from that entry is not among the ID list in 4+ keys in both dict1 AND dict2.

If dict1 and dict2's value-lists are not too big, you could generate the set of all valid IDs first, then filter L1 to contain only those tuples (X,Y,date,ID) whose IDs are in the set of value IDs:

import collections
def valid_ids(*dcts):
    valid=collections.defaultdict(int)
    for dct in dcts:
        for key,value in dct.iteritems():
            valid[value]+=1
    return set(value for value,count in valid.iteritems() if count<=3)

ids=valid_ids(dict1,dict2)

L1_filtered=[text.split(',') for text in L1 if text.split(',')[-1].strip() in ids]
L2 = [j.next() for i, j in itertools.groupby(L1_filtered, lambda x: x.split(",", 2)[:2])]

Note that if dict1 and dict2 have value-lists with a huge number of IDs, then the above method is not ideal, because you will waste a lot of time determining the set of value ids while in forming L2 you may only need a little bit of this data.

Using Hugh Bothwell's idea, if dict1 and dict2 have large value-lists, then it might pay to just check if particular IDs are valid as needed :

def is_valid(ID,*dcts):    
    return sum(1 for dct in dcts
               for key,value in dct.iteritems()
               if ID in value) <= 3       

L2=[]
for key, group in itertools.groupby(L1, lambda x: x.split(",", 2)[:2]):
    for text in group:
        X,Y,date,ID = text.split(',')
        X = X.strip()
        Y = Y.strip()
        date = date.strip()
        ID = ID.strip()
        if is_valid(ID,dict1,dict2):
            L2.append(X,Y,date,ID)
            break
    else:
        # There is no valid ID for this group!
        continue

Note that if you use the first method, with valid_ids, you loop through the dicts just once. If you use the second method, you loop through the dicts at least once for every group (unique X and Y pairs), and possibly multiple times for each group.

My guess is that the first method will be faster for most data sets, but profiling both methods with your real data is probably the safest way to tell.

I think you need something like

L2 = []
for xy,rem in itertools.groupby(L1, lambda x: x.split(",", 2)[:2]):
    for s in rem:
        date,id = s.split(",")
        if TEST_ID(id):
            L2.append(s)
            break