I am having trouble picking the best data structure for solving a problem.
The problem is as below:
I have a nested list of identity codes where the sublists are of varying length.
li = [['abc', 'ghi', 'lmn'], ['kop'], ['hgi', 'ghy']]
I have a 开发者_C百科file with two entries on each line; an identity code and a number.
abc 2.93 ghi 3.87 lmn 5.96
Each sublist represents a cluster. I wish to select the i.d. from each sublist with the highest number associated with it, append that i.d. to a new list and ultimately write it to a new file.
What data structure should the file with numbers be read in as?
Also, how would you iterate over said data structure to return the i.d. with the highest number that matches the i.d. within a sublist?
Thanks, S :-)
You can read the file into a dictionary (string=>int), then use a list comprehension to get the highest identity code from each sublist.
d = {}
with open("data", 'rb') as data:
for line in data:
key, val = line.split(' ')
d[key] = float(val)
ids = [max(sublist, key=lambda k: d[k]) for sublist in li]
For Python 2.4, use:
ids = []
for sublist in li:
subnums = map(lambda x: d[x], sublist)
ids.append(sublist[subnums.index(max(subnums))])
As noted, this is O(n).
My solution assumes that you only want the highest number and not the id that is associated with it.
I'd read the identity codes and the numbers in a dictionary as suggested by Matthew
NEW_LIST = []
ID2NUM = {}
with file('codes') as codes:
for line in codes:
id, num = line.rstrip().split()
ID2NUM[id] = num
I added some numbers so every id has a value. My ID2NUM
looks like this:
{'abc': 2.9300000000000002,
'ghi': 3.8700000000000001,
'ghy': 1.2,
'hgi': 0.40000000000000002,
'kop': 4.3499999999999996,
'lmn': 5.96}
Then process the list li
:
for l in li:
NEW_LIST.append(max([d[x] for x in l]))
>>> NEW_LIST
[5.96, 4.3499999999999996, 1.2]
To write the new list to a file, one number per line:
with file('new_list', 'w') as new_list:
new_list.write('\n'.join(NEW_LIST))
How about storing each sublist as a binary search tree? You'd get O(log n) search performance on average.
Another option would be to use max-heaps and you'd get O(1) to get the maximum value.
精彩评论