I have a large file (5 GB) called my_file. I have a list called my_list. What is the most efficient way to read each line in the file and, if an item from my_list matches an item from a line in my_file, create a new list called matches that contains the items from the lines in my_file AND the items from my_list where a match occurred? Here is what I am trying to do:
def calc(my_file, my_list):
    matches = []
    my_file.seek(0, 0)
    for i in my_file:
        i = i.rstrip('\n').split('\t')
        for v in my_list:
            if v[1] == i[2]:
                item = v[0], i[1], i[3]
                matches.append(item)
    return matches
Here are some lines in my_file:
lion 4 blue ch3
sheep 1 red pq2
frog 9 green xd7
donkey 2 aqua zr8
Here are some items in my_list:
intel yellow
amd green
msi aqua
The desired output, a list of lists, in the above example would be:
[['amd', 9, 'xd7'], ['msi', 2, 'zr8']]
My code currently works, albeit really slowly. Would using a generator or serialization help? Thanks.
You could build a dictionary for looking up v. I added a few further small optimizations:
def calc(my_file, my_list):
    # map the second field of each my_list item to the first for O(1) lookup
    vd = dict((v[1], v[0]) for v in my_list)
    my_file.seek(0, 0)
    for line in my_file:
        f0, f1, f2, f3 = line[:-1].split('\t')
        v0 = vd.get(f2)
        if v0 is not None:
            yield (v0, f1, f3)
This should be much faster for a large my_list. Using get is faster than checking whether i[2] is in vd and then accessing vd[i[2]], since get does a single dictionary lookup instead of two.
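If you want to verify that, here is a minimal timeit sketch; the dict vd and the key here are toy placeholders, not the data from the question:

import timeit

vd = {str(n): n for n in range(1000)}  # toy stand-in for the lookup table
key = '500'

# one dictionary lookup via get
t_get = timeit.timeit(lambda: vd.get(key), number=1000000)
# two dictionary lookups: membership test plus indexing
t_two = timeit.timeit(lambda: vd[key] if key in vd else None, number=1000000)

print("get: %.3fs, in + index: %.3fs" % (t_get, t_two))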
For speedups beyond these optimizations, I recommend Cython: http://www.cython.org
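For illustration only, a rough sketch of what compiling the hot loop with Cython might look like; the file names calc.pyx and setup.py are my own placeholders, and a recent Cython that can compile generators is assumed:

# calc.pyx -- same logic as above, compiled to C by Cython
def calc(my_file, dict vd):
    my_file.seek(0, 0)
    for line in my_file:
        f0, f1, f2, f3 = line[:-1].split('\t')
        v0 = vd.get(f2)
        if v0 is not None:
            yield (v0, f1, f3)

# setup.py -- build in place with: python setup.py build_ext --inplace
from setuptools import setup
from Cython.Build import cythonize

setup(ext_modules=cythonize("calc.pyx"))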
Keep the items in a dictionary rather than a list (let's call it items). Now iterate through your file as you're doing, pick out the key to look for (i[2]), and then check whether it's in items.
items would be:

dict(yellow="intel", green="amd", aqua="msi")
So the checking part would be:

if i[2] in items:
    yield [items[i[2]], i[1], i[3]]
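Assembled into a full function, a minimal sketch of what this answer describes (the function signature here is my own, not the answer's):

def calc(my_file, items):
    # items maps the field to match on to its replacement,
    # e.g. dict(yellow="intel", green="amd", aqua="msi")
    my_file.seek(0, 0)
    for line in my_file:
        i = line.rstrip('\n').split('\t')
        if i[2] in items:
            yield [items[i[2]], i[1], i[3]]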
Since you're just creating the list and returning it, using a generator might help the memory characteristics of the program, rather than putting the whole thing into a list and returning it.
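A small usage sketch of the difference, assuming the generator version of calc above (process is a placeholder for your own handling):

# lazy: only one match is in memory at a time
for match in calc(my_file, items):
    process(match)

# eager: materializes every match at once, like the original code
matches = list(calc(my_file, items))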
There isn't much you can do about the overhead of reading the file in, but based on your example code, you can speed up the matching by storing your list as a dict (with the target field as the key).
Here's an example, with a few extra optimisation tweaks:
mylist = {
    "yellow": "intel",
    "green": "amd",
    # ....
}

matches = []
for line in my_file:
    i = line[:-1].split("\t")
    try:  # faster to ask for forgiveness than permission
        matches.append([mylist[i[2]], i[1], i[3]])
    except KeyError:  # a missing key raises KeyError, not NameError
        pass
But again, do note that most of your performance bottleneck will be in reading the file, and optimisation at this level may not have a big enough impact on the runtime.
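If you want to confirm where the time goes before optimising further, a rough timing sketch; the file name "my_file" and the field positions are assumptions carried over from the question:

import time

# cost of just reading and splitting the file
start = time.time()
with open("my_file") as f:
    for line in f:
        line[:-1].split("\t")
print("read + split: %.1fs" % (time.time() - start))

# cost of reading, splitting, and matching against the dict
start = time.time()
matches = []
with open("my_file") as f:
    for line in f:
        i = line[:-1].split("\t")
        if i[2] in mylist:
            matches.append([mylist[i[2]], i[1], i[3]])
print("read + split + match: %.1fs" % (time.time() - start))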
Here's a variation on @rocksportrocker's answer using the csv module:
import csv

def calc_csv(lines, lst):
    d = dict((v[1], v[0]) for v in lst)  # use dict to speed up membership test
    return ((d[f2], f1, f3)
            for _, f1, f2, f3 in csv.reader(lines, dialect='excel-tab')
            if f2 in d)  # assume that intersection is much less than the file
Example:
def test():
    # fields are tab-separated, as in the question
    my_file = """\
lion\t4\tblue\tch3
sheep\t1\tred\tpq2
frog\t9\tgreen\txd7
donkey\t2\taqua\tzr8
""".splitlines()
    my_list = [
        ("intel", "yellow"),
        ("amd", "green"),
        ("msi", "aqua"),
    ]
    res = list(calc_csv(my_file, my_list))
    assert [('amd', '9', 'xd7'), ('msi', '2', 'zr8')] == res

if __name__ == "__main__":
    test()