Given: Two csv files (1.8 MB each): AllData_1, AllData_2. Each with ~8,000 lines. Each line consists of 8 columns. [txt_0,txt_1,txt_2,txt_3,txt_4,txt_5,txt_6,txt_7,txt_8]
Goal: Based on a match of txt_0 (or, AllData_1[0] == AllData_2 ), compare the contents of the next 4 columns for these individual rows. If the data is unequal, put the entire row for each set of data in an list based on the column being dif开发者_StackOverflowferent and save lists to output file. If txt_0 is one data set but not the other, then save that directly to the output file.
Example:
AllData_1 row x contains: [a1, b2, c3, d4, e5, f6, g7, h8] AllData_2 row y contains: [a1, b2, c33c, d44d, e5, f6, g7, h8]
Program saves all of row x and y in lists corresponding to ListCol2 and ListCol3. After all comparing is finished, the lists are saved to file.
How can I make my code faster or change my code to a faster algorithm?
i = 0
x0list = []
y0list = []
col1_diff = col2_diff = col3a_diff = col3b_diff = col4_diff = []
#create list out of column 0
for y in AllData_2:
y0list.append(y[0])
for entry in AllData_1:
x0list.append(entry[0])
if entry[0] not in y0list:
#code to save the line to file...
for y0 in AllData_2:
if y0[0] not in x0list:
#code to save the line to file...
for yrow in AllData_2:
i+=1
for xrow in AllData_1:
foundit = 0
if yrow[0] == xrow[0] and foundit == 0 and (yrow[1] != xrow[1] or yrow[2] != xrow[2] or yrow[3] != xrow[3] or yrow[4] != xrow[4]):
if yrow[1] != xrow[1]:
col1_diff.append(yrow)
col1_diff.append(xrow)
foundit = 1
elif yrow[2] != xrow[2]:
col2_diff.append(yrow)
col2_diff.append(xrow)
foundit = 1
elif len(yrow[3]) < len(xrow[3]):
col3a_diff.append(yrow)
col3a_diff.append(xrow)
foundit = 1
elif len(yrow[3]) >= len(xrow[3]):
col3b_diff.append(yrow)
col3b_diff.append(xrow)
foundit = 1
else:
#col4 is actually a catch-all for any other differences between lines if [0]s are equal
col4_diff.append(yrow)
col4_diff.append(xrow)
foundit = 1
Right of the top, you can make this a lot smaller.
y0list = []
for y in AllData_2:
y0list.append(y[0])
is just a verbose way of saying
y0list = [y[0] for y in AllData_2]
And you can use in builtin comparisons. The below
(yrow[1] != xrow[1] or yrow[2] != xrow[2] or yrow[3] != xrow[3] or yrow[4] != xrow[4])
can be expressed as
yrow[1:] != xrow[1:]
which is much less prone to copy/paste errors.
To make it faster, you can avoid doing O(n**2) comparisons. Since you only care when the first column element is the same, you can just bundle them by first element.
index = {}
for yrow in AllData_2:
key = yrow[0]
list = index.get(key)
if list is None:
list = []
index[key] = list
list.append(yrow)
for xrow in AllData_1:
list = index.get(xrow[0])
if list is None: continue
for yrow in list:
# Do all your comparison here
If you can expect no two lines in a given file to have the same data in column 0, you can significantly improve your code with a few dict
s. Instead of the lines
x0list.append(entry[0])
y0list.append(y[0])
You would use:
x0dict[entry[0]] = entry
y0dict[y[0]] = y
after initializing x0dict
and y0dict
to {}
. Then, instead of looping through both complete sets of data again, you can loop over just one of the dicts:
for x0, xrow in x0dict:
if x0 in y0dict:
yrow = y0dict[x0]
# Do the col{1,2,3,4}_diff stuff here
As a bonus, the not in
in your second and third loops works the same.
The line
(yrow[1] != xrow[1] or yrow[2] != xrow[2] or yrow[3] != xrow[3] or yrow[4] != xrow[4])
can be replaced with the nicer-looking
yrow[1:5] != xrow[1:5]
As your code stands right now, i
is never used, but if you need that count, it ends up being identical to just saying i = len(AllData_2)
, since it only increments once per run in a loop over AllData_2
.
Finally, your foundit
variable currently serves no purpose. It is only used to control the flow with foundit == 0
, immediately after setting it to 0, so that will always evaluate to True
and setting it has no effect.
精彩评论