Marking duplicates in a CSV file


I'm stumped with a problem illustrated in the sample below:

"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,,
2,"PETER",6232,,
3,"JON",12345,,
4,"PETERSON",6232,,
5,"ALEX",7854,,
6,"JON",12345,,

I want to detect duplicates in the "PHONE" column and mark each subsequent duplicate by setting its "REF" column to the "ID" of the first occurrence and its "DISCARD" column to "Yes":

"ID","NAME","PHONE","REF","DISCARD"
1,"JOHN",12345,1,
2,"PETER",6232,2,
3,"JON",12345,1,"Yes"
4,"PETERSON",6232,2,"Yes"
5,"ALEX",7854,,
6,"JON",12345,1,"Yes"

So, how do I go about it? I tried this code but my logic wasn't right, of course.

import csv
myfile = open("C:\Users\Eduardo\Documents\TEST2.csv", "rb")
myfile1 = open("C:\Users\Eduardo\Documents\TEST2.csv", "rb")

dest = csv.writer(open("C:\Users\Eduardo\Documents\TESTFIXED.csv", "wb"), dialect="excel")

reader = csv.reader(myfile)
verum = list(reader)
verum.sort(key=lambda x: x[2])
for i, row in enumerate(verum):
    if row[2] == verum[i][2]:
        verum[i][3] = row[0]

print verum

Your direction and help would be much appreciated.


The only thing you have to keep in memory while this is running is a map of phone numbers to their IDs.

map = {}
with open(r'c:\temp\input.csv', 'r') as fin:
    reader = csv.reader(fin)
    with open(r'c:\temp\output.csv', 'w') as fout:
        writer = csv.writer(fout)
        # omit this if the file has no header row
        writer.writerow(next(reader))
        for row in reader:
            (id, name, phone, ref, discard) = row
            if phone in map:
                ref = map[phone]
                discard = "YES"
            else:
                map[phone] = id
            writer.writerow((id, name, phone, ref, discard))


Sounds like homework. Since this is a CSV file (and thus changing the record size in place is next to impossible), you are best off loading the whole file into memory and manipulating it there before writing it out to a new file. Create a list of strings holding the original lines of the file. Then create a map from phone number (the key) to id (the value). Before each insert, look the number up: if it already exists, update the line containing the duplicate phone number; if it isn't already in the map, insert the (phone, id) pair. A sketch of that approach follows.
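
A minimal sketch of that approach, assuming Python 2 (to match the rest of the thread) and hypothetical input.csv/output.csv paths. Two passes over the in-memory rows reproduce the exact output shown in the question, where every member of a duplicate group gets REF set and only the non-first members get DISCARD:

import csv

# Hypothetical file names; substitute your own paths.
with open('input.csv', 'rb') as fin:
    rows = list(csv.reader(fin))        # the whole file in memory, as suggested above

header, body = rows[0], rows[1:]

# First pass: map each phone number to the id of its first occurrence,
# and count how often each number appears.
first_seen, counts = {}, {}
for row in body:
    first_seen.setdefault(row[2], row[0])
    counts[row[2]] = counts.get(row[2], 0) + 1

# Second pass: every member of a duplicate group gets REF = first id;
# every member except the first also gets DISCARD = "Yes".
for row in body:
    if counts[row[2]] > 1:
        row[3] = first_seen[row[2]]
        if row[0] != first_seen[row[2]]:
            row[4] = "Yes"

with open('output.csv', 'wb') as fout:
    writer = csv.writer(fout)
    writer.writerow(header)
    writer.writerows(body)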


I know one thing. I know you don't have to read the entire file into memory to accomplish this.

import csv
myfile = "C:\Users\Eduardo\Documents\TEST2.csv"

dest = csv.writer(open("C:\Users\Eduardo\Documents\TESTFIXED.csv", "wb"), dialect="excel")

phonedict = {}

reader = csv.reader(open(myfile, "rb"))
dest.writerow(next(reader))  # pass the header row through untouched

for row in reader:
    # setdefault stores row[0] for this phone number if the number hasn't
    # been seen yet, and then returns whatever value is in the dictionary.
    firstid = phonedict.setdefault(row[2], row[0])
    row[3] = firstid
    if firstid != row[0]:
        row[4] = "Yes"
    dest.writerow(row)


from operator import itemgetter
from itertools import groupby

import csv
verum = list(csv.reader(open('data.csv','rb')))  # reader objects can't be sorted, so load into a list

verum.sort(key=itemgetter(2,0))
def grouper( verum ):
    for key, grp in groupby(verum,itemgetter(2)):
        # key = phone number, grp = records with that number
        first = grp.next()
        # first item gets its id written into the 4th column
        yield [first[0],first[1],first[2],first[0],''] #or list(itemgetter(0,1,2,0,4)(first)) 
        for x in grp:
            # all others get the first item's id as ref
            yield [x[0],x[1],x[2], first[0], "Yes"]

for line in sorted(grouper(verum), key=itemgetter(0)):
    print line

Outputs:

['1', 'JOHN', '12345', '1', '']
['2', 'PETER', '6232', '2', '']
['3', 'JON', '12345', '1', 'Yes']
['4', 'PETERSON', '6232', '2', 'Yes']
['5', 'ALEX', '7854', '5', '']
['6', 'JON', '12345', '1', 'Yes']

Writing the data back is left to the reader ;-)
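
For completeness, a minimal sketch of that write-back step, reusing grouper and verum from the answer above and assuming a hypothetical fixed.csv output path:

import csv
from operator import itemgetter

# grouper() and verum are defined above; 'fixed.csv' is a hypothetical name.
with open('fixed.csv', 'wb') as fout:
    writer = csv.writer(fout, dialect='excel')
    writer.writerows(sorted(grouper(verum), key=itemgetter(0)))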


I work with large CSV files of 40k+ records, and the easiest way to get rid of dupes is with Access:

1. Create a new database.
2. On the Tables tab, choose Get External Data and import the CSV.
3. Save the table.
4. On the Queries tab, run the New > Find Duplicates wizard (match on the phone field, show all fields and the count).
5. Save the query, then export it as .txt named dupes.txt.
6. Import the query result as a new table; do not import the field with the dupe count.
7. Run a Find Unmatched query (match by the phone field, show all fields in the result), save it, then export it as .txt named unique.txt.
8. Import the unique file into the existing table (dupes).
9. You can now save and export again into whatever file type you need, without any dupes.
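
The same find-duplicates idea, sketched with Python's sqlite3 module as a stand-in for the Access wizard (not the wizard itself; the table name and file names here are hypothetical):

import csv
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute("CREATE TABLE contacts (id TEXT, name TEXT, phone TEXT, ref TEXT, discard TEXT)")

# Load the CSV (step 2 above), skipping the header row.
with open('TEST2.csv', 'rb') as f:
    reader = csv.reader(f)
    next(reader)
    conn.executemany("INSERT INTO contacts VALUES (?,?,?,?,?)", reader)

# The equivalent of the Find Duplicates wizard (step 4): phone numbers
# that occur more than once, with their counts.
dupes = conn.execute("""
    SELECT phone, COUNT(*) AS n
    FROM contacts
    GROUP BY phone
    HAVING COUNT(*) > 1
""").fetchall()
print dupes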
