I have a large data set which I know contains many dupicate records. Basically I have data on first name, last name, different address components and date of birth.
I think the best way to do this is to use the name and date of birth as chances are if these things match, it's the same person. There are probably lots of instances where there are slight differences in spelling (like typos missing a single letter) or use of 开发者_C百科name (ie: some might have a middle initial in first name column) which would be good to account for, but I'm not sure how to approach this.
Are there any tools or articles on going about this process? The data is all in a MySQL database and I have a basic proficiency in SQL.
You could get a sense of how much dedupe you have to do by something like:
select birthDate,last_name,soundex(first_name),count(*)
from table
group by birthDate,last_name,soundex(first_name)
having count(*) >1
This will list the people with the same birthdate, last_name, and similar first names. Soundex() isn't great, but this could help you get a sense of amount of deduping.
This query below would allow you get the alphabetical first first_name from the table of similar named people. Hopefully this will give you some rough starting ideas//
select birthDate,last_name,soundex(first_name),min(first_name)
from table
group by birthDate,last_name,soundex(first_name)
having count(*) >1
With the second query, you could remove all occurrences of additional names, by using a DELETE where name not in, but that assumes you are willing to keep the lowest first_name and remove the rest...
精彩评论