I have an interesting problem, and my logic isn't up to the task.
We have a table with that sometimes develops duplicate records (for process reasons, and this is unavoidable). Take the following example:
id FirstName LastName PhoneNumber email
-- --------- -------- ------------ --------------
1 John Doe 123-555-1234 jdoe@gmail.com
2 Jane Smith 123-555-1111 jsmith@foo.com
3 John Doe 123-555-4321 jdoe@yahoo.com
4 Bob Jones 123-555-5555 bob@bar.com
5 John Doe 123-555-0000 jdoe@hotmail.com
6 Mike Roberts 123-555-9999 roberts@baz.com
7 John Doe 123-555-1717 wally@domain.com
We find the duplicates this way:
SELECT c1.*
FROM `clients` c1
INNER JOIN (
SELECT `FirstName`, `LastName`, COUNT(*)
FROM `clients`
GROUP BY `FirstName`, `LastName`
HAVING COUNT(*) > 1
) AS c2
ON c1.`FirstName` = c2.`FirstName`
AND c1.`LastName` = c2.`LastName`
This generates the following list of duplicates:
id FirstName LastName PhoneNumber email
-- --------- -------- ------------ --------------
1 John Doe 123-555-1234 jdoe@gmail.com
3 John Doe 123-555-4321 jdoe@yahoo.com
5 John Doe 123-555-0000 jdoe@hotmail.com
7 John Doe 123-555-1717 wally@domain.com
As you can see, based on FirstName
and LastName
, all of the records are duplicates.
At this point, we actually make a phone call to the client to clear up potential duplicates.
After doing so, we learn (for example) that records 1 and 3 are real duplicates, but records 5 and 7 are actually two different people altogether.
So we merge any extraneously linked data from records 1 and 3 into record 1, remove record 3, and leave records 5 and 7 alone.
Now here's were the problem comes in:
The next time we re-run the "duplicates" query, it will contain the following rows:
id FirstName LastName PhoneNumber email
-- --------- -------- ------------ --------------
1 John Doe 123-555-4321 jdoe@gmail.com
5 John Doe 123-555-0开发者_运维百科000 jdoe@hotmail.com
7 John Doe 123-555-1717 wally@domain.com
They all appear to be duplicates, even though we've previously recognized that they aren't.
How would you go about identifying that these records aren't duplicates?
My first though it to build a lookup table identifying which records aren't duplicates of each other (for example, {1,5},{1,7},{5,7}), but I have no idea how to build a query that would be able to use this data.
Further, if another duplicate record shows up, it may be a duplicate of 1, 5, or 7, so we would need them all to show back up in the duplicates list so the customer service person can call the person in the new record to find out which record he may be a duplicate of.
I'm stretched to the limit trying to understand this. Any brilliant geniuses out there that would care to take a crack at this?
Interesting problem. Here's my crack at it.
How about if we approach the problem from a slightly different perspective.
Consider that the system is clean for a start i.e all records currently in the system are either with Unique First + Last name combinations OR the same first + last name ones have already been manually confirmed to be different people.
At the point of entering a NEW user in the system, we have an additional check. Can be implemented as an INSERT Trigger or just another procedure called after the insert is successfully done.
- This Trigger / Procedure matches the FIRST + LAST name combination of "Inserted"record with all existing records in the table.
- For all the matching First + Last names, it will create an entry in a matching table (new table) with NewUserID, ExistingMatchingRecordsUserID
From an SQL perspective,
TABLE MatchingTable
COLUMNS 1. NewUserID 2. ExistingUserID
Constraint : Logical PK = NewUserID + ExistingMatchingRecordsUserID
INSERT INTO MATCHINGTABLE VALUES ('NewUserId', userId)
SELECT userId FROM User u where u.firstName = 'John' and u.LastName = 'Doe'
All entries in MatchingTable need resolution.
When say an Admin logs into the system, the admin sees the list of all entries in MatchingTable
eg: New User John Doe - (ID 345) - 3 Potential matches John Doe - ID 123 ID 231 / ID 256
The admin will check up data for 345 against data in 123 / 231 and 256 and manually confirm if duplicate of ANY / None If Duplicate, 345 is deleted from User Table (soft / hard delete - whatever suits you) If NOT, the entries for ID 354 are just removed from MatchingTable (i would go with hard deletes here as this is like a transactional temp table but again anything is fine).
Additionally, when entries for ID 354 are removed from MatchingTable, all other entries in MatchingTable where ExistingMatchingRecordsUserID = 354 are automatically removed to ensure that unnecessary manual verification for already verified data is not needed.
Again, this could be a potential DELETE trigger / Just logic executed additionally on DELETE of MatchingTable. The implementation is subject to preference.
At the expense of adding a single byte per row to your table, you could add a manually_verified BOOL
column, with a default of FALSE
. Set it to TRUE
if you have manually verified the data. Then you can simply query where manually_verified = FALSE
.
It's simple, effective, and matches what is actually happening in the business processes: you manually verify the data.
If you want to go a step further, you might want to store when the row was verified and who verified it. Since this might be annoying to store in the main table, you could certainly store it in a separate table, and LEFT JOIN
in the verification data. You could even create a view to recreate the appearance of a single master table.
To solve the problem of a new duplicate being added: you would check non-verified data against the entire data set. So that means your main table, c1
, would have the condition manually_verified = FALSE
, but your INNER JOIN
ed table, c2
, does not. This way, the unverified data will still find all potential duplicate matches:
SELECT * FROM table t1
INNER JOIN table t2 ON t1.name = t2.name AND t1.id <> t2.id
WHERE t1.manually_verified = FALSE
The possible matches for the duplicates will be in the joined table.
精彩评论