I am about to remove duplicates from my database using
delete from table
where id not in (
select min(id)
from table
group by foreign_key);
However, I would like to do so with the following conditions:
- if any of the duplicate rows have a value in
fieldA
orfieldB
- if any of the duplicates have a value for
fieldA
orfieldB
, but there is only one unique value in each field, keep that value - if there is more than one unique value in each row, report this information along with the
id
andforeign_key
so that the value can be fixed manually.
- if any of the duplicates have a value for
By default, fieldA
and fieldB
are NULL
, but data have been entered in these fields in some cases.
Here is some sample data:
| id | foreign_key | fieldA | fieldB |
|----+-------------+--------+--------|
| 1 | 1 | NULL | NULL |
| 2 | 1 | A1 | B1 |
| 3 | 1 | N开发者_开发技巧ULL | NULL |
| 4 | 2 | A2 | B2 |
| 5 | 2 | A3 | B2 |
| 6 | 3 | NULL | NULL |
| 7 | 4 | A4 | B4 |
| 8 | 5 | A5 | NULL |
| 9 | 5 | NULL | B5 |
| 10 | 6 | A6 | B6 |
| 11 | 6 | A7 | B6 |
| 12 | 7 | NULL | B7 |
| 13 | 7 | NULL | B7 |
What I want to keep is:
| id | foreign_key | fieldA | fieldB |
|----+-------------+--------+--------|
| 2 | 1 | A1 | B1 |
| 4 | 2 | NULL | B2 |
| 6 | 3 | NULL | NULL |
| 7 | 4 | A4 | B4 |
| 8 | 5 | A5 | B5 |
| 10 | 6 | NULL | B6 |
| 12 | 7 | NULL | B7 |
And I would like this information to be returned:
foreign_key 2 has two distinct values of fieldA: A2 and A3
I've got to run of right now, but here's a query to start with:
SELECT id, foreign_key,
group_concat(DISTINCT fieldA) as A, count(DISTINCT fieldA) as `#A`,
group_concat(DISTINCT fieldB) as B, count(DISTINCT fieldB) as `#B`
FROM t1
GROUP BY foreign_key
;
On the test data, this returns:
| id | foreign_key | A | #A | B | #B | +----+-------------+-------+----+------+----+ | 1 | 1 | A1 | 1 | B1 | 1 | | 4 | 2 | A2,A3 | 2 | B2 | 1 | | 6 | 3 | NULL | 0 | NULL | 0 | | 7 | 4 | A4 | 1 | B4 | 1 | | 8 | 5 | A5 | 1 | B5 | 1 | | 10 | 6 | A6,A7 | 2 | B6 | 1 | | 12 | 7 | NULL | 0 | B7 | 1 |
Query for rows to keep:
SELECT id, foreign_key,
group_concat(DISTINCT fieldA) as A, count(DISTINCT fieldA) as `#A`,
group_concat(DISTINCT fieldB) as B, count(DISTINCT fieldB) as `#B`
FROM t1
GROUP BY foreign_key
HAVING `#A` < 2 AND `#B` < 2
;
Query for rows that need operator intervention:
SELECT id, foreign_key,
group_concat(DISTINCT fieldA) as A, count(DISTINCT fieldA) as `#A`,
group_concat(DISTINCT fieldB) as B, count(DISTINCT fieldB) as `#B`
FROM t1
GROUP BY foreign_key
HAVING `#A` >= 2 OR `#B` >= 2
;
GROUP_CONCAT
may not be suitable, depending on the format of data stored in the columns. In combination with #A and #B, however, you could detect when it's not suitable, so it shouldn't be a big problem. It may also have too big a performance impact, but I can't think of another aggregate function that could be used in the same way (a GROUP_COALESCE
would be nice).
精彩评论