开发者

Find differences between rows with duplicate(s) in one field

开发者 https://www.devze.com 2023-02-12 05:08 出处:网络
I am about to remove duplicates from my database using delete from table where id not in ( select min(id)

I am about to remove duplicates from my database using

delete from table 
  where id not in (
    select min(id) 
      from table 
      group by foreign_key);

However, I would like to do so with the following conditions:

  • if any of the duplicate rows have a value in fieldA or fieldB
    • if any of the duplicates have a value for fieldA or fieldB, but there is only one unique value in each field, keep that value
    • if there is more than one unique value in each row, report this information along with the id and foreign_key so that the value can be fixed manually.

By default, fieldA and fieldB are NULL, but data have been entered in these fields in some cases.

Here is some sample data:

| id | foreign_key | fieldA | fieldB |
|----+-------------+--------+--------|
|  1 |           1 | NULL   | NULL   |
|  2 |           1 | A1     | B1     |
|  3 |           1 | N开发者_开发技巧ULL   | NULL   |
|  4 |           2 | A2     | B2     |
|  5 |           2 | A3     | B2     |
|  6 |           3 | NULL   | NULL   |
|  7 |           4 | A4     | B4     |
|  8 |           5 | A5     | NULL   |
|  9 |           5 | NULL   | B5     |
| 10 |           6 | A6     | B6     |
| 11 |           6 | A7     | B6     |
| 12 |           7 | NULL   | B7     |
| 13 |           7 | NULL   | B7     |

What I want to keep is:

| id | foreign_key | fieldA | fieldB |
|----+-------------+--------+--------|
|  2 |           1 | A1     | B1     |
|  4 |           2 | NULL   | B2     |
|  6 |           3 | NULL   | NULL   |
|  7 |           4 | A4     | B4     |
|  8 |           5 | A5     | B5     |
| 10 |           6 | NULL   | B6     |
| 12 |           7 | NULL   | B7     |

And I would like this information to be returned:

foreign_key 2 has two distinct values of fieldA: A2 and A3


I've got to run of right now, but here's a query to start with:

SELECT id, foreign_key, 
    group_concat(DISTINCT fieldA) as A, count(DISTINCT fieldA) as `#A`,
    group_concat(DISTINCT fieldB) as B, count(DISTINCT fieldB) as `#B`
  FROM t1
  GROUP BY foreign_key
;

On the test data, this returns:

| id | foreign_key | A     | #A | B    | #B |
+----+-------------+-------+----+------+----+
|  1 |           1 | A1    |  1 | B1   |  1 |
|  4 |           2 | A2,A3 |  2 | B2   |  1 |
|  6 |           3 | NULL  |  0 | NULL |  0 |
|  7 |           4 | A4    |  1 | B4   |  1 |
|  8 |           5 | A5    |  1 | B5   |  1 |
| 10 |           6 | A6,A7 |  2 | B6   |  1 |
| 12 |           7 | NULL  |  0 | B7   |  1 |

Query for rows to keep:

SELECT id, foreign_key, 
    group_concat(DISTINCT fieldA) as A, count(DISTINCT fieldA) as `#A`, 
    group_concat(DISTINCT fieldB) as B, count(DISTINCT fieldB) as `#B`
  FROM t1
  GROUP BY foreign_key
  HAVING `#A` < 2 AND `#B` < 2
;

Query for rows that need operator intervention:

SELECT id, foreign_key, 
    group_concat(DISTINCT fieldA) as A, count(DISTINCT fieldA) as `#A`, 
    group_concat(DISTINCT fieldB) as B, count(DISTINCT fieldB) as `#B`
  FROM t1
  GROUP BY foreign_key
  HAVING `#A` >= 2 OR `#B` >= 2
;

GROUP_CONCAT may not be suitable, depending on the format of data stored in the columns. In combination with #A and #B, however, you could detect when it's not suitable, so it shouldn't be a big problem. It may also have too big a performance impact, but I can't think of another aggregate function that could be used in the same way (a GROUP_COALESCE would be nice).

0

精彩评论

暂无评论...
验证码 换一张
取 消

关注公众号