Removing duplicate SQL records to permit a unique key_问答_开发者

I have a table ('sales') in a MYSQL DB which should rightfully have had a unique constraint enforced to prevent duplicates. To first remove the dupes and set the constraint is proving a bit tricky.

Table structure (simplified):

'id (unique, autoinc)'
product_id

The goal is to enforce uniqueness for product_id. The de-duping policy I want to apply is to remove all duplicate records except the most recently created, eg: the highest id.

Or to put another way, I would like to delete only duplicate records, excluding the ids matched by the following query whilst also preserving the existing non-duped records:

select id 
  from sales s  
inner join (select product_id, 
                   max(id) as maxId 
              from sales 
          group by product_id 
            having count(product_id)  > 1) groupedByProdId on s.product_id 
                                                          and s.id = groupedByProdId.maxId

I've struggled with this on two fronts - writing the query to select the correct records to delete and then also the constraint in MYSQL where开发者_StackOverflow a subselect FROM clause of a DELETE cannot reference the same table from which data is being removed.

I checked out this answer and it seemed to deal with the subject, but seem specific to sql-server, though I wouldn't rule this question out from duplicating another.

In reply to your comment, here's a query that works in MySQL:

delete YourTable
from YourTable
inner join YourTable yt2
on YourTable.product_id = yt2.product_id
and YourTable.id < yt2.id

This would only remove duplicate rows. The inner join will filter out the latest row for each product, even if no other rows for the same product exist.

P.S. If you try to alias the table after FROM, MySQL requires you to specify the name of the database, like:

delete <DatabaseName>.yt
from YourTable yt
inner join YourTable yt2
on yt.product_id = yt2.product_id
and yt.id < yt2.id;

Perhaps use ALTER IGNORE TABLE ... ADD UNIQUE KEY. For example:

describe sales;
+------------+---------+------+-----+---------+----------------+
| Field      | Type    | Null | Key | Default | Extra          |
+------------+---------+------+-----+---------+----------------+
| id         | int(11) | NO   | PRI | NULL    | auto_increment | 
| product_id | int(11) | NO   |     | NULL    |                | 
+------------+---------+------+-----+---------+----------------+

select * from sales;
+----+------------+
| id | product_id |
+----+------------+
|  1 |          1 | 
|  2 |          1 | 
|  3 |          2 | 
|  4 |          3 | 
|  5 |          3 | 
|  6 |          2 | 
+----+------------+

ALTER IGNORE TABLE sales ADD UNIQUE KEY idx1(product_id), ORDER BY id DESC; 
Query OK, 6 rows affected (0.03 sec)
Records: 6  Duplicates: 3  Warnings: 0


select * from sales;
+----+------------+
| id | product_id |
+----+------------+
|  6 |          2 | 
|  5 |          3 | 
|  2 |          1 | 
+----+------------+

See this pythian post for more information.

Note that the ids end up in reverse order. I don't think this matters, since order of the ids should not matter in a database (as far as I know!). If this displeases you however, the post linked to above shows a way to solve this problem too. However, it involves creating a temporary table which requires more hard drive space than the in-place method I posted above.

I might do the following in sql-server to eliminate the duplicates:

DELETE FROM Sales
FROM Sales
    INNER JOIN Sales b ON Sales.product_id = b.product_id AND Sales.id < b.id

It looks like the analogous delete statement for mysql might be:

DELETE FROM Sales 
USING Sales
    INNER JOIN Sales b ON Sales.product_id = b.product_id AND Sales.id < b.id

This type of problem is easier to solve with CTEs and Ranking functions, however, you should be able to do something like the following to solve your problem:

Delete Sales
Where Exists(
            Select 1
            From Sales As S2
            Where S2.product_id = Sales.product_id
                And S2.id > Sales.Id
            Having Count(*) > 0
            )