TSQL: Remove duplicates based on max(date)

I am looking for a query that, for each id, finds the row with the maximum date (a datetime column) and keeps its id and row_id. The goal is to DELETE all other rows from the source table.

Source Data

id     date         row_id(unique)
1      11/11/2009    1
1      12/11/2009    2
1      13/11/2009    3
2      1/11/2009     4

Expected Survivors

1      13/11/2009    3
2      1/11/2009     4

What query would I need to achieve the results I am looking for?

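Tested on PostgreSQL (T-SQL does not accept the row-value form (id, date) NOT IN (...), so this will not run on SQL Server as written):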

delete from table where (id, date) not in (select id, max(date) from table group by id);

There are various ways of doing this, but the basic idea is the same:
- Identify the rows you want to keep
- Compare each row in your table to the ones you want to keep
- Delete any that don't match

DELETE
   [source]
FROM
   yourTable    AS [source]
LEFT JOIN
   yourTable    AS [keep]
      ON  [keep].row_id = [source].row_id
      AND [keep].date = (SELECT MAX(date) FROM yourTable WHERE id = [keep].id)
WHERE
   [keep].id IS NULL

DELETE
   [yourTable]
FROM
   [yourTable]
LEFT JOIN
(
   SELECT id, MAX(date) AS date FROM yourTable GROUP BY id
)
   AS [keep]
      ON  [keep].id   = [yourTable].id
      AND [keep].date = [yourTable].date
WHERE
   [keep].id IS NULL

DELETE
   [source]
FROM
   yourTable    AS [source]
WHERE
   [source].row_id != (SELECT TOP 1 row_id FROM yourTable WHERE id = [source].id ORDER BY date DESC)

DELETE
   [source]
FROM
   yourTable    AS [source]
WHERE
   NOT EXISTS (SELECT id FROM yourTable GROUP BY id HAVING id = [source].id AND MAX(date) = [source].date)

Because you are using SQL Server 2000, you're not able to use the ROW_NUMBER() OVER technique of numbering the rows to identify the top row for each unique id.

So, your proposed technique is to use the datetime column to pick the top row for each id and remove the rest. That might work, but if two rows for the same id share the same datetime value, duplicates will remain. That's easy enough to check for.

First check the assumption that all rows are unique based on the id and date columns:

CREATE TABLE #TestTable (rowid INT IDENTITY(1,1), thisid INT, thisdate DATETIME)
INSERT INTO #TestTable (thisid,thisdate) VALUES  (1, '11/11/2009')
INSERT INTO #TestTable (thisid,thisdate) VALUES  (1, '12/11/2009')
INSERT INTO #TestTable (thisid,thisdate) VALUES  (1, '12/12/2009')
INSERT INTO #TestTable (thisid,thisdate) VALUES  (2, '1/11/2009')
INSERT INTO #TestTable (thisid,thisdate) VALUES  (2, '1/11/2009')

SELECT COUNT(*) AS thiscount
FROM #TestTable
GROUP BY thisid, thisdate
HAVING COUNT(*) > 1

This example returns one row with a count of 2, indicating that you will still end up with duplicates even after using the date column to remove them. If the query returns no rows, then you have proven that your proposed technique will work.
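
If the check does return rows, one option that still works on SQL Server 2000 is to break ties on the unique row id. The following is a minimal sketch against the #TestTable sample above, under the assumption that it does not matter which of the tied rows survives; it simply keeps the one with the highest rowid.

```sql
-- Sketch: keep exactly one row per thisid, the one with the latest thisdate.
-- Ties on thisdate are broken by the highest rowid (an assumed, arbitrary rule).
DELETE src
FROM #TestTable AS src
WHERE src.rowid <> (
    SELECT TOP 1 t.rowid
    FROM #TestTable AS t
    WHERE t.thisid = src.thisid
    ORDER BY t.thisdate DESC, t.rowid DESC
)
```

On the five sample rows this should leave rowid 3 for thisid 1 and rowid 5 for thisid 2.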

When de-duping production data, I think one should take some precautions and test before and after. You should create a table to hold the rows you plan to remove so you can recover them easily if you need to after the delete statement has been executed.
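
A minimal sketch of that precaution, assuming the table and column names used in the question (yourTable, id, date, row_id). The holding table name yourTable_removed is hypothetical. It copies the rows that the max(date) rule would delete into a new table before anything is removed.

```sql
-- Sketch: save the rows that are NOT the latest for their id.
-- yourTable_removed is a hypothetical name and must not exist yet,
-- because SELECT ... INTO creates it.
SELECT src.id, src.date, src.row_id
INTO yourTable_removed
FROM yourTable AS src
LEFT JOIN (
    SELECT id, MAX(date) AS date
    FROM yourTable
    GROUP BY id
) AS keep
    ON  keep.id   = src.id
    AND keep.date = src.date
WHERE keep.id IS NULL

-- The row count of the holding table is the number of rows the delete should remove.
SELECT COUNT(*) AS rows_saved FROM yourTable_removed
```

If the delete has to be undone, the saved rows can be inserted back from the holding table. If row_id is an IDENTITY column, that insert would need SET IDENTITY_INSERT yourTable ON first.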

Also, it's a good idea to know beforehand how many rows you plan to remove so you can verify the count before and after - and you can gauge the magnitude of the delete operation. Based on how many rows will be affected, you can plan when to run the operation.

To test before the de-duping process, find the occurrences.

-- Get occurrences of duplicates
SELECT thisid, COUNT(*) AS thiscount
FROM #TestTable
GROUP BY thisid
HAVING COUNT(*) > 1
ORDER BY thisid

That returns one row for each id that occurs more than once. Capture the rows from this query into a temporary table and then run a query using SUM to get the total number of rows that are not unique based on your key.

To get the number of rows you plan to delete, you need the count of rows that are duplicate based on your unique key, and the number of distinct rows based on your unique key. You subtract the distinct rows from the count of occurrences. All that is pretty straightforward - so I'll leave you to it.
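
The arithmetic described above could be sketched as follows, using the #TestTable sample. The temporary table name #DupCounts is hypothetical, and the result assumes exactly one survivor per id, which only holds if ties on the date are also broken.

```sql
-- Sketch: capture the duplicate occurrences per key.
-- #DupCounts is a hypothetical temporary table name.
SELECT thisid, COUNT(*) AS thiscount
INTO #DupCounts
FROM #TestTable
GROUP BY thisid
HAVING COUNT(*) > 1

-- Rows to delete = total rows under duplicated keys
--                  minus one survivor per duplicated key.
SELECT SUM(thiscount) - COUNT(*) AS rows_to_delete
FROM #DupCounts
```

For the original five sample rows, the duplicated keys account for 3 + 2 = 5 rows across 2 distinct ids, so rows_to_delete comes out as 3.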

Try this

declare @t table (id int, dt DATETIME,rowid INT IDENTITY(1,1))
INSERT INTO @t (id,dt) VALUES  (1, '11/11/2009')
INSERT INTO @t (id,dt) VALUES  (1, '11/12/2009')
INSERT INTO @t (id,dt) VALUES  (1, '11/13/2009')
INSERT INTO @t (id,dt) VALUES  (2, '11/01/2009')

Query:

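delete from @t where rowid not in (
select t.rowid from @t t
inner join (
-- match on id as well as date, so a row is only kept if it is the latest for its own id
select id, MAX(dt) as maxdate
from @t
group by id) X
on t.id = X.id and t.dt = X.maxdate)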

select * from @t

Output:

id dt rowid
1 2009-11-13 00:00:00.000 3
2 2009-11-01 00:00:00.000 4

delete from temp where row_id not in (
        select t.row_id from temp t
        inner join
        (select id, MAX(dt) as dt from temp group by id) d
        on t.dt = d.dt and t.id = d.id)

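I have tested this answer. Note that the CTE and DENSE_RANK() require SQL Server 2005 or later.

CREATE TABLE #t (id INT, dt DATETIME, rowid INT IDENTITY(1,1))
INSERT INTO #t (id,dt) VALUES  (1, '11/11/2009')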
INSERT INTO #t (id,dt) VALUES  (1, '11/12/2009')
INSERT INTO #t (id,dt) VALUES  (1, '11/13/2009')
INSERT INTO #t (id,dt) VALUES  (2, '11/01/2009')
select * from #t

;WITH T AS (
    SELECT DENSE_RANK() OVER (PARTITION BY id ORDER BY dt DESC) AS NO, dt, id, rowid
    FROM #t
)
DELETE T WHERE NO > 1