I am searching for a query to select the maximum date (a datetime
column) and keep its id
and row_id
. The desire is to DELETE
the rows in the source table.
Source Data
id date row_id(unique)
1 11/11/2009 1
1 12/11/2009 2
1 13/11/2009 3
2 1/11/2009 4
Expected Survivors
1 13/11/2009 3
2 1/11/2009 4
What query would I开发者_JAVA百科 need to achieve the results I am looking for?
Tested on PostgreSQL:
delete from table where (id, date) not in (select id, max(date) from table group by id);
There are various ways of doing this, but the basic idea is the same:
- Indentify the rows you want to keep
- Compare each row in your table to the ones you want to keep
- Delete any that don't match
DELETE
[source]
FROM
yourTable AS [source]
LEFT JOIN
yourTable AS [keep]
ON [keep].id = [source].id
AND [keep].date = (SELECT MAX(date) FROM yourTable WHERE id = [keep].id)
WHERE
[keep].id IS NULL
DELETE
[yourTable]
FROM
[yourTable]
LEFT JOIN
(
SELECT id, MAX(date) AS date FROM yourTable GROUP BY id
)
AS [keep]
ON [keep].id = [yourTable].id
AND [keep].date = [yourTable].date
WHERE
[keep].id IS NULL
DELETE
[source]
FROM
yourTable AS [source]
WHERE
[source].row_id != (SELECT TOP 1 row_id FROM yourTable WHERE id = [source].id ORDER BY date DESC)
DELETE
[source]
FROM
yourTable AS [source]
WHERE
NOT EXISTS (SELECT id FROM yourTable GROUP BY id HAVING id = [source].id AND MAX(date) != [source].date)
Because you are using SQL Server 2000, you'er not able to use the Row Over technique of setting up a sequence and to identify the top row for each unique id.
So, your proposed technique is to use a datetime column to get the top 1 row to remove duplicates. That might work, but there is a possibility that you might still get duplicates having the same datetime value. But that's easy enough to check for.
First check the assumption that all rows are unique based on the id and date columns:
CREATE TABLE #TestTable (rowid INT IDENTITY(1,1), thisid INT, thisdate DATETIME)
INSERT INTO #TestTable (thisid,thisdate) VALUES (1, '11/11/2009')
INSERT INTO #TestTable (thisid,thisdate) VALUES (1, '12/11/2009')
INSERT INTO #TestTable (thisid,thisdate) VALUES (1, '12/12/2009')
INSERT INTO #TestTable (thisid,thisdate) VALUES (2, '1/11/2009')
INSERT INTO #TestTable (thisid,thisdate) VALUES (2, '1/11/2009')
SELECT COUNT(*) AS thiscount
FROM #TestTable
GROUP BY thisid, thisdate
HAVING COUNT(*) > 1
This example returns a value of 2 - indicating that you will still end up with duplicates even after using the date column to remove duplicates. If you return 0, then you have proven that your proposed technique will work.
When de-duping production data, I think one should take some precautions and test before and after. You should create a table to hold the rows you plan to remove so you can recover them easily if you need to after the delete statement has been executed.
Also, it's a good idea to know beforehand how many rows you plan to remove so you can verify the count before and after - and you can gauge the magnitude of the delete operation. Based on how many rows will be affected, you can plan when to run the operation.
To test before the de-duping process, find the occurrences.
-- Get occurrences of duplicates
SELECT COUNT(*) AS thiscount
FROM
#TestTable
GROUP BY thisid
HAVING COUNT(*) > 1
ORDER BY thisid
That gives you the rows with more than one row with the same id. Capture the rows from this query into a temporary table and then run a query using the SUM to get the total number of rows that are not unique based on your key.
To get the number of rows you plan to delete, you need the count of rows that are duplicate based on your unique key, and the number of distinct rows based on your unique key. You subtract the distinct rows from the count of occurrences. All that is pretty straightforward - so I'll leave you to it.
Try this
declare @t table (id int, dt DATETIME,rowid INT IDENTITY(1,1))
INSERT INTO @t (id,dt) VALUES (1, '11/11/2009')
INSERT INTO @t (id,dt) VALUES (1, '11/12/2009')
INSERT INTO @t (id,dt) VALUES (1, '11/13/2009')
INSERT INTO @t (id,dt) VALUES (2, '11/01/2009')
Query:
delete from @t where rowid not in(
select t.rowid from @t t
inner join(
select MAX(dt)maxdate
from @t
group by id) X
on t.dt = X.maxdate )
select * from @t
Output:
id dt rowid
1 2009-11-13 00:00:00.000 3
2 2009-11-01 00:00:00.000 4
delete from temp where row_id not in (
select t.row_id from temp t
right join
(select id,MAX(dt) as dt from temp group by id) d
on t.dt = d.dt and t.id = d.id)
I have tested this answer..
INSERT INTO #t (id,dt) VALUES (1, '11/11/2009')
INSERT INTO #t (id,dt) VALUES (1, '11/12/2009')
INSERT INTO #t (id,dt) VALUES (1, '11/13/2009')
INSERT INTO #t (id,dt) VALUES (2, '11/01/2009')
select * from #t
;WITH T AS(
select dense_rank() over(partition by id order by dt desc)NO,DT,ID,rowid from #t )
DELETE T WHERE NO>1
精彩评论