开发者

How to Delete Duplicate Rows in SQL 2000?

开发者 https://www.devze.com 2022-12-28 05:15 出处:网络
I thought I had this figured out but it turns out I\'m just deleting the first record. The following returns the duplicate rows. All have开发者_JAVA百科 a count of 2. I just want to delete the first o

I thought I had this figured out but it turns out I'm just deleting the first record. The following returns the duplicate rows. All have开发者_JAVA百科 a count of 2. I just want to delete the first one for each duplicate record.

select scorestudentid, scoreadvisor, scorecorrect, count(*) 
from scores
where scoretestid = 3284
group by scorestudentid, scoreadvisor, scorecorrect
having count(scorestudentid) > 1

Which returns:

scorestudentid  scoreadvisor  scorecorrect  no column name
13033719        28059     3.0           2
13033777        28086     3.0           2
13033826        28147     3.0           2
13033960        28023     3.0           2

So I put this together thinking it would work:

set rowcount 1
delete
from scores
where scoretestid = 3284 
and scorestudentid in (
    select scorestudentid
    from scores
    where scoretestid = 3284
    group by scorestudentid
    having count(scorestudentid) > 1)

It really seems like it should be a simple concept but I'm not getting it.

Based on Thomas script I updated the query to fit but it still doesn't work.

Delete Scores
Where Exists    (
                Select 1
                From Scores As S2
                Where S2.ScoreStudentId = Scores.ScoreStudentId
                        And S2.ScoreAdvisor = Scores.ScoreAdvisor
                        And S2.ScoreCorrect = Scores.ScoreCorrect
                Group By S2.ScoreStudentId, S2.ScoreAdvisor, S2.ScoreCorrect
                Having Count(*) > 1
                    And Min(S2.NewScoreID) = Scores.NewScoreID
                )
    And Scores.ScoreTestId = 3284


The trick is using the Primary Key column (you do have one, correct?) and simply finding the first PK value that matches the criteria you want. If for some crazy reason you do not have a primary key column, then add an Identity column and make it the primary key and then do the delete.

EDIT Revised to make it more generic. If you remove the final filter on ScoreTest, it will remove all duplicates based on ScoreStudentId, ScoreAdvisor and ScoreCorrect.

Delete Scores
Where Exists    (
                Select 1
                From Scores As S2
                Where S2.ScoreStudentId = Scores.ScoresStudentId
                        And S2.ScoreAdvisor = Scores.ScoreAdvisor
                        And S2.ScoreCorrect = Scores.ScoreCorrect
                Group By S2.ScoreStudentId, S2.ScoreAdvisor, S2.ScoreCorrect
                Having Count(*) > 1
                    And Min(S2.PrimaryKeyColumn) = Scores.PrimaryKeyColumn
                )
    And Scores.ScoreTest = 3284


I believe Thomas' solution doesn't work with uniqueidentifiers for primary keys. Also, if one record is duplicated multiple times (i.e. 3, 4, 5+ times) in the table, it will only remove one.

This is what we used:

declare @col1 uniqueidentifier declare @col2 varchar(256) declare @col3 datetime

DECLARE C CURSOR
FOR

            select col1, col2, col3
            from MyTable
            where IsDeleted = 0
            group by col1, col2, col3
            having count(*) > 1
OPEN    C
FETCH NEXT FROM C
INTO    @col1, @col2, @col3

WHILE @@FETCH_STATUS = 0
BEGIN

declare @primaryKey uniqueidentifier
set @primaryKey = (select top 1 primaryKey from MyTable
                            where col1 = @col1 and col2= @col2 and col3 = @col3)

update MyTable
set IsDeleted = 1, DeleteDt = GETDATE()
where col1 = @col1
    and col2 = @col2
    and col3 = @col3
    and PrimaryKey<> @primaryKey


FETCH NEXT FROM C
INTO    @col1, @col2, @col3
END

CLOSE C
DEALLOCATE C

What this cursor does is:

  • select all rows that have duplicates
  • for each duplicate row-set:
  • get the primary key of one of the rows in the set
  • logically delete all the other rows in the row-set


I am going to talk about an interesting topic in SQL world. If you Google this topic you will find multiple ways of deleting duplicate data from the table. I am not going to write something very new but yes I will talk about performance issue while deleting duplicate data using traditional approach.

Deleting duplicate rows from SQL 2000:- I have created a table DuplicateData and inserted few duplicate rows based on EmpId.

Create table DuplicateData(EmpId int,Name varchar(100))-->Table Creation

insert into DuplicateData values(4,'Akshay')
insert into DuplicateData values(4,'Akshay')
insert into DuplicateData values(5,'ankit')
insert into DuplicateData values(3,'Vikas')
insert into DuplicateData values(3,'Vikas')
insert into DuplicateData values(3,'Vikas')
insert into DuplicateData values(3,'Vikas')
insert into DuplicateData values(2,'Raj')
insert into DuplicateData values(2,'Raj')
insert into DuplicateData values(1,'Neeraj')
insert into DuplicateData values(1,'Neeraj')

insert into DuplicateData values(1,'Neeraj')

Traditional way of deleting duplicate rows from table in SQL 2000 :-If we run the below batch in query analyzer, it will remove all the duplicate values from table DuplicateData. This query is "OK" if you are doing it in test environment or on dummy data. But if you have millions of record or large data, this query would be the worst query in terms of performance. It can take few hours or may be days depending on the volume of the data in the intended table.

Reason:- Below query is a correlated sub query which will execute for each EmpId present in the table and check if count for each EmpId is > 1 then delete each record one by one. That is the reason for its slow performance.

set rowcount 1
delete from DuplicateData where (select count(EmpId) from DuplicateData a where a.EmpId=DuplicateData.EmpId)>1
while @@rowcount>0
delete from DuplicateData where (select count(EmpId) from DuplicateData a where a.EmpId=DuplicateData.EmpId)>1

set rowcount 0

We can create a stored procedure to overcome this performance issue. Below is the example.

declare @tmp table(empid int,cnt int, rowid int identity)--> declare table variable

declare @maxcounter as integer--> Declaration of variables
declare @mincounter as integer
declare @rowcnt as integer
declare @empid as int-->End of Declaration

insert into @tmp(empid,cnt)-->Inserting duplicate empid along with no of duplicate entries
select empid,count(empid) from duplicatedata 
group by empid having count(empid)>1

select @mincounter=min(rowid),@maxcounter=max(rowid) from @tmp -->assigning minimum and maximum rowid to variables.

while @mincounter <=@maxcounter
begin
 select @rowcnt=cnt,@empid=empid from @tmp where rowid=@mincounter 
 set @rowcnt =@rowcnt-1
 set rowcount @rowcnt
 delete from duplicatedata where empid=@empid
 set rowcount 0
 set @mincounter=@mincounter +1
end

let's understand the above while loop, we have all duplicate records in @tmp table with no of duplicate entries. Now we will loop over each record present in @tmp table, for that reason we have assigned minimum and maximum rowid to variables(@maxcounter, @mincounter).

In While loop body, we are assigning "no of duplicate records" values to variable @rowcnt and empid to variable @empid

In the next statement we are setting @rowcnt=@rowcnt-1, we are doing this because this variable contains no of duplicate records for a particular empid, but we want to keep one empid from the duplicate ones. in next statement we are setting up the rowcount with one value lesser than the no of duplicate records for that particular empid.

Next statement reset the rowcount with 0 and last statement is increasing the @mincounter value to fetch next record from @tmp table.

0

精彩评论

暂无评论...
验证码 换一张
取 消