I thought I had this figured out, but it turns out I'm just deleting the first record. The following returns the duplicate rows, all with a count of 2. I just want to delete the first one for each duplicate record.
select scorestudentid, scoreadvisor, scorecorrect, count(*)
from scores
where scoretestid = 3284
group by scorestudentid, scoreadvisor, scorecorrect
having count(scorestudentid) > 1
Which returns:
scorestudentid  scoreadvisor  scorecorrect  (no column name)
13033719        28059         3.0           2
13033777        28086         3.0           2
13033826        28147         3.0           2
13033960        28023         3.0           2
So I put this together thinking it would work:
set rowcount 1
delete
from scores
where scoretestid = 3284
and scorestudentid in (
select scorestudentid
from scores
where scoretestid = 3284
group by scorestudentid
having count(scorestudentid) > 1)
It really seems like it should be a simple concept but I'm not getting it.
Based on Thomas's script I updated the query to fit, but it still doesn't work.
Delete Scores
Where Exists (
Select 1
From Scores As S2
Where S2.ScoreStudentId = Scores.ScoreStudentId
And S2.ScoreAdvisor = Scores.ScoreAdvisor
And S2.ScoreCorrect = Scores.ScoreCorrect
Group By S2.ScoreStudentId, S2.ScoreAdvisor, S2.ScoreCorrect
Having Count(*) > 1
And Min(S2.NewScoreID) = Scores.NewScoreID
)
And Scores.ScoreTestId = 3284
The trick is to use the primary key column (you do have one, correct?) and simply find the first PK value that matches the criteria you want. If for some reason you do not have a primary key column, add an identity column, make it the primary key, and then do the delete.
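If the table really has no key, the add-an-identity-column step might look like this (the column name `NewScoreID` matches the query above; the constraint name is illustrative):

```sql
-- Assumes Scores currently has no primary key.
ALTER TABLE Scores ADD NewScoreID int IDENTITY(1, 1) NOT NULL;
ALTER TABLE Scores ADD CONSTRAINT PK_Scores PRIMARY KEY (NewScoreID);
```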
EDIT: Revised to make it more generic. If you remove the final filter on ScoreTestId, it will remove all duplicates based on ScoreStudentId, ScoreAdvisor, and ScoreCorrect.
Delete Scores
Where Exists (
Select 1
From Scores As S2
Where S2.ScoreStudentId = Scores.ScoreStudentId
And S2.ScoreAdvisor = Scores.ScoreAdvisor
And S2.ScoreCorrect = Scores.ScoreCorrect
Group By S2.ScoreStudentId, S2.ScoreAdvisor, S2.ScoreCorrect
Having Count(*) > 1
And Min(S2.PrimaryKeyColumn) = Scores.PrimaryKeyColumn
)
And Scores.ScoreTestId = 3284
I believe Thomas's solution doesn't work when the primary key is a uniqueidentifier (MIN() is not defined for that type). Also, if a record is duplicated more than twice (3, 4, 5+ times) in the table, it will only remove one copy.
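For the multiple-duplicates case, one sketch (assuming an ordered integer key named `NewScoreID`, as in Thomas's query) keeps only the smallest key per group and deletes everything else in a single pass:

```sql
-- Removes every duplicate except the row with the smallest key in each
-- (ScoreStudentId, ScoreAdvisor, ScoreCorrect) group, even when a record
-- appears 3, 4, 5+ times.
Delete Scores
Where Scores.ScoreTestId = 3284
  And Scores.NewScoreID Not In (
      Select Min(S2.NewScoreID)
      From Scores As S2
      Where S2.ScoreTestId = 3284
      Group By S2.ScoreStudentId, S2.ScoreAdvisor, S2.ScoreCorrect
  )
```

Note this still relies on MIN() over the key, so it shares the uniqueidentifier limitation.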
This is what we used:
declare @col1 uniqueidentifier
declare @col2 varchar(256)
declare @col3 datetime
DECLARE C CURSOR FOR
    select col1, col2, col3
    from MyTable
    where IsDeleted = 0
    group by col1, col2, col3
    having count(*) > 1

OPEN C
FETCH NEXT FROM C INTO @col1, @col2, @col3

WHILE @@FETCH_STATUS = 0
BEGIN
    declare @primaryKey uniqueidentifier

    -- pick one live row in the set to keep; the IsDeleted = 0 filter
    -- ensures a previously deleted row is never chosen as the survivor
    set @primaryKey = (select top 1 PrimaryKey from MyTable
                       where col1 = @col1 and col2 = @col2 and col3 = @col3
                         and IsDeleted = 0)

    -- logically delete every other row in the set
    update MyTable
    set IsDeleted = 1, DeleteDt = GETDATE()
    where col1 = @col1
      and col2 = @col2
      and col3 = @col3
      and PrimaryKey <> @primaryKey

    FETCH NEXT FROM C INTO @col1, @col2, @col3
END

CLOSE C
DEALLOCATE C
What this cursor does is:
- select all rows that have duplicates
- for each duplicate row-set:
- get the primary key of one of the rows in the set
- logically delete all the other rows in the row-set
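On SQL Server 2005 and later, the same logical delete can be done without a cursor using ROW_NUMBER(); a sketch under the same assumptions (table `MyTable` with `PrimaryKey`, `IsDeleted`, and `DeleteDt` columns). Since ROW_NUMBER() only needs an ordering, not MIN(), this also works with uniqueidentifier keys:

```sql
-- Number the rows in each duplicate group; every row after the first
-- gets soft-deleted in a single statement.
;with Ranked as (
    select PrimaryKey,
           row_number() over (partition by col1, col2, col3
                              order by PrimaryKey) as rn
    from MyTable
    where IsDeleted = 0
)
update MyTable
set IsDeleted = 1, DeleteDt = GETDATE()
where PrimaryKey in (select PrimaryKey from Ranked where rn > 1)
```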
I am going to talk about an interesting topic in the SQL world. If you Google it, you will find multiple ways of deleting duplicate data from a table. I am not going to write anything very new, but I will talk about the performance problem with deleting duplicate data using the traditional approach.
Deleting duplicate rows in SQL 2000: I have created a table DuplicateData and inserted a few duplicate rows based on EmpId.
Create table DuplicateData(EmpId int,Name varchar(100))-->Table Creation
insert into DuplicateData values(4,'Akshay')
insert into DuplicateData values(4,'Akshay')
insert into DuplicateData values(5,'ankit')
insert into DuplicateData values(3,'Vikas')
insert into DuplicateData values(3,'Vikas')
insert into DuplicateData values(3,'Vikas')
insert into DuplicateData values(3,'Vikas')
insert into DuplicateData values(2,'Raj')
insert into DuplicateData values(2,'Raj')
insert into DuplicateData values(1,'Neeraj')
insert into DuplicateData values(1,'Neeraj')
insert into DuplicateData values(1,'Neeraj')
Traditional way of deleting duplicate rows from a table in SQL 2000: If we run the batch below in Query Analyzer, it will remove all the duplicate values from the table DuplicateData. This query is fine if you are running it in a test environment or on dummy data. But if you have millions of records, it is one of the worst queries in terms of performance: it can take a few hours, or even days, depending on the volume of data in the table.
Reason: the query below contains a correlated subquery that executes for each row in the table, checking whether the count for that EmpId is greater than 1, and then deleting rows one at a time. That is the reason for its slow performance.
set rowcount 1
delete from DuplicateData
where (select count(EmpId) from DuplicateData a
       where a.EmpId = DuplicateData.EmpId) > 1
while @@rowcount > 0
    delete from DuplicateData
    where (select count(EmpId) from DuplicateData a
           where a.EmpId = DuplicateData.EmpId) > 1
set rowcount 0
We can overcome this performance issue with the batch below (which could be wrapped in a stored procedure).
declare @tmp table(empid int,cnt int, rowid int identity)--> declare table variable
declare @maxcounter as integer--> Declaration of variables
declare @mincounter as integer
declare @rowcnt as integer
declare @empid as int-->End of Declaration
insert into @tmp(empid,cnt)-->Inserting duplicate empid along with no of duplicate entries
select empid,count(empid) from duplicatedata
group by empid having count(empid)>1
select @mincounter=min(rowid),@maxcounter=max(rowid) from @tmp -->assigning minimum and maximum rowid to variables.
while @mincounter <=@maxcounter
begin
select @rowcnt=cnt,@empid=empid from @tmp where rowid=@mincounter
set @rowcnt =@rowcnt-1
set rowcount @rowcnt
delete from duplicatedata where empid=@empid
set rowcount 0
set @mincounter=@mincounter +1
end
Let's understand the above while loop. We have all the duplicate EmpIds in the @tmp table, along with the number of duplicate entries for each. To loop over each record in @tmp, we assigned the minimum and maximum rowid values to the variables @mincounter and @maxcounter.
In the while loop body, we assign the number of duplicate records to the variable @rowcnt and the EmpId to the variable @empid.
In the next statement we set @rowcnt = @rowcnt - 1. We do this because @rowcnt contains the number of duplicate records for a particular EmpId, but we want to keep one of them; setting rowcount to one less than the number of duplicates means the delete leaves exactly one row for that EmpId.
The next statement resets rowcount to 0, and the last statement increments @mincounter to fetch the next record from the @tmp table.
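To confirm the loop worked, a quick check that no EmpId still has more than one row:

```sql
-- Should return zero rows once the duplicates are gone.
select EmpId, count(*) as cnt
from DuplicateData
group by EmpId
having count(*) > 1
```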