SQL Query to eliminate similar entries_问答_开发者

I am working on a problem in SQL Server 2008

I have a table with six columns:

PK INT
dOne SmallINT
dTwo SmallINT
dThree SmallINT
dFour SmallINT
dFiveSmallINT
dSix SmallINT

The table contains around a million recrods. It's probably worth noting that value in column n+1 > value in column n i.e. 97, 98, 99, 120, 135. I am trying to eliminate all rows which have 5 DIGITS in common (ignoring the PK) i.e.:

76, 89, 99, 102, 155, 122
11, 89,开发者_JAVA百科 99, 102, 155, 122
89, 99, 102, 155, 122, 130

In this case the algorithm should start at the first row and delete the second and third rows because they contain 5 matching digits. The first row persists.

I have tried to brute force the solution but finding all the duplicates for only the first record takes upwards of 25 seconds meaning processing the whole table would take... way too long (this should be a repeatable process).

I am fairly new to SQL but this is what I have come up with (I have come up with a few solutions but none were adequate... this is the latest attempt):

(I won't include all the code but I will explain the method, I can paste more if it helps)

Save the digits of record n into variables. SELECT all records which have one digit in common with record n FROM largeTable.

Insert all selected digits into #oneMatch and include [matchingOne] with the digit that matched.

Select all records which have one digit in common with record n FROM the temp table WHERE 'digit in common' != [matching]. INSERT all selected digits into #twoMatch and include [matchingOne] AND [matchingTwo]...

Repeat until inserting into #fiveMatch. Delete #fiveMatch from largeTable and move to record n+1

I am having a problem implementing this solution. How can I assign the matching variable depending on the WHERE clause?

-- SELECT all records with ONE matching field:
INSERT INTO #oneMatch (ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix, mOne)
SELECT ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix
FROM dbo.BaseCombinationsExtended
WHERE  ( [dOne] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) **mOne = dOne?
      OR [dTwo] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) **mOne = dTwo?
      OR [dTwo] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) **mOne = dThree?
...
      OR [dSix] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) **mOne = dSix?
    )

I am able to 'fake' the above using six queries but that is too inefficient...

Sorry for the long description. Any help would be greatly appreciated (new solution or implementation of my attempt above) as this problem has been nagging at me for a while...

Unless I miss something this should produce the correct result.

declare @T table 
(
  PK INT identity primary key,
  dOne SmallINT,
  dTwo SmallINT,
  dThree SmallINT,
  dFour SmallINT,
  dFive SmallINT,
  dSix SmallINT
)

insert into @T values
(76, 89, 99, 102, 155, 122),
(11, 89, 99, 102, 155, 122),
(89, 99, 102, 155, 122, 130)

;with q1(PK, d1, d2, d3, d4, d5) as
(
  select PK, dTwo, dThree, dFour, dFive, dSix
  from @T
  union all
  select PK, dOne, dThree, dFour, dFive, dSix
  from @T
  union all
  select PK, dOne, dTwo, dFour, dFive, dSix
  from @T
  union all
  select PK, dOne, dTwo, dThree, dFive, dSix
  from @T
  union all
  select PK, dOne, dTwo, dThree, dFour, dSix
  from @T
  union all
  select PK, dOne, dTwo, dThree, dFour, dFive
  from @T
),
q2 as
(
  select PK,
         row_number() over(partition by d1, d2, d3, d4, d5 order by PK) as rn
  from q1
),
q3 as
(
  select PK
  from q2
  where rn = 1
  group by PK
  having count(*) = 6   
)
select T.*
from @T as T
  inner join q3 as Q
    on T.PK = Q.PK

I can't make any promises on performance, but you can try this. The first thing that I do is put the data into a more normalized structure.

CREATE TABLE dbo.Test_Sets_Normalized (my_id INT NOT NULL, c SMALLINT NOT NULL)
GO

INSERT INTO dbo.Test_Sets_Normalized (my_id, c)
SELECT my_id, c1 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c2 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c3 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c4 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c5 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c6 FROM dbo.Test_Sets
GO

SELECT DISTINCT
    T2.my_id
FROM
    (SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T1
INNER JOIN (SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T2 ON T2.my_id > T1.my_id
WHERE
    (
    SELECT
        COUNT(*)
    FROM
        dbo.Test_Sets_Normalized T3
    INNER JOIN dbo.Test_Sets_Normalized T4 ON
        T4.my_id = T2.my_id AND
        T4.c = T3.c
    WHERE
        T3.my_id = T1.my_id) >= 5

That should get you the IDs that you need. Once you've confirmed that it does what you want, you can JOIN back to the original table and delete by IDs.

There's probably an improvement possible somewhere that doesn't require the DISTINCT. I'll give it a little more thought.

Edit - the following approach might be better than N squared performance, depending on the optimizer. If all 5 columns are indexed it should only need 6 index seeks per row, which is still N * logN. It does seem a little dopey though.

You could code generate the where condition based on all the permutations of 5 matches: so the records to delete would be given by:

SELECT * FROM SillyTable ToDelete WHERE EXISTS
(
    SELECT PK From SillyTable Duplicate 
    WHERE (   (
            (Duplicate.dOne=ToDelete.dOne) 
            AND (Duplicate.dTwo=ToDelete.dTwo) 
            AND (Duplicate.dThree=ToDelete.dThree)
            AND (Duplicate.dFour=ToDelete.dFour)
            AND (Duplicate.dFive=ToDelete.dFive)
        ) OR (
            (Duplicate.dOne=ToDelete.dTwo) 
            AND (Duplicate.dTwo=ToDelete.dThree) 
            AND (Duplicate.dThree=ToDelete.dFour)
            AND (Duplicate.dFour=ToDelete.dFive)
            AND (Duplicate.dFive=ToDelete.dSix)
        ) OR (
            (Duplicate.dTwo=ToDelete.dOne) 
            AND (Duplicate.dThree=ToDelete.dTwo) 
            AND (Duplicate.dFour=ToDelete.dThree)
            AND (Duplicate.dFive=ToDelete.dFour)
            AND (Duplicate.dSix=ToDelete.dFive)
        ) OR (
            (Duplicate.dTwo=ToDelete.dTwo) 
            AND (Duplicate.dThree=ToDelete.dThree) 
            AND (Duplicate.dFour=ToDelete.dFour)
            AND (Duplicate.dFive=ToDelete.dFive)
            AND (Duplicate.dSix=ToDelete.dSix)
        ) ...

This goes on to cover all 36 combinations (there is one non-match on each side of the join, out of 6 possible columns, so 6*6 gives you all the possibilites). I would code generate this because it's a lot of typing, and what if you want 4 out of 6 matches tomorrow, but you could hand code it I guess.