I am working on a problem in SQL Server 2008
I have a table with six columns:
PK INT
dOne SmallINT
dTwo SmallINT
dThree SmallINT
dFour SmallINT
dFiveSmallINT
dSix SmallINT
The table contains around a million recrods. It's probably worth noting that value in column n+1 > value in column n i.e. 97, 98, 99, 120, 135. I am trying to eliminate all rows which have 5 DIGITS in common (ignoring the PK) i.e.:
76, 89, 99, 102, 155, 122
11, 89,开发者_JAVA百科 99, 102, 155, 122
89, 99, 102, 155, 122, 130
In this case the algorithm should start at the first row and delete the second and third rows because they contain 5 matching digits. The first row persists.
I have tried to brute force the solution but finding all the duplicates for only the first record takes upwards of 25 seconds meaning processing the whole table would take... way too long (this should be a repeatable process).
I am fairly new to SQL but this is what I have come up with (I have come up with a few solutions but none were adequate... this is the latest attempt):
(I won't include all the code but I will explain the method, I can paste more if it helps)
Save the digits of record n into variables. SELECT all records which have one digit in common with record n FROM largeTable.
Insert all selected digits into #oneMatch and include [matchingOne] with the digit that matched.
Select all records which have one digit in common with record n FROM the temp table WHERE 'digit in common' != [matching]. INSERT all selected digits into #twoMatch and include [matchingOne] AND [matchingTwo]...
Repeat until inserting into #fiveMatch. Delete #fiveMatch from largeTable and move to record n+1
I am having a problem implementing this solution. How can I assign the matching variable depending on the WHERE clause?
-- SELECT all records with ONE matching field:
INSERT INTO #oneMatch (ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix, mOne)
SELECT ID_pk, dOne, dTwo, dThree, dFour, dFive, dSix
FROM dbo.BaseCombinationsExtended
WHERE ( [dOne] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) **mOne = dOne?
OR [dTwo] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) **mOne = dTwo?
OR [dTwo] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) **mOne = dThree?
...
OR [dSix] IN (@dOne, @dTwo, @dThree, @dFour, @dFive, @dSix) **mOne = dSix?
)
I am able to 'fake' the above using six queries but that is too inefficient...
Sorry for the long description. Any help would be greatly appreciated (new solution or implementation of my attempt above) as this problem has been nagging at me for a while...
Unless I miss something this should produce the correct result.
declare @T table
(
PK INT identity primary key,
dOne SmallINT,
dTwo SmallINT,
dThree SmallINT,
dFour SmallINT,
dFive SmallINT,
dSix SmallINT
)
insert into @T values
(76, 89, 99, 102, 155, 122),
(11, 89, 99, 102, 155, 122),
(89, 99, 102, 155, 122, 130)
;with q1(PK, d1, d2, d3, d4, d5) as
(
select PK, dTwo, dThree, dFour, dFive, dSix
from @T
union all
select PK, dOne, dThree, dFour, dFive, dSix
from @T
union all
select PK, dOne, dTwo, dFour, dFive, dSix
from @T
union all
select PK, dOne, dTwo, dThree, dFive, dSix
from @T
union all
select PK, dOne, dTwo, dThree, dFour, dSix
from @T
union all
select PK, dOne, dTwo, dThree, dFour, dFive
from @T
),
q2 as
(
select PK,
row_number() over(partition by d1, d2, d3, d4, d5 order by PK) as rn
from q1
),
q3 as
(
select PK
from q2
where rn = 1
group by PK
having count(*) = 6
)
select T.*
from @T as T
inner join q3 as Q
on T.PK = Q.PK
I can't make any promises on performance, but you can try this. The first thing that I do is put the data into a more normalized structure.
CREATE TABLE dbo.Test_Sets_Normalized (my_id INT NOT NULL, c SMALLINT NOT NULL)
GO
INSERT INTO dbo.Test_Sets_Normalized (my_id, c)
SELECT my_id, c1 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c2 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c3 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c4 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c5 FROM dbo.Test_Sets UNION ALL
SELECT my_id, c6 FROM dbo.Test_Sets
GO
SELECT DISTINCT
T2.my_id
FROM
(SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T1
INNER JOIN (SELECT DISTINCT my_id FROM dbo.Test_Sets_Normalized) T2 ON T2.my_id > T1.my_id
WHERE
(
SELECT
COUNT(*)
FROM
dbo.Test_Sets_Normalized T3
INNER JOIN dbo.Test_Sets_Normalized T4 ON
T4.my_id = T2.my_id AND
T4.c = T3.c
WHERE
T3.my_id = T1.my_id) >= 5
That should get you the IDs that you need. Once you've confirmed that it does what you want, you can JOIN
back to the original table and delete by IDs.
There's probably an improvement possible somewhere that doesn't require the DISTINCT
. I'll give it a little more thought.
Edit - the following approach might be better than N squared performance, depending on the optimizer. If all 5 columns are indexed it should only need 6 index seeks per row, which is still N * logN. It does seem a little dopey though.
You could code generate the where condition based on all the permutations of 5 matches: so the records to delete would be given by:
SELECT * FROM SillyTable ToDelete WHERE EXISTS
(
SELECT PK From SillyTable Duplicate
WHERE ( (
(Duplicate.dOne=ToDelete.dOne)
AND (Duplicate.dTwo=ToDelete.dTwo)
AND (Duplicate.dThree=ToDelete.dThree)
AND (Duplicate.dFour=ToDelete.dFour)
AND (Duplicate.dFive=ToDelete.dFive)
) OR (
(Duplicate.dOne=ToDelete.dTwo)
AND (Duplicate.dTwo=ToDelete.dThree)
AND (Duplicate.dThree=ToDelete.dFour)
AND (Duplicate.dFour=ToDelete.dFive)
AND (Duplicate.dFive=ToDelete.dSix)
) OR (
(Duplicate.dTwo=ToDelete.dOne)
AND (Duplicate.dThree=ToDelete.dTwo)
AND (Duplicate.dFour=ToDelete.dThree)
AND (Duplicate.dFive=ToDelete.dFour)
AND (Duplicate.dSix=ToDelete.dFive)
) OR (
(Duplicate.dTwo=ToDelete.dTwo)
AND (Duplicate.dThree=ToDelete.dThree)
AND (Duplicate.dFour=ToDelete.dFour)
AND (Duplicate.dFive=ToDelete.dFive)
AND (Duplicate.dSix=ToDelete.dSix)
) ...
This goes on to cover all 36 combinations (there is one non-match on each side of the join, out of 6 possible columns, so 6*6 gives you all the possibilites). I would code generate this because it's a lot of typing, and what if you want 4 out of 6 matches tomorrow, but you could hand code it I guess.
精彩评论