I am looking to optimise a concatenation across multiple rows, and having read some similar questions am familiar with using STUFF + XML path etc. However, when I apply these to my query it usually times out when applying to the 9 million rows or so rows I have
What I'm looking for is a more efficient way of translating this:
create table #fruit
(
Contact_id NVARCHAR(50)
,fruit_type NVARCHAR(50)
,[2005_orders] int
,[2006_orders] int
,[2007_orders] int
,[2008_orders] int
,[2009_orders] int
)
INSERT INTO #fruit VALUES ('id001','banana',1,3,0,25,4)
INSERT INTO #fruit VALUES ('id001','apple',0,7,19,1,0)
INSERT INTO #fruit VALUES ('id001','orange',0,0,0,9,0)
INSERT INTO #fruit VALUES ('id001','strawberry',1,1,1,1,4)
INSERT INTO #fruit VALUES ('id001','grapes',0,3,0,0,0)
INSERT INTO #fruit VALUES ('id001','lemon',1,1,1,0,0)
Into this:
CREATE TABLE #results
(
contact_id NVARCHAR(255)
,fruit_type NVARCHAR(50)
,[2005_orders] int
,[2006_orders] int
,[2007_orders] int
,[2008_orders] int
,[2009_orders] int
,combination2005 NVARCHAR(500)
,combination2006 NVARCHAR(500)
,combination2007 NVARCHAR(500)
,combination2008 NVARCHAR(500)
,combination2009 NVARCHAR(500)
)
INSERT INTO #results VALUES ('id001','banana',1,3,0,25,4,'banana + strawberry + lemon','banana + apple + strawberry + grapes + lemon','apple + strawberry + lemon','banana + apple + orange + strawberry','banana + strawberry')
INSERT INTO #results VALUES ('id001','apple',0,7,19,1,0,'banana + strawberry + lemo开发者_如何学Gon','banana + apple + strawberry + grapes + lemon','apple + strawberry + lemon','banana + apple + orange + strawberry','banana + strawberry')
INSERT INTO #results VALUES ('id001','orange',0,0,0,9,0,'banana + strawberry + lemon','banana + apple + strawberry + grapes + lemon','apple + strawberry + lemon','banana + apple + orange + strawberry','banana + strawberry')
INSERT INTO #results VALUES ('id001','strawberry',1,1,1,1,4,'banana + strawberry + lemon','banana + apple + strawberry + grapes + lemon','apple + strawberry + lemon','banana + apple + orange + strawberry','banana + strawberry')
INSERT INTO #results VALUES ('id001','grapes',0,3,0,0,0,'banana + strawberry + lemon','banana + apple + strawberry + grapes + lemon','apple + strawberry + lemon','banana + apple + orange + strawberry','banana + strawberry')
INSERT INTO #results VALUES ('id001','lemon',1,1,1,0,0,'banana + strawberry + lemon','banana + apple + strawberry + grapes + lemon','apple + strawberry + lemon','banana + apple + orange + strawberry','banana + strawberry')
Where the key things to factor in are that I want a row per fruit type per contact (as this table will be used elsewhere) and that I only want a fruit to make it into the combination type if the count is greater than 0.
It might be that this isn't ever going to be very efficient given the number of rows I'm dealing with, but if there's any chance I can append this information onto my table that would be great :)
Methods tried
Method 1)
SELECT *
,STUFF(
(SELECT ' ' + fruit_type
FROM #fruit fr2
WHERE fr.contact_id = fr2.contact_id
AND 2005_orders > 0
order by contact_id,fruit_type
FOR XML path ('')
)
,1,1,''
) AS combination
FROM #fruit fr
Method 2)
SELECT *
,ISNULL((MAX(CASE WHEN fruit_type = 'banana' AND 2005_orders > 0 THEN 'banana ' END) OVER (PARTITION BY contact_id)),'')+
ISNULL((MAX(CASE WHEN fruit_type = 'apple' AND 2005_orders > 0 THEN 'apple ' END) OVER (PARTITION BY contact_id)),'')+
ISNULL((MAX(CASE WHEN fruit_type = 'orange' AND 2005_orders > 0 THEN 'orange' END) OVER (PARTITION BY contact_id)),'')+
ISNULL((MAX(CASE WHEN fruit_type = 'strawberry' AND 2005_orders > 0 THEN 'strawberry ' END) OVER (PARTITION BY contact_id)),'')+
ISNULL((MAX(CASE WHEN fruit_type = 'grapes' AND 2005_orders > 0 THEN 'grapes ' END) OVER (PARTITION BY contact_id)),'')+
ISNULL((MAX(CASE WHEN fruit_type = 'lemon' AND 2005_orders > 0 THEN 'lemon ' END) OVER (PARTITION BY contact_id)),'')+
AS combination05
FROM #fruit fr
-- which is then repeated for years 2006-2009 (which I know is hideously inefficient!)
The performance issue with both of your methods is going to be the subquery. Try this strategy to break it apart and avoid subqueries.
You don't need to use outer joins if you are guaranteed to have records for each contact_id/fruit_type combo.
Index on contact_id should vastly improve performance.
SELECT
f.*
, combination2005 =
CASE WHEN b.[2005_orders] = 0 OR b.[2005_orders] IS NULL THEN '' ELSE 'banana + ' END
+ CASE WHEN a.[2005_orders] = 0 OR a.[2005_orders] IS NULL THEN '' ELSE 'apple + ' END
+ CASE WHEN o.[2005_orders] = 0 OR o.[2005_orders] IS NULL THEN '' ELSE 'orange + ' END
+ CASE WHEN s.[2005_orders] = 0 OR s.[2005_orders] IS NULL THEN '' ELSE 'strawberry + ' END
, combination2006 =
CASE WHEN b.[2006_orders] = 0 OR b.[2006_orders] IS NULL THEN '' ELSE 'banana + ' END
+ CASE WHEN a.[2006_orders] = 0 OR a.[2006_orders] IS NULL THEN '' ELSE 'apple + ' END
+ CASE WHEN o.[2006_orders] = 0 OR o.[2006_orders] IS NULL THEN '' ELSE 'orange + ' END
+ CASE WHEN s.[2006_orders] = 0 OR s.[2006_orders] IS NULL THEN '' ELSE 'strawberry + ' END
FROM
#fruit f
LEFT OUTER JOIN
( SELECT * FROM #fruit WHERE fruit_type = 'banana' ) b
ON
f.contact_id = b.contact_id
LEFT OUTER JOIN
( SELECT * FROM #fruit WHERE fruit_type = 'apple' ) a
ON
f.contact_id = a.contact_id
LEFT OUTER JOIN
( SELECT * FROM #fruit WHERE fruit_type = 'orange' ) o
ON
f.contact_id = o.contact_id
LEFT OUTER JOIN
( SELECT * FROM #fruit WHERE fruit_type = 'strawberry' ) s
ON
f.contact_id = s.contact_id
and don't use "SELECT *", I'm just being lazy.
I should add that if you don't expect every contact_id has a record for every fruit_type (thus you need to use outer joins here), then the case expressions should also test for null in addition to zero. (Added that above)
It may not be appropriate/possible in your environment, but given that these figures relate to the past, have you considered pre-generating the concatenated string per contact per year into a table, using whatever method you currently have, and joining that to your results?
You might have to use a scheduled job to maintain the string for the current year, but it seems unlikely that the data for 2005-9 is going to change.
I believe the timeout can be traced to the amount of redundant data you are calculating. When I looked at your samples it seems the list of fruits are calculated for each row in the table. I have take a different approach which might help.
I used the select below to place the fruit name in place of the sales value.
SELECT
contact_id
,substring([fruit_type],1,cast(CAST([2005_orders] as bit)as int)*50) AS [2005_fruit]
,substring([fruit_type],1,cast(CAST([2006_orders] as bit)as int)*50) AS [2006_fruit]
,substring([fruit_type],1,cast(CAST([2007_orders] as bit)as int)*50) AS [2007_fruit]
,substring([fruit_type],1,cast(CAST([2008_orders] as bit)as int)*50) AS [2008_fruit]
,substring([fruit_type],1,cast(CAST([2009_orders] as bit)as int)*50) AS [2009_fruit]
from #fruit
contact_id 2005_fruit 2006_fruit 2007_fruit 2008_fruit 2009_fruit
-------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- --------------------------------------------------
id001 banana banana banana banana
id001 apple apple apple
id001 orange
id001 strawberry strawberry strawberry strawberry strawberry
id001 grapes
id001 lemon lemon lemon
I believe the next step is to pivot the output of the select shown concatenating the fruit. The target should be a table which could be joined to the origional fruit table. If needed the table could be hardened into the results table except there is way to much redundant data which will kill you IO.
精彩评论