optimal concatenation across rows_问答_开发者_运维开发者技术经验分享

I am looking to optimise a concatenation across multiple rows, and having read some similar questions am familiar with using STUFF + XML path etc. However, when I apply these to my query it usually times out when applying to the 9 million rows or so rows I have

What I'm looking for is a more efficient way of translating this:

create table #fruit
(
Contact_id NVARCHAR(50)
,fruit_type NVARCHAR(50)
,[2005_orders] int
,[2006_orders] int
,[2007_orders] int
,[2008_orders] int
,[2009_orders] int
)
INSERT INTO #fruit VALUES ('id001','banana',1,3,0,25,4)
INSERT INTO #fruit VALUES ('id001','apple',0,7,19,1,0)
INSERT INTO #fruit VALUES ('id001','orange',0,0,0,9,0)
INSERT INTO #fruit VALUES ('id001','strawberry',1,1,1,1,4)
INSERT INTO #fruit VALUES ('id001','grapes',0,3,0,0,0)
INSERT INTO #fruit VALUES ('id001','lemon',1,1,1,0,0)

Into this:

CREATE TABLE #results
(
contact_id NVARCHAR(255)
,fruit_type NVARCHAR(50)
,[2005_orders] int
,[2006_orders] int
,[2007_orders] int
,[2008_orders] int
,[2009_orders] int
,combination2005 NVARCHAR(500)
,combination2006 NVARCHAR(500)
,combination2007 NVARCHAR(500)
,combination2008 NVARCHAR(500)
,combination2009 NVARCHAR(500)
)
INSERT INTO #results VALUES ('id001','banana',1,3,0,25,4,'banana + strawberry + lemon','banana + apple + strawberry + grapes + lemon','apple + strawberry + lemon','banana + apple + orange + strawberry','banana + strawberry')
INSERT INTO #results VALUES ('id001','apple',0,7,19,1,0,'banana + strawberry + lemo开发者_如何学Gon','banana + apple + strawberry + grapes + lemon','apple + strawberry + lemon','banana + apple + orange + strawberry','banana + strawberry')
INSERT INTO #results VALUES ('id001','orange',0,0,0,9,0,'banana + strawberry + lemon','banana + apple + strawberry + grapes + lemon','apple + strawberry + lemon','banana + apple + orange + strawberry','banana + strawberry')
INSERT INTO #results VALUES ('id001','strawberry',1,1,1,1,4,'banana + strawberry + lemon','banana + apple + strawberry + grapes + lemon','apple + strawberry + lemon','banana + apple + orange + strawberry','banana + strawberry')
INSERT INTO #results VALUES ('id001','grapes',0,3,0,0,0,'banana + strawberry + lemon','banana + apple + strawberry + grapes + lemon','apple + strawberry + lemon','banana + apple + orange + strawberry','banana + strawberry')
INSERT INTO #results VALUES ('id001','lemon',1,1,1,0,0,'banana + strawberry + lemon','banana + apple + strawberry + grapes + lemon','apple + strawberry + lemon','banana + apple + orange + strawberry','banana + strawberry')

Where the key things to factor in are that I want a row per fruit type per contact (as this table will be used elsewhere) and that I only want a fruit to make it into the combination type if the count is greater than 0.

It might be that this isn't ever going to be very efficient given the number of rows I'm dealing with, but if there's any chance I can append this information onto my table that would be great :)

Methods tried

Method 1)

SELECT *
,STUFF(
(SELECT ' ' + fruit_type
FROM #fruit fr2
WHERE fr.contact_id = fr2.contact_id
AND 2005_orders > 0
order by contact_id,fruit_type
FOR XML path ('')
)
,1,1,''
) AS combination
FROM #fruit fr

Method 2)

SELECT *
,ISNULL((MAX(CASE WHEN fruit_type = 'banana' AND 2005_orders > 0 THEN 'banana ' END) OVER (PARTITION BY contact_id)),'')+
ISNULL((MAX(CASE WHEN fruit_type = 'apple' AND 2005_orders > 0 THEN 'apple ' END) OVER (PARTITION BY contact_id)),'')+
ISNULL((MAX(CASE WHEN fruit_type = 'orange' AND 2005_orders > 0 THEN 'orange' END) OVER (PARTITION BY contact_id)),'')+
ISNULL((MAX(CASE WHEN fruit_type = 'strawberry' AND 2005_orders > 0 THEN 'strawberry ' END) OVER (PARTITION BY contact_id)),'')+
ISNULL((MAX(CASE WHEN fruit_type = 'grapes' AND 2005_orders > 0 THEN 'grapes ' END) OVER (PARTITION BY contact_id)),'')+
ISNULL((MAX(CASE WHEN fruit_type = 'lemon' AND 2005_orders > 0 THEN 'lemon ' END) OVER (PARTITION BY contact_id)),'')+
AS combination05
FROM #fruit fr

-- which is then repeated for years 2006-2009 (which I know is hideously inefficient!)

The performance issue with both of your methods is going to be the subquery. Try this strategy to break it apart and avoid subqueries.

You don't need to use outer joins if you are guaranteed to have records for each contact_id/fruit_type combo.

Index on contact_id should vastly improve performance.

SELECT
 f.*
 , combination2005 = 
     CASE WHEN b.[2005_orders] = 0 OR b.[2005_orders] IS NULL THEN '' ELSE 'banana + ' END
     + CASE WHEN a.[2005_orders] = 0 OR a.[2005_orders] IS NULL THEN '' ELSE 'apple + ' END
     + CASE WHEN o.[2005_orders] = 0 OR o.[2005_orders] IS NULL THEN '' ELSE 'orange + ' END
     + CASE WHEN s.[2005_orders] = 0 OR s.[2005_orders] IS NULL THEN '' ELSE 'strawberry + ' END
  , combination2006 = 
     CASE WHEN b.[2006_orders] = 0 OR b.[2006_orders] IS NULL THEN '' ELSE 'banana + ' END
     + CASE WHEN a.[2006_orders] = 0 OR a.[2006_orders] IS NULL THEN '' ELSE 'apple + ' END
     + CASE WHEN o.[2006_orders] = 0 OR o.[2006_orders] IS NULL THEN '' ELSE 'orange + ' END
     + CASE WHEN s.[2006_orders] = 0 OR s.[2006_orders] IS NULL THEN '' ELSE 'strawberry + ' END
FROM
 #fruit f
LEFT OUTER JOIN
 ( SELECT * FROM #fruit WHERE fruit_type = 'banana' ) b
ON
  f.contact_id = b.contact_id
LEFT OUTER JOIN
 ( SELECT * FROM #fruit WHERE fruit_type = 'apple' ) a
ON
  f.contact_id = a.contact_id
LEFT OUTER JOIN
 ( SELECT * FROM #fruit WHERE fruit_type = 'orange' ) o
ON
  f.contact_id = o.contact_id
LEFT OUTER JOIN
 ( SELECT * FROM #fruit WHERE fruit_type = 'strawberry' ) s
ON
  f.contact_id = s.contact_id

and don't use "SELECT *", I'm just being lazy.

I should add that if you don't expect every contact_id has a record for every fruit_type (thus you need to use outer joins here), then the case expressions should also test for null in addition to zero. (Added that above)

It may not be appropriate/possible in your environment, but given that these figures relate to the past, have you considered pre-generating the concatenated string per contact per year into a table, using whatever method you currently have, and joining that to your results?

You might have to use a scheduled job to maintain the string for the current year, but it seems unlikely that the data for 2005-9 is going to change.

I believe the timeout can be traced to the amount of redundant data you are calculating. When I looked at your samples it seems the list of fruits are calculated for each row in the table. I have take a different approach which might help.

I used the select below to place the fruit name in place of the sales value.

SELECT 
contact_id
,substring([fruit_type],1,cast(CAST([2005_orders] as bit)as int)*50) AS [2005_fruit]
,substring([fruit_type],1,cast(CAST([2006_orders] as bit)as int)*50) AS [2006_fruit]
,substring([fruit_type],1,cast(CAST([2007_orders] as bit)as int)*50) AS [2007_fruit]
,substring([fruit_type],1,cast(CAST([2008_orders] as bit)as int)*50) AS [2008_fruit]
,substring([fruit_type],1,cast(CAST([2009_orders] as bit)as int)*50) AS [2009_fruit]
from #fruit

contact_id                                         2005_fruit                                         2006_fruit                                         2007_fruit                                         2008_fruit                                         2009_fruit
-------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- -------------------------------------------------- --------------------------------------------------
id001                                              banana                                             banana                                                                                                banana                                             banana
id001                                                                                                 apple                                              apple                                              apple                                              
id001                                                                                                                                                                                                       orange                                             
id001                                              strawberry                                         strawberry                                         strawberry                                         strawberry                                         strawberry
id001                                                                                                 grapes                                                                                                                                                   
id001                                              lemon                                              lemon                                              lemon

I believe the next step is to pivot the output of the select shown concatenating the fruit. The target should be a table which could be joined to the origional fruit table. If needed the table could be hardened into the results table except there is way to much redundant data which will kill you IO.