Aggregating one column based on the distinct value of another column_问答_开发者

This scenario is based upon a schema in another question and I'm not interested in any discussion about the validity of the schema!

I'm interested to know if there are any good techniques in SQL Server to perform an aggregation of one column (amount1开发者_如何学C below) based on the distinct value of another column (id1).

Plan1 below scans table1 twice, performs two aggregations by p_id then joins the result together. It seems as though this could be improved upon. Query 2 could return the wrong result in some circumstances and the plan is worse anyway!

Any ideas?

DDL

IF OBJECT_ID('tempdb..#table1') IS NOT NULL DROP TABLE #table1;
IF OBJECT_ID('tempdb..#table2') IS NOT NULL DROP TABLE #table2;

CREATE TABLE #table1 (id1 int primary key nonclustered, amount1 int, p_id int);
CREATE CLUSTERED INDEX ix ON #table1 (p_id,id1);
INSERT INTO #table1
SELECT 1,500,10 UNION ALL
SELECT 2,700,20 UNION ALL
SELECT 3,500,10 UNION ALL
SELECT 4,450,20 UNION ALL
SELECT 5,300,10;

CREATE TABLE #table2 (id2 int primary key, amount2 int, id1 int);
INSERT INTO #table2
SELECT 1,300,1 UNION ALL
SELECT 2,200,1 UNION ALL
SELECT 3,200,2 UNION ALL
SELECT 4,500,2 UNION ALL
SELECT 5,400,3 UNION ALL
SELECT 6,150,4 UNION ALL
SELECT 7,300,4 UNION ALL
SELECT 8,300,5;

Query 1

WITH t1
     AS (SELECT p_id,SUM(amount1) AS total1
         FROM   #table1
         GROUP  BY p_id),
     t2
     AS (SELECT p_id,SUM(amount2) AS total2
         FROM   #table2 table2
                JOIN #table1 table1
                  ON table1.id1 = table2.id1
         GROUP  BY p_id)
SELECT t1.p_id,total1,total2
FROM   t1
       JOIN t2
         ON t1.p_id = t2.p_id

Plan 1

Aggregating one column based on the distinct value of another column

Query 2

SELECT table1.p_id, 
       FLOOR(SUM(DISTINCT amount1 + table1.id1/100000000.0)) AS total1, 
       SUM(amount2) AS total2
FROM #table1 table1 JOIN #table2 table2 ON table1.id1=table2.id1
GROUP BY table1.p_id

Plan 2

Aggregating one column based on the distinct value of another column

This one will scan each record in either table only once:

SELECT  p_id, SUM(amount1) AS total1, SUM(s_amount2) AS total2
FROM    #table1 t1
CROSS APPLY
        (
        SELECT  SUM(amount2) AS s_amount2
        FROM    #table2 t2
        WHERE   t2.id1 = t1.id1
        ) t2
GROUP BY
        p_id

  |--Compute Scalar(DEFINE:([Expr1006]=CASE WHEN [Expr1026]=(0) THEN NULL ELSE [Expr1027] END, [Expr1007]=CASE WHEN [Expr1028]=(0) THEN NULL ELSE [Expr1029] END))
       |--Stream Aggregate(GROUP BY:([t1].[p_id]) DEFINE:([Expr1026]=COUNT_BIG([tempdb].[dbo].[#table1].[amount1] as [t1].[amount1]), [Expr1027]=SUM([tempdb].[dbo].[#table1].[amount1] as [t1].[amount1]), [Expr1028]=COUNT_BIG([Expr1005]), [Expr1029]=SUM([Expr1005])))
            |--Nested Loops(Left Outer Join, OUTER REFERENCES:([t1].[id1]))
                 |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[#table1] AS [t1]), ORDERED FORWARD)
                 |--Compute Scalar(DEFINE:([Expr1005]=CASE WHEN [Expr1024]=(0) THEN NULL ELSE [Expr1025] END))
                      |--Stream Aggregate(DEFINE:([Expr1024]=COUNT_BIG([tempdb].[dbo].[#table2].[amount2] as [t2].[amount2]), [Expr1025]=SUM([tempdb].[dbo].[#table2].[amount2] as [t2].[amount2])))
                           |--Clustered Index Scan(OBJECT:([tempdb].[dbo].[#table2] AS [t2]), WHERE:([tempdb].[dbo].[#table2].[id1] as [t2].[id1]=[tempdb].[dbo].[#table1].[id1] as [t1].[id1]))

, though it's not necessarily more efficient.

This one:

SELECT  p_id, SUM(amount1) AS total1, SUM(s_amount2) AS total2
FROM    #table1 t1
JOIN    (
        SELECT  id1, SUM(amount2) AS s_amount2
        FROM    #table2
        GROUP BY
                id1
        ) t2
ON      t2.id1 = t1.id1
GROUP BY
        p_id

will do the same with more options for the joins, however, an extra spool may be used in the plan if t2 will be chosen leading.

Well, @Quassnoi solution seems pretty good. In any case, for SQL Server 2005+, you can use the PARTITION BY clause to try and make a simpler query, but the execution plan is not better, though it doesn't necessarily mean is more or less efficient.

SELECT A.p_id, MIN(amount1) total1, SUM(amount2) total2
FROM (SELECT p_id, id1, SUM(amount1) OVER(PARTITION BY p_id) amount1 FROM #table1) A
JOIN #table2 B
ON A.id1 = B.id1
GROUP BY A.p_id