Selecting from a subset based on a value inside the same subset?

I have created a table like this:

CREATE TABLE #TEMP(RecordDate datetime, First VARCHAR(255), Last VARCHAR(255), Value int)

INSERT INTO #TEMP VALUES('2011-03-01 00:00:00.000','john','smith','10')
INSERT INTO #TEMP VALUES('2011-03-01 00:00:00.000','john','adams','60')
INSERT INTO #TEMP VALUES('2011-03-01 00:00:00.000','john','resig','90')
INSERT INTO #TEMP VALUES('2011-03-01 00:00:00.000','john','balte','95')

INSERT INTO #TEMP VALUES('2011-03-01 01:00:00.000','john','smith','98')
INSERT INTO #TEMP VALUES('2011-03-01 01:00:00.000','john','adams','67')
INSERT INTO #TEMP VALUES('2011-03-01 01:00:00.000','john','resig','24')
INSERT INTO #TEMP VALUES('2011-03-01 01:00:00.000','john','balte','20')

SELECT * FROM #TEMP

DROP TABLE #TEMP

which now contains the following records:

RecordDate              First   Last    Value
2011-03-01 00:00:00.000 john    smith   10
2011-03-01 00:00:00.000 john    adams   60
2011-03-01 00:00:00.000 john    resig   90
2011-03-01 00:00:00.000 john    balte   95
2011-03-01 01:00:00.000 john    smith   98
2011-03-01 01:00:00.000 john    adams   67
2011-03-01 01:00:00.000 john    resig   24
2011-03-01 01:00:00.000 john    balte   20

I am trying to obtain a table like the following:

RecordDate                first    Good     Bad
2011-03-01 00:00:00.000   john     3        1
2011-03-01 01:00:00.000   john     2        2

I compute Good and Bad as follows: for each date and first name, take the MAX of Value, then use it as a filter on the rows for that same date and first name. Only values greater than 0.5 * MAX(Value) are considered Good; the rest are Bad.

In the result table, the first row has 3 good values because the maximum for the first date is 95 and only 60, 90, and 95 are greater than 0.5 * 95 = 47.5, so (Good, Bad) = (3, 1). Likewise, for the second date the maximum is 98, the threshold is 49, and only 98 and 67 exceed it, so the result is (2, 2).

My table is large, close to 300 million records, and I don't know where to start to do this efficiently. Any suggestions on what an efficient approach might look like?

My current (working but expensive) approach is given below:

SELECT    RecordDate
        , First
        ,
        (
            SELECT COUNT(*)
            FROM #TEMP
            WHERE Value > 0.5*(SELECT MAX(Value) FROM #TEMP WHERE RecordDate = A.RecordDate AND First = A.First)
            AND RecordDate = A.RecordDate AND First = A.First
        ) AS Good
        ,
        (
            SELECT COUNT(*)
            FROM #TEMP
            WHERE Value < 0.5*(SELECT MAX(Value) FROM #TEMP WHERE RecordDate = A.RecordDate AND First = A.First)
            AND RecordDate = A.RecordDate AND First = A.First
        ) AS Bad
FROM #TEMP A
GROUP BY RecordDate, First;

Here you go:

select 
   t.RecordDate,
   COUNT(case 
           when t.Value > MV.MaxValue * 0.5 then 1
           else null
         end) Good,
   COUNT(case 
           when t.Value <= MV.MaxValue * 0.5 then 1
           else null
         end) Bad
from #Temp t inner join
(select RecordDate, MAX(Value) MaxValue
 from #Temp Group By RecordDate) MV on t.RecordDate = MV.RecordDate
Group by t.RecordDate

The trick is to build a derived table holding the maximum value for each record date and then INNER JOIN it back to the table itself. Once the maximum is available on every joined row, you can compare against it directly.
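To see what the derived table contributes, it can be run on its own. This is a minimal sketch, assuming #TEMP is still populated with the eight sample rows from the question (that is, it runs before the DROP TABLE):

```sql
-- Assumes #TEMP from the question exists and holds the eight sample rows.
-- Produces one row per RecordDate, carrying that date's maximum Value.
SELECT RecordDate, MAX(Value) AS MaxValue
FROM #TEMP
GROUP BY RecordDate
ORDER BY RecordDate;
-- With the sample data:
--   2011-03-01 00:00:00.000   95
--   2011-03-01 01:00:00.000   98
```

Joining these two rows back to #TEMP on RecordDate puts the matching maximum next to every original row, which is what lets the CASE expressions compare t.Value against MV.MaxValue.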

Update

I see you updated your question and included the first name in the result. Never fear, here's the solution:

select 
   t.RecordDate,
   t.First,
   COUNT(case 
           when t.Value > MV.MaxValue * 0.5 then 1
           else null
         end) Good,
   COUNT(case 
           when t.Value <= MV.MaxValue * 0.5 then 1
           else null
         end) Bad
from #Temp t inner join
(select RecordDate, First, MAX(Value) MaxValue
 from #Temp Group By RecordDate, First) MV 
   on (t.RecordDate = MV.RecordDate and t.First = MV.First)
Group by t.RecordDate, t.First
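The same result can probably also be written without the self-join, by using a window aggregate to attach the group maximum to each row. This is an alternative technique rather than the query above, sketched on the assumption that the server supports MAX(...) OVER (PARTITION BY ...) (SQL Server 2005 and later) and that #TEMP holds the sample rows. Whether it is actually faster on 300 million rows would need to be checked against the real execution plans.

```sql
-- Assumes #TEMP from the question exists and holds the eight sample rows.
-- Inner query: every row plus the maximum Value of its (RecordDate, First) group.
-- Outer query: count rows above / at-or-below half of that maximum.
SELECT RecordDate,
       First,
       COUNT(CASE WHEN Value >  0.5 * MaxValue THEN 1 END) AS Good,
       COUNT(CASE WHEN Value <= 0.5 * MaxValue THEN 1 END) AS Bad
FROM (
    SELECT RecordDate,
           First,
           Value,
           MAX(Value) OVER (PARTITION BY RecordDate, First) AS MaxValue
    FROM #TEMP
) AS t
GROUP BY RecordDate, First
ORDER BY RecordDate, First;
-- With the sample data:
--   2011-03-01 00:00:00.000   john   3   1
--   2011-03-01 01:00:00.000   john   2   2
```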

The correlated subqueries (the nested queries that refer to the outer query) may be causing a lot of repeated work. This calculates the MAX for every name and date in one go:

SELECT RecordDate, First, MAX(Value) AS MaxVal FROM #TEMP GROUP BY RecordDate, First

Now join back to the original data:

SELECT A.RecordDate, A.First,
       SUM(CASE WHEN A.Value > B.MaxVal*0.5 THEN 1 ELSE 0 END) AS GOOD,
       SUM(CASE WHEN A.Value > B.MaxVal*0.5 THEN 0 ELSE 1 END) AS BAD
FROM #TEMP A INNER JOIN
     (SELECT RecordDate, First, MAX(Value) AS MaxVal
      FROM #TEMP GROUP BY RecordDate, First) B
         ON (A.RecordDate = B.RecordDate AND A.First = B.First)
GROUP BY A.RecordDate, A.First
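The join still has to read every row to find the group maxima, so on a table of this size the available indexes are likely to matter. As a sketch only: the index name below is hypothetical, #TEMP stands in for the real table, and whether the index pays off depends on the real table's width and the indexes it already has.

```sql
-- Hypothetical index name; #TEMP stands in for the real table.
-- Keyed on the grouping columns, with Value included so that both the
-- MAX aggregation and the join back can be answered from the index alone.
CREATE NONCLUSTERED INDEX IX_TEMP_RecordDate_First
    ON #TEMP (RecordDate, First)
    INCLUDE (Value);
```

Comparing the execution plan before and after creating such an index is the safest way to tell whether it helps.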