开发者

Median values in T-SQL

开发者 https://www.devze.com 2023-04-01 13:29 出处:网络
For even rows, the formula for median is (104.5 + 108)/2 for the table below and for odd r开发者_运维百科ows it is 108 for table below:

For even rows, the formula for median is (104.5 + 108)/2 for the table below and for odd r开发者_运维百科ows it is 108 for table below:

Total       Total

100         100
101         101
104.5       104.5
108         108
108.3       108.3
112         112
            114

I wrote this query, and it is calculating the correct median when the number of rows are odd:

WITH    a AS ( SELECT   Total ,
                        ROW_NUMBER() OVER ( ORDER BY CAST(Total AS FLOAT) ASC ) rownumber
               FROM     [Table] A
             ),
        b AS ( SELECT TOP 2
                        Total ,
                        isodd
               FROM     ( SELECT TOP 50 PERCENT
                                    Total ,
                                    rownumber % 2 isodd
                          FROM      a
                          ORDER BY  CAST(Total AS FLOAT) ASC
                        ) a
               ORDER BY CAST(total AS FLOAT) DESC
             )
    SELECT  *
    FROM    b

What is the general T-SQL query to find the median in both situations? Like when the number of rows are odd and also when the number of rows is even?

Could my query be twisted so that it can work for the median in both even and odd number of rows situations?


I wrote a blog about Mean, Median and Mode a couple years ago. I encourage you to read it.

Calculating Mean, Median, and Mode with SQL Server

SELECT ((
        SELECT TOP 1 Total
        FROM   (
                SELECT  TOP 50 PERCENT Total
                FROM    [TABLE] A
                WHERE   Total IS NOT NULL
                ORDER BY Total
                ) AS A
        ORDER BY Total DESC) +
        (
        SELECT TOP 1 Total
        FROM   (
                SELECT  TOP 50 PERCENT Total
                FROM    [TABLE] A
                WHERE   Total IS NOT NULL
                ORDER BY Total DESC
                ) AS A
        ORDER BY Total ASC)) / 2


I know you were looking for a solution that works with SQL Server 2008, but in case anyone is looking for the MEDIAN() aggregate function in SQL Server 2012, they can emulate it using the PERCENTILE_CONT() inverse distribution function:

WITH t(value) AS (
  SELECT 1   UNION ALL
  SELECT 2   UNION ALL
  SELECT 100 
)
SELECT DISTINCT
  percentile_cont(0.5) WITHIN GROUP (ORDER BY value) OVER (PARTITION BY 1)
FROM
  t;

This emulation of MEDIAN() via PERCENTILE_CONT() is also documented here. Unfortunately, SQL Server only supports this function as a window function, not as a regular ordered-set aggregate function like Oracle or PostgreSQL.


An example for issue mentioned in my comment to the accepted answer:

select avg(Total) median from
(
select Total, 
rnasc = row_number() over(order by Total),
rndesc = row_number() over(order by Total desc)
 from [Table] 
) b
where rnasc between rndesc - 1 and rndesc + 1

This snippet is not guaranteed to work if there are duplicate values in the input dataset - therefore row_number() will not provide expected values.

For example for the input:

DROP TABLE #b
CREATE TABLE #b (id INT IDENTITY, Total INT)
INSERT INTO #b 
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT  5 
UNION ALL SELECT  5 UNION ALL SELECT  5

Inner query returns (I guess it may differs on different servers):

Total   rnasc   rndesc
5       3      1
5       4      2
5       5      3
1       1      4
1       2      5

Runnig outer query will result into NULL (as there is no row where rnasc between rndesc - 1 and rndesc + 1)

Simple solution is add some surrogate key (I used identity column) into data set and include this column in OVER() clause:

SELECT avg(Total) median from
(
SELECT Total, 
rnasc = row_number() over(order by Total, id),
rndesc = row_number() over(order by Total DESC, id desc)
 from #b
) b
WHERE rnasc between rndesc - 1 and rndesc + 1

Now sorting order is guaranteed and inner query returns:

Total   rnasc   rndesc
5       5       1
5       4       2
5       3       3
1       2       4
1       1       5

And result is correct :)


t-clausens answer unfortunately does not work correctly, when there are lots of duplicate values in the list. Then the row numbers generated by different OVER clauses are not predictable in way, that this query works.

The following worked well in my case:

WITH SortedTable AS
    (
        SELECT Total, 
               rnasc, 
               rndesc = ROW_NUMBER() OVER(ORDER BY rnasc DESC)
        FROM ( 
               SELECT Total, 
                      rnasc = ROW_NUMBER() OVER(ORDER BY Total)
               FROM   [Table]
             ) SourceTable
    )
SELECT DISTINCT AVG(Total) median 
FROM   SortedTable
WHERE  rnasc = rndesc OR ABS(rnasc-rndesc) = 1

The WHERE clause now also clearly distinguishes between even and odd number of records.


I know this is an ancient question but for other people's sake I am posting this anyway. The performance of PERCENTILE_COUNT(0.5) is stupid slow. I have a table with 4.9 million records and PERCENTILE_COUNT(0.5) took 52 seconds. G Mastros answer above is better (and my favorite, except for mine) but it still took 35 seconds on my table. I tweaked his solution to do the following and it ran in 7 seconds without an index on the column. When I added an index it dropped to 2 seconds. All I did was replace the 50 PERCENT with an integer division of the record count in the table.

DECLARE @Cnt int = (SELECT COUNT(*) FROM [TABLE]);

SELECT ((
    SELECT TOP 1 Total
    FROM   (
            SELECT  TOP (@Cnt/2) Total
            FROM    [TABLE] A
            WHERE   Total IS NOT NULL
            ORDER BY Total
            ) AS A
    ORDER BY Total DESC) +
    (
    SELECT TOP 1 Total
    FROM   (
            SELECT  TOP (@Cnt/2) Total
            FROM    [TABLE] A
            WHERE   Total IS NOT NULL
            ORDER BY Total DESC
            ) AS A
    ORDER BY Total ASC)) / 2
0

精彩评论

暂无评论...
验证码 换一张
取 消