Fastest Way to Count Distinct Values in a Column, Including NULL Values_问答_开发者

The Transact-Sql Count Distinct operation counts all non-null values in a column. I need to count the number of distinct values per column in a set of tables, including null values (so if there is a null in the column, the result should be (Select Count(Distinct COLNAME) From TABLE) + 1.

This is going to be repeated over every column in every table in the DB. Includes hundreds of tables, some of which have over 1M rows. Because this needs to be done over every single column, adding Indexes for every column is not a good option.

This will be done as part of an ASP.net site, so integration with code logic is also ok (i.e.: this doesn't have to be completed as part of one query, though if that can be done with good performance, then even better).

What is the most efficient way to do this?

Update After Testing

I tested the different methods from the answers given on a good representative table. The table has 3.2 million records, dozens of columns (a few with indexes, most without). One column has 3.2 million unique values. Other columns range from all Null (one value) to a max of 40K unique values. For each method I performed four tests (with multiple attempts at each, averaging the results): 20 columns at one time, 5 columns at one time, 1 column with many values (3.2M) and 1 column with a small number of values (167). Here are the results, in order of fastest to slowest

Count/GroupBy (Cheran)
CountDistinct+SubQuery (Ellis)
dense_rank (Eriksson)
Count+Max (Andriy)

Testing Results (in seconds):

   开发者_运维问答Method          20_Columns   5_Columns   1_Column (Large)   1_Column (Small)
1) Count/GroupBy      10.8          4.8            2.8               0.14       
2) CountDistinct      12.4          4.8            3                 0.7         
3) dense_rank        226           30              6                 4.33 
4) Count+Max          98.5         44             16                12.5

Notes:

Interestingly enough, the two methods that were fastest (by far, with only a small difference in between then) were both methods that submitted separate queries for each column (and in the case of result #2, the query included a subquery, so there were really two queries submitted per column). Perhaps because the gains that would be achieved by limiting the number of table scans is small in comparison to the performance hit taken in terms of memory requirements (just a guess).
Though the dense_rank method is definitely the most elegant, it seems that it doesn't scale well (see the result for 20 columns, which is by far the worst of the four methods), and even on a small scale just cannot compete with the performance of Count.

Thanks for the help and suggestions!

SELECT COUNT(*)
FROM (SELECT ColumnName
      FROM TableName
      GROUP BY ColumnName) AS s;

GROUP BY selects distinct values including NULL. COUNT(*) will include NULLs, as opposed to COUNT(ColumnName), which ignores NULLs.

I think you should try to keep the number of table scans down and count all columns in one table in one go. Something like this could be worth trying.

;with C as
(
  select dense_rank() over(order by Col1) as dnCol1,
         dense_rank() over(order by Col2) as dnCol2
  from YourTable
)
select max(dnCol1) as CountCol1,
       max(dnCol2) as CountCol2
from C

Test the query at SE-Data

A development on OP's own solution:

SELECT
  COUNT(DISTINCT acolumn) + MAX(CASE WHEN acolumn IS NULL THEN 1 ELSE 0 END)
FROM atable

Run one query that Counts the number of Distinct values and adds 1 if there are any NULLs in the column (using a subquery)

Select Count(Distinct COLUMNNAME) +
       Case When Exists 
                 (Select * from TABLENAME Where COLUMNNAME is Null) 
            Then 1 Else 0 End
From TABLENAME

You can try:

count(
distinct coalesce(
    your_table.column_1, your_table.column_2
    -- cast them if you want replace value from column are not same type
    )
) as COUNT_TEST

Function coalesce help you combine two columns with replace not null values.

I used this in mine case and success with correctly result.

Not sure this would be the fastest but might be worth testing. Use case to give null a value. Clearly you would need to select a value for null that would not occur in the real data. According to the query plan this would be a dead heat with the count(*) (group by) solution proposed by Cheran S.

    SELECT 
        COUNT( distinct
                 (case when [testNull] is null then 'dbNullValue' else [testNull] end)
             )
    FROM [test].[dbo].[testNullVal]

With this approach can also count more than one column

    SELECT 
        COUNT( distinct
                 (case when [testNull1] is null then 'dbNullValue' else [testNull1] end)
             ),
        COUNT( distinct
                 (case when [testNull2] is null then 'dbNullValue' else [testNull2] end)
             )
    FROM [test].[dbo].[testNullVal]