
How to improve performance when using COUNT(DISTINCT)


My DB is SQL SERVER 2008

I have a large table with 100 million rows and 50 columns.

All the columns are of type int.

Then my queries look like this:

Select Count(distinct col5) , Count(distinct col8) , Sum(Col30) , Sum(Col49) 
Group by Col1 

Select Count(distinct col5) , Count(distinct col8) , Sum(Col30) , Sum(Col49) 
Group by Col1,Col2

.....


Select Count(distinct col5) , Count(distinct col8) , Sum(Col30) , Sum(Col49) 
Group by Col1,Col2,Col3,Col4,Col6,Col7

(about 180 queries like the above)

But the performance is very bad when I use COUNT(DISTINCT).

So, can anyone tell me how to improve it?

And in my case, how long might the queries take with the best solution?

Thanks very much for your advice.


These queries (without WHERE clauses) can hardly be optimized since they need to visit every record to calculate the sums no matter what.

A full table scan plus a temporary table to hold the results is about the best you can do, and that's most probably what you already have in your plan.
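Since all ~180 queries aggregate the same few columns, one thing that might help (an untested sketch; MyBigTable is just a placeholder for your table, and it only pays off if the distinct combinations of the grouping and distinct-counted columns are far fewer than 100 million) is to scan the big table once into a pre-aggregated temp table and run the grouped queries against that:

-- One full scan: pre-aggregate by every column that appears in any GROUP BY
-- plus the two columns that are counted distinctly.
SELECT Col1, Col2, Col3, Col4, Col6, Col7, Col5, Col8,
       SUM(Col30) AS SumCol30,
       SUM(Col49) AS SumCol49
INTO #pre
FROM MyBigTable            -- placeholder name for your 100M-row table
GROUP BY Col1, Col2, Col3, Col4, Col6, Col7, Col5, Col8;

-- Each of the ~180 queries can then run against the (hopefully much smaller) temp table.
-- The SUMs add up correctly, and COUNT(DISTINCT ...) is preserved because #pre still
-- contains every distinct combination of the grouping columns and Col5/Col8.
SELECT Col1,
       COUNT(DISTINCT Col5) AS DistinctCol5,
       COUNT(DISTINCT Col8) AS DistinctCol8,
       SUM(SumCol30)        AS SumCol30,
       SUM(SumCol49)        AS SumCol49
FROM #pre
GROUP BY Col1;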


Indexing the columns would improve performance, but inserts/updates might get slower.
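For example (a sketch only; MyBigTable is a placeholder and the column names are taken from the question), a covering index for the first query could look something like this:

-- Hypothetical covering index for: GROUP BY Col1 with COUNT(DISTINCT Col5/Col8), SUM(Col30/Col49)
CREATE NONCLUSTERED INDEX IX_MyBigTable_Col1
ON dbo.MyBigTable (Col1, Col5, Col8)   -- key columns cover the GROUP BY and distinct-counted columns
INCLUDE (Col30, Col49);                -- summed columns carried along so the base table isn't touched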

EDIT
Did some more testing with indices as suggested in the comments. I got these results (with and without indices):

[Screenshot: estimated execution plans, with and without the index]

Using this query (MAXDOP limits the query to one processor so the plans are more comparable; without the MAXDOP hint the cost split was 36% vs 64%):

SELECT COUNT(DISTINCT fkCustomCultureID), SUM(fkCustomCultureTypeID)
FROM tblTest
GROUP BY fkCategoryTypeID
OPTION (MAXDOP 1)

On this table:

CREATE TABLE [dbo].[TblTest](
    [fkID] [sql_variant] NOT NULL,
    [fkCustomCultureID] [bigint] NOT NULL,
    [fkCategoryTypeID] [int] NOT NULL,
    [fkCustomCultureTypeID] [int] NOT NULL
) ON [PRIMARY]

GO

CREATE NONCLUSTERED INDEX [IX_Test1] ON [dbo].[TblTest] 
(
    [fkCustomCultureID] ASC,
    [fkCategoryTypeID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
GO


Just going to throw it out there: if you don't always need all the columns, have you considered splitting the table? It should be fine provided you're able to change the table in that way (which, sadly, often isn't the case).

If you split the table into, say, 5 tables of (roughly) 10 columns each and assigned an ID (with a clustered index) to each currently existing row, you could end up scanning up to 5 times less data from disk, provided all the columns a given query needs live in the same table (remember that without a suitable index, a scan always reads every column of every row).

Other than throwing an index on every column, I suggest you look at your queries and try to decide whether some columns are used very seldom and others are used all the time (especially together). When I have done this in the past I saw considerable improvements as a direct result of splitting the table into "Always", "Often", "Seldom" and "Almost Never" tables, along with a "Usually Together" table or two.

It might also help to make this split along common GROUP BY clauses, since it's likely that at least a few columns will appear much more often than others.
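As a rough sketch (all table, column, and ID names below are made up for illustration; the actual split should follow whichever columns your 180 queries use together), the vertical split could look like this:

-- "Always" table: the grouping columns and the aggregated columns used by most queries.
CREATE TABLE dbo.BigTable_Always (
    RowID  bigint NOT NULL PRIMARY KEY CLUSTERED,  -- shared row ID across all split tables
    Col1   int NOT NULL,
    Col2   int NOT NULL,
    Col5   int NOT NULL,
    Col8   int NOT NULL,
    Col30  int NOT NULL,
    Col49  int NOT NULL
);

-- "Seldom" table: the remaining, rarely queried columns.
CREATE TABLE dbo.BigTable_Seldom (
    RowID  bigint NOT NULL PRIMARY KEY CLUSTERED,  -- same ID, join back when needed
    Col10  int NOT NULL,
    Col11  int NOT NULL
    -- ... and so on for the other rarely used columns
);

-- Queries that only touch the "Always" columns now scan a much narrower table;
-- the occasional query that needs everything joins the pieces on RowID.
SELECT a.Col1, COUNT(DISTINCT a.Col5) AS DistinctCol5, SUM(a.Col30) AS SumCol30
FROM dbo.BigTable_Always AS a
GROUP BY a.Col1;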

If you really want better performance, I would also suggest looking at faster disk drives and upgrading the RAM on the SQL Server. Your table should take up about 20 GB of space; how long would it take on the current disk drives to read 20 GB of data off the disk into RAM? That's going to be your lower limit on query execution time (unless the table is always in RAM, in which case even with this much data the queries shouldn't take too long).
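For reference, the 20 GB figure follows from the row count and column widths given in the question, and you can check the real size of your own table with sp_spaceused (the table name below is a placeholder):

-- Back-of-the-envelope size: 100,000,000 rows x 50 int columns x 4 bytes per int
-- = 20,000,000,000 bytes, i.e. roughly 20 GB before row overhead and indexes.

-- Actual allocated size on your server:
EXEC sp_spaceused N'dbo.MyBigTable';   -- placeholder table name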
