I have the following SQL table -
Date StoreNo Sales
23/4 34 4323.00
23/4 23 564.00
24/4 34 2345.00
etc
I am running a query that returns average sales, max sales and min sales for a certain period -
select avg(Sales), max(sales), min(sales)
from tbl_sales
where date between etc
But there are some values coming through in the min and max that are really extreme - perhaps because the data entry was bad, perhaps because some anomoly had occurred on that date and store.
What I'd like is a que开发者_如何学运维ry that returns average, max and min, but somehow excludes the extreme values. I am open to how this is done, but perhaps it would use standard deviations in some way (for example, only using data within x std devs of the true average).
Many thanks
In order to calculate the standard deviation, you need to iterate through all of the elements, so it would be impossible to do this in one query. The lazy way would be to just do it in two passes:
DECLARE
@Avg int,
@StDev int
SELECT @Avg = AVG(Sales), @StDev = STDEV(Sales)
FROM tbl_sales
WHERE ...
SELECT AVG(Sales) AS AvgSales, MAX(Sales) AS MaxSales, MIN(Sales) AS MinSales
FROM tbl_sales
WHERE ...
AND Sales >= @Avg - @StDev * 3
AND Sales <= @Avg + @StDev * 3
Another simple option that might work (fairly common in analysis of scientific data) would be to just drop the minimum and maximum x values, which works if you have a lot of data to process. You can use ROW_NUMBER
to do this in one statement:
WITH OrderedValues AS
(
SELECT
Sales,
ROW_NUMBER() OVER (ORDER BY Sales) AS RowNumAsc,
ROW_NUMBER() OVER (ORDER BY Sales DESC) AS RowNumDesc
)
SELECT ...
FROM tbl_sales
WHERE ...
AND Sales >
(
SELECT MAX(Sales)
FROM OrderedValues
WHERE RowNumAsc <= @ElementsToDiscard
)
AND Sales <
(
SELECT MIN(Sales)
FROM OrderedValues
WHERE RowNumDesc <= @ElementsToDiscard
)
Replace ROW_NUMBER
with RANK
or DENSE_RANK
if you want to discard a certain number of unique values.
Beyond these simple tricks you start to get into some pretty heavy stats. I have to deal with similar kinds of validation and it's far too much material for a SO post. There are a hundred different algorithms that you can tweak in a dozen different ways. I would try to keep it simple if possible!
Expanding on DuffyMo's post you could do something like
With SalesStats As
(
Select Sales, NTILE( 100 ) OVER ( Order By Sales ) As NtileNum
From tbl_Sales
)
Select Avg( Sales ), Max( Sales ), Min( Sales )
From SalesStats
Where NtileNum Between 5 And 95
This will exclude the lowest 5% and highest 95%. If you have numbers that vary wildly, you may find that the Average isn't a quality summary statistic and should consider using median. You can do that by doing something like:
With SalesStats As
(
Select NTILE( 100 ) OVER ( Order By Sales ) As NtileNum
, ROW_NUMBER() OVER ( Order By Id ) As RowNum
From tbl_Sales
)
, TotalSalesRows
(
Select COUNT(*) As Total
From tbl_Sales
)
, Median As
(
Select Sales
From SalesStats
Cross Join TotalSalesRows
Where RowNum In ( (TotalRows.Total + 1) / 2, (TotalRows.Total + 2) / 2 )
)
Select Avg( Sales ), Max( Sales ), Min( Sales ), Median.Sales
From SalesStats
Cross Join Median
Where NtileNum Between 5 And 95
Maybe what you're looking for are percentiles.
Standard deviation tends to be sensitive to outliers, since it's calculated using the square of the difference between a value and the mean.
Maybe a more robust, less sensitive measure like absolute value of difference between a value and the mean would be more appropriate in your case.
精彩评论