how to get average that ignores outliers?_问答_开发者

say I have a postgresql table with the following values:

id | value
----------
1  | 4
2开发者_Python百科  | 8
3  | 100
4  | 5
5  | 7

If I use postgresql to calculate the average, it gives me an average of 24.8 because the high value of 100 has great impact on the calculation. While in fact I would like to find an average somewhere around 6 and eliminate the extreme(s).

I am looking for a way to eliminate extremes and want to do this "statistically correct". The extreme's cannot be fixed. I cannot say; If a value is over X, it has to be eliminated.

I have been bending my head on the postgresql aggregate functions but cannot put my finger on what is right for me to use. Any suggestions?

Postgresql can also calculate the standard deviation.

You could take only the data points which are in the average() +/- 2*stddev() which would roughly correspond to the 90% datapoints closest to the average.

Of course 2 can also be 3 (95%) or 6 (99.995%) but do not get hung up on the numbers because in the presence of a collection outliers you are no longer dealing with a normal distribution.

Be very careful and validate that it works as expected.

I cannot say; If a value is over X, it has to be eliminated.

Well, you could use having and a subselect to eliminate outliers, something like:

HAVING value < (
 SELECT 2 * avg(value)
 FROM   mytable
 GROUP BY ...
)

(Or, for that matter, use a more complex version to eliminate anything above 2 or 3 standard deviations if you want something that will be better at eliminating only outliers.)

The other option is to look at generating a median value, which is a fairly statistically sound way of accounting for outliers; happily there are three reasonable examples of just that: one from the Postgresql Wiki, one built as an Oracle compatability layer, and another from the PostgreSQL Journal. Note the caveats around how precisely/accurately they implement medians.

Here's an aggregate function which will calculate the trimmed mean for a set of values, excluding values outside N standard deviations from the mean.

Example:

DROP TABLE IF EXISTS foo;
CREATE TEMPORARY TABLE foo (x FLOAT);
INSERT INTO foo VALUES (1);
INSERT INTO foo VALUES (2);
INSERT INTO foo VALUES (3);
INSERT INTO foo VALUES (4);
INSERT INTO foo VALUES (100);

SELECT avg(x), tmean(x, 2), tmean(x, 1.5) FROM foo;

--  avg | tmean | tmean 
-- -----+-------+-------
--   22 |    22 |   2.5

Code:

DROP TYPE IF EXISTS tmean_stype CASCADE;

CREATE TYPE tmean_stype AS (
  deviations FLOAT,
    count INT,
    acc FLOAT,
    acc2 FLOAT,
    vals FLOAT[]
);

CREATE OR REPLACE FUNCTION tmean_sfunc(tmean_stype, float, float)
RETURNS tmean_stype AS $$
    SELECT $3, $1.count + 1, $1.acc + $2, $1.acc2 + ($2 * $2), array_append($1.vals, $2);
$$ LANGUAGE SQL;

CREATE OR REPLACE FUNCTION tmean_finalfunc(tmean_stype)
RETURNS float AS $$
DECLARE
    fcount INT;
    facc FLOAT;
    mean FLOAT;
    stddev FLOAT;
    lbound FLOAT;
    ubound FLOAT;
    val FLOAT;
BEGIN
    mean := $1.acc / $1.count;
    stddev := sqrt(($1.acc2 / $1.count) - (mean * mean));
    lbound := mean - stddev * $1.deviations;
    ubound := mean + stddev * $1.deviations;
    -- RAISE NOTICE 'mean: % stddev: % lbound: % ubound: %', mean, stddev, lbound, ubound;

    fcount := 0;
    facc := 0;
    FOR i IN array_lower($1.vals, 1) .. array_upper($1.vals, 1) LOOP
        val := $1.vals[i];
        IF val >= lbound AND val <= ubound THEN
            fcount := fcount + 1;
            facc := facc + val;
        END IF; 
    END LOOP;

    IF fcount = 0 THEN
        return NULL;
    END IF;
    RETURN facc / fcount;
END;
$$ LANGUAGE plpgsql;

CREATE AGGREGATE tmean(float, float)
(
    SFUNC = tmean_sfunc,
    STYPE = tmean_stype,
    FINALFUNC = tmean_finalfunc,
    INITCOND = '(-1, 0, 0, 0, {})'
);

Gist (which should be identical): https://gist.github.com/4458294

Mind using the ntile window function. It allows you to easily isolate extreme values from the result set.

Let's say you want to cut 10% from both sides of the result set. Then passing the value of 10 to ntile and looking for values between 2 and 9 would give you the desired result. Keep also in mind that if you have less than 10 records, you might accidentally cut more than 20%, so be sure to check the total amount of records as well.

WITH yyy AS (
  SELECT
    id,
    value,
    NTILE(10) OVER (ORDER BY value) AS ntiled,
    COUNT(*) OVER () AS counted
  FROM
    xxx)
SELECT
  *
FROM
  yyy
WHERE
  counted < 10 OR ntiled BETWEEN 2 AND 9;

You can use IQR to filter outliers. PL/pgSQL code:

select percentile_cont(0.25) WITHIN GROUP (ORDER BY value)
  into q1
  from table;
select percentile_cont(0.75) WITHIN GROUP (ORDER BY value)
  into q3
  from table;

iqr := q3 - q1;
min := q1 - 1.5 * iqr;
max := q3 + 1.5 * iqr;

select value
  into result
  from table
  where value >= min and value <= max;
return result;