incremental way of computing quantiles for a large set of data

I need to compute quantiles for a large set of data.

Let's assume we can get the data only in portions (e.g. one row of a large matrix at a time). To compute the Q3 quantile, one needs to collect all the portions of the data, store them somewhere, then sort them and pick the quantile:

List<double> allData = new List<double>();
// This is only an example; the portions of data are not really rows of some matrix
foreach(var row in matrix) 
{
    allData.AddRange(row);
}

allData.Sort();
double p = 0.75 * allData.Count;
int idQ3 = (int)Math.Ceiling(p) - 1;
double Q3 = allData[idQ3];

I would like to find a way of obtaining the quantile without storing the data in an intermediate variable. The best solution would be to compute some intermediate statistics from the first row and then adjust them step by step for the next rows.

Note:

  • These datasets are really big (ca. 5000 elements in each row).
  • Q3 can be estimated; it doesn't have to be an exact value.
  • I call the portions of data "rows", but they can have different lengths! Usually the variation is small (+/- a few hundred samples), but it varies!

This question is similar to “On-line” (iterator) algorithms for estimating statistical median, mode, skewness, kurtosis, but I need to compute quantiles.

Also, there are a few articles on this topic, e.g.:

  • An Efficient Algorithm for the Approximate Median Selection Problem
  • Incremental quantile estimation for massive tracking

Before trying to implement these approaches, I wondered whether there are any other, quicker ways of estimating the 0.25/0.75 quantiles?


I second the idea of using buckets. Don't limit yourself to 100 buckets - might as well use 1 million. The tricky part is to pick your bucket ranges so that everything doesn't end up in a single bucket. Probably the best way to estimate your bucket ranges is to take a reasonable random sample of your data, compute the 10% and 90% quantiles using the simple sort algorithm, then generate equal-sized buckets to fill that range. It isn't perfect, but if your data isn't from a super-weird distribution, it should work.

If you can't take random samples, you're in more trouble. You can pick an initial bucketing guess based on your expected data distribution; then, while working through your data, if any bucket (typically the first or last) gets overfull, start over with a new bucket range.
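
For illustration, here's a minimal C# sketch of picking the bucket range from a random sample, using the same quantile-index convention as the question's code; SampleRange and its signature are hypothetical, not an existing API.

using System;

static class SampleRange
{
    // Sort a copy of the random sample and read off its 10% and 90%
    // quantiles; these become the bucket range [lo, hi].
    public static (double Lo, double Hi) Estimate(double[] sample)
    {
        var sorted = (double[])sample.Clone();
        Array.Sort(sorted);
        double p10 = sorted[(int)Math.Ceiling(0.10 * sorted.Length) - 1];
        double p90 = sorted[(int)Math.Ceiling(0.90 * sorted.Length) - 1];
        return (p10, p90);
    }
}

Equal-sized buckets then cover [lo, hi], with one underflow and one overflow bucket catching everything outside; if either of those gets overfull, restart with a wider range as described above.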


There is a more recent and much simpler algorithm for this that provides very good estimates of the extreme quantiles.

The basic idea is that smaller bins are used at the extremes in a way that both bounds the size of the data structure and guarantees higher accuracy for small or large q. The algorithm is available in several languages and many packages. The MergingDigest version requires no dynamic allocation ... once the MergingDigest is instantiated, no further heap allocation is required.

See https://github.com/tdunning/t-digest


  1. Only retrieve the data you really need -- i.e., whatever value(s) is/are being used as the key for sorting, not everything else associated with it.
  2. You can probably use Tony Hoare's Select algorithm (quickselect) to find your quantile more quickly than by sorting all the data; see the sketch after this list.
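
For illustration, a minimal C# sketch of Hoare's selection algorithm (quickselect), which finds the k-th smallest element in expected linear time by partitioning the array in place; the class name and the random-pivot choice are my own, not from a library.

using System;

static class Quickselect
{
    static readonly Random Rng = new Random();

    // Returns the k-th smallest element (k is 0-based); mutates the array.
    public static double Select(double[] a, int k)
    {
        int lo = 0, hi = a.Length - 1;
        while (lo < hi)
        {
            int p = Partition(a, lo, hi);
            if (k == p) return a[k];
            if (k < p) hi = p - 1;
            else lo = p + 1;
        }
        return a[k];
    }

    static int Partition(double[] a, int lo, int hi)
    {
        // A random pivot avoids the O(n^2) worst case on sorted input.
        Swap(a, lo + Rng.Next(hi - lo + 1), hi);
        double pivot = a[hi];
        int store = lo;
        for (int i = lo; i < hi; i++)
            if (a[i] < pivot) Swap(a, i, store++);
        Swap(a, store, hi);
        return store;
    }

    static void Swap(double[] a, int i, int j)
    {
        double t = a[i]; a[i] = a[j]; a[j] = t;
    }
}

With the question's index convention, Q3 would be Quickselect.Select(allData.ToArray(), (int)Math.Ceiling(0.75 * allData.Count) - 1).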


If your data has a Gaussian distribution, you can estimate the quantiles from the standard deviation. I assume your data isn't Gaussian distributed or you'd just be using the SD anyway.

If you can pass through your data twice, I'd do the following:

  • First pass, compute the max, min, SD and mean.
  • Second pass, divide the range [min, max] into some number of buckets (e.g. 100), or do the same for (mean - 2*SD, mean + 2*SD) with extra buckets for the outliers beyond that range; then run through the data again, tossing numbers into these buckets.
  • Count buckets until you are at 25% and 75% of the data. If you want to get extra fancy, you can interpolate between bucket values (i.e. if you need 10% of a bucket to hit your 25th percentile, assume the value is 10% of the way from the lower bound to the upper bound).

This should give you a pretty good linear-time algorithm that works okay for most sets of not-entirely-perverse data; a sketch follows below.
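
Here's a C# sketch of this two-pass scheme, using the (mean - 2*SD, mean + 2*SD) bucketing variant with two outlier buckets and the linear interpolation described above; TwoPassQuantile is an illustrative name, and the details (bucket count, outlier handling) are assumptions, not a definitive implementation.

using System;
using System.Collections.Generic;
using System.Linq;

static class TwoPassQuantile
{
    public static double Estimate(IReadOnlyList<double> data, double q, int bucketCount = 100)
    {
        // First pass: mean and standard deviation.
        double mean = data.Average();
        double sd = Math.Sqrt(data.Sum(x => (x - mean) * (x - mean)) / data.Count);

        // Buckets cover (mean - 2*SD, mean + 2*SD); index 0 and
        // bucketCount + 1 catch the outliers below and above.
        double lo = mean - 2 * sd, hi = mean + 2 * sd;
        double width = (hi - lo) / bucketCount;
        var buckets = new long[bucketCount + 2];

        // Second pass: toss each number into its bucket.
        foreach (double x in data)
        {
            if (x < lo) buckets[0]++;
            else if (x >= hi) buckets[bucketCount + 1]++;
            else buckets[1 + (int)((x - lo) / width)]++;
        }

        // Count buckets until we pass the q-th fraction of the data,
        // then interpolate linearly inside the bucket we stopped in.
        double target = q * data.Count;
        long seen = 0;
        for (int i = 0; i < buckets.Length; i++)
        {
            if (seen + buckets[i] >= target)
            {
                if (i == 0) return lo;               // outlier bucket: no finer info
                if (i == bucketCount + 1) return hi;
                double bucketLo = lo + (i - 1) * width;
                return bucketLo + width * (target - seen) / buckets[i];
            }
            seen += buckets[i];
        }
        return hi; // unreachable for non-empty data
    }
}

Usage would then be TwoPassQuantile.Estimate(allData, 0.25) and TwoPassQuantile.Estimate(allData, 0.75).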


Inspired by this answer, I created a method that estimates the quantiles quite well. The approximation is close enough for my purposes.

The idea is the following: the 0.75 quantile is in fact the median of all values that lie above the global median, and respectively, the 0.25 quantile is the median of all values below the global median.

So if we can approximate the median, we can approximate the quantiles in a similar way.

double median = 0;
double q1 = 0;
double q3 = 0;
double eta = 0.005; // step size of the stochastic update

// First pass: approximate the global median.
foreach (var value in listOfValues) // or a stream, or any other large set of data...
{
    median += eta * Math.Sign(value - median);
}

// Second pass: we know the median, so we can estimate the quantiles.
foreach (var value in listOfValues)
{
    if (value < median)
        q1 += eta * Math.Sign(value - q1);
    else
        q3 += eta * Math.Sign(value - q3);
}

Remarks:

  • If the distribution of your data is strange, you will need a bigger eta to fit the strange data, but the accuracy will be worse.
  • If the distribution is strange but you know the total size of your collection (i.e. N), you can adjust the eta parameter in this way: at the beginning, set eta to some relatively large value (e.g. 0.2); as the loop progresses, lower it so that by the time you reach the end of the collection, eta is almost 0 (for example, compute it in the loop as eta = 0.2 - 0.2 * (i / N)). See the sketch after this list.
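
For concreteness, a sketch of that decaying-eta variant for the median pass, assuming N (the total number of values) is known in advance; note the cast to double, since integer division would make i / N zero for every i < N.

double median = 0;
int i = 0;
foreach (var value in listOfValues)
{
    // eta decays linearly from 0.2 towards 0 over the whole pass.
    double eta = 0.2 - 0.2 * ((double)i / N);
    median += eta * Math.Sign(value - median);
    i++;
}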


q-digest is an approximate online algorithm that lets you compute quantiles: http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf

Here is an implementation:

https://github.com/airlift/airlift/blob/master/stats/src/main/java/io/airlift/stats/QuantileDigest.java
