I am working on a time series of nume开发者_开发技巧ric values, such as those produced by a temperature sensor. I'd like to filter those values, roughly selecting those samples that form e.g. the top 10% of the received values.
The obvious solution of recording all samples and using any well-known algorithm for the extraction of the k-highest values is not possible in my case for two reasons:
The series may be infinite, memory is definitely not.
I'd like this algorithm to be usable in real-time, or at least in a streaming mode with predetermined latency.
The distribution of the values is not normal, nor is it consistent with any well-known distribution that I know of. Metrics that I already have available at any time include the mean, the variance and the skewness of the values that have already been received.
Unlike this question, I do not need perfect accuracy, although I would like to be able to tune the parameters of the selection algorithm.
I believe something similar is used in single-pass variable bit-rate (VBR) media codecs to allocate the available bandwidth to each frame, by determining the number of available bits. Unfortunately all the VBR algorithms I studied are too focused on DSP and media streams for me to understand and/or implement.
Are there any known algorithms that could help me deal with this issue? Any hints that would orient me towards the right direction would be greatly appreciated.
If you decide you are only interested in the last 10N items you can use two heaps, one of size N and one of size 9N, to keep track of the N highest items in the last 10N. When you see a new item first remove the oldest item. If it came from the small heap, take the largest item from the large heap and put it in the small. Now look at the new item and either put it straight in the large heap or take the smallest item from the small heap and transfer it to the large one before putting the new item in the small heap.
At any time you have a small heap full of high items and a large heap full of low items, and you know whether the latest item was in the top 10% of those 10N.
But is this really what you want? Note that if your samples rise steadily and then fall steadily over a period of time much larger than your 10N samples then nearly half the time the latest item will be in the top 10% - in fact it will be the largest item seen in the memory of 10N items.
There have been academic papers on finding approximaste quantiles of streaming data. One such is "Effective Computation of Biased Quantiles over Data Streams", by Cormode, Korn, Muthukrishnan, and Srivastava
You cannot get high accuracy without storing the whole stream and knowing something about the distribution.
Imagine a stream where the values are sorted in descending order. The 10% head of the list should be your answer but you have no idea how long the stream is so you have to store the whole stream because at any point it's possible that you processed less than 10% of the whole stream.
If you have to store the whole stream you don't looking for an online algorithm anymore.
Now if you know something about the distribution then perhaps you can do something better.. A naive algorithm would be to chunk the stream to slices and calculate the top 10% of each slice.
精彩评论