I am looking for an algorithm to generate a histogram over a lar开发者_如何学运维ge amount of streaming data, the max and min are not known in advance but standard deviation and mean are in a particular range.
I appreciate your ideas.
Cheers,
I just found one solution. Sec. 2.2 of "On-line histogram building from A streaming parallel decision tree algorithm" paper. The algo is implemented by NumericHistogram class in Hive project :
A generic, re-usable histogram class that supports partial aggregations. The algorithm is a heuristic adapted from the following paper: Yael Ben-Haim and Elad Tom-Tov, "A streaming parallel decision tree algorithm", J. Machine Learning Research 11 (2010), pp. 849--872. Although there are no approximation guarantees, it appears to work well with adequate data and a large (e.g., 20-80) number of histogram bins.
I use a package called "GoHistogram" which provides two streaming approximattion histograms (NumericHistogram and Weighted Numeric Histogram). It is implemented in Golang (https://code.google.com). Here is the link:
https://github.com/VividCortex/gohistogram
Standard deviation and mean do not matter for a histogram. Simply choose your resolution and draw a bar as high as you have hits for its range. This will, of course, get more expensive with a higher resolution. You can try adjusting the resolution by trying to fit the existing data into a normal curve (or whatever model you like) and finding the standard deviation to choose a reasonable granularity.
Edit: Read it wrong the first time around. If you know the approximate standard deviation, you can choose reasonable sizes for your histogram groups from the get-go. Just compare every new entry to your current min and max and adjust your range accordingly.
精彩评论