How to keep a dynamical histogram?

Is there a known algorithm and data structure for maintaining a dynamical histogram?

Imagine I have a stream of data (x_1, w_1), (x_2, w_2), ... where the x_t are doubles that represent some measured variable and w_t is the associated weight.

I could just do the obvious (pseudo-python code):

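x0, xN = 0.0, 10.0
numbins = 100
delta = (xN - x0) / numbins    # bin width
hist = [[x0 + i * delta, 0.0] for i in range(numbins)]    # [left edge, total weight]
def updateHistogram(x, w):
    k = int((x - x0) / delta)    # find the bin where x belongs (assumes x0 <= x < xN)
    hist[k][1] += w              # accumulate the weight, not just a count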

But this causes problems when I have a continuous stream of data. I don't have the full dataset in hand, and I have to inspect the histogram while the data is still being gathered. And I have no prior knowledge of:

the ideal bin sizes for not ending up with a lot of empty bins,
the range of the data

So I'd like to define the bins dynamically. I could do the stupid thing:

data = []
for x in data_stream:
    data.append(x)
    hist = make_histogram(data)    # rebuilds the whole histogram on every new point

but I guess this will get slow very quickly...

If all the weights were equal, one thing I thought of was storing the data in a sorted array and inserting new data in a way that keeps the array sorted. This way I could have:

import bisect
data = []    # kept sorted at all times
for x in data_stream:
    bisect.insort(data, x)    # insert x while preserving the order
    bins = [data[i * len(data) // numbins] for i in range(numbins)]    # equal-count bin edges

and the count inside each bin would be equal to len(data)/numbins for all bins.

I can't think of a way of including the weights in this, though. Does anyone have a suggestion? (Pointers to C++ libraries that do this would also be welcome.)

EDIT: (for the asked clarification)

The x_t are floating point numbers. To calculate the histogram I must divide the continuous range where the x's belong into a number of bins. So I'll have a sequence of bin edges bin[0], bin[1], etc., and I must determine for which i we have bin[i] <= x < bin[i+1].

This is how you usually do a histogram when you have all the data beforehand. You'd then know the limits max(x) and min(x) and it would be easy to determine adequate bins. You could have them equally spaced between min(x) and max(x), for example.

If you don't know the range beforehand, you can't determine the bins. You could receive an x that doesn't fall in any bin. Or you could end up with many empty bins because you chose too big a range when creating the bins.

How to determine the number of bins

There are a number of rules for determining the number of bins in a histogram. For your problem, I would go with Scott's choice:

bin_width = 3.5*sd*n^{-1/3}

where sd is the standard deviation and n is the number of data points. Crucially, you can use an online algorithm for calculating the standard deviation. The number of bins, k, is given by:

k = ceil((max(x) - min(x))/bin_width)
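
As a rough sketch of how the two formulas above could be kept up to date on a stream, the class below uses Welford's online algorithm for the standard deviation (one possible choice of online algorithm, not something the rule itself prescribes). The names RunningStats and scott are made up for illustration, and the weights are ignored here because the rule only needs n and sd.

```python
import math

class RunningStats:
    """Running n, mean, sd, min and max via Welford's online algorithm."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0            # running sum of squared deviations from the mean
        self.lo = math.inf       # smallest x seen so far
        self.hi = -math.inf      # largest x seen so far

    def add(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)

    def sd(self):
        # Sample standard deviation; defined as 0 until two points are seen.
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0

    def scott(self):
        """Return (bin_width, k) from the formulas above."""
        width = 3.5 * self.sd() * self.n ** (-1.0 / 3.0)
        if width <= 0.0:
            raise ValueError("need at least two distinct values")
        k = math.ceil((self.hi - self.lo) / width)
        return width, k
```

Calling add(x) for every incoming point and scott() whenever the histogram is inspected costs O(1) per point and stores no data.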

Storing the data

Suppose we have observed N data points. Then the confidence interval for the standard deviation is:

Lower limit: sd*sqrt((N-1)/CHIINV((alpha/2), N-1))
Upper limit: sd*sqrt((N-1)/CHIINV(1-(alpha/2), N-1))

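where CHIINV(p, df) is the upper-tail critical value of the chi-squared distribution with df degrees of freedom (as in the spreadsheet function of that name). When N = 1000 and alpha = 0.05, the CI for the sd is: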

(0.96*sd, 1.05*sd)

and so a 95% CI for the bin width is:

(3.5*0.96*sd*1000^{-1/3}, 3.5*1.05*sd*1000^{-1/3})
(0.336*sd, 0.3675*sd)

You can get something similar for the number of bins.
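
To check the numbers above and extend them to the number of bins, here is a small sketch that uses only the standard library. It replaces CHIINV with the Wilson-Hilferty approximation to the chi-squared quantile, which is an assumption on my part and is only accurate for fairly large N; the function names are invented for this example.

```python
import math
from statistics import NormalDist

def chi2_quantile(p, df):
    """Approximate lower-tail chi-squared quantile (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3

def sd_ci_factors(n, alpha=0.05):
    """Multipliers (lower, upper) to apply to sd to get its CI."""
    df = n - 1
    # CHIINV(alpha/2, df) is the lower-tail quantile at 1 - alpha/2.
    lower = math.sqrt(df / chi2_quantile(1.0 - alpha / 2.0, df))
    upper = math.sqrt(df / chi2_quantile(alpha / 2.0, df))
    return lower, upper

def bin_count_ci(n, sd, data_range, alpha=0.05):
    """CI (k_min, k_max) for the number of bins under Scott's rule."""
    lower, upper = sd_ci_factors(n, alpha)
    narrow = 3.5 * lower * sd * n ** (-1.0 / 3.0)
    wide = 3.5 * upper * sd * n ** (-1.0 / 3.0)
    # A wider bin gives fewer bins, so the wide width sets k_min.
    return math.ceil(data_range / wide), math.ceil(data_range / narrow)
```

For n = 1000 the two multipliers come out close to 0.96 and 1.05, in line with the interval quoted above.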

Algorithm

Store all the data until you have a good estimate of the optimal bin width, say when the lower and upper CI limits for the number of bins are equal.
Create the number of bins and put data in bins.
All new data points are put into the bins, then discarded.
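
The three steps above could be sketched roughly as follows. To keep it short, this version freezes the bins after a fixed number of warm-up points (the warmup parameter is my simplification) instead of waiting for the CI limits to agree. The class name and the underflow/overflow counters are also assumptions, added so that later points outside the warm-up range are not silently lost.

```python
import math
import statistics

class DeferredHistogram:
    def __init__(self, warmup=1000):
        self.warmup = warmup
        self.buffer = []         # (x, w) pairs stored until the bins are fixed
        self.counts = None       # per-bin total weight once the bins exist
        self.x0 = None           # left edge of the first bin
        self.width = None
        self.underflow = 0.0
        self.overflow = 0.0

    def add(self, x, w=1.0):
        if self.counts is None:
            self.buffer.append((x, w))       # step 1: store the data
            if len(self.buffer) >= self.warmup:
                self._freeze()
        else:
            self._fill(x, w)                 # step 3: bin it, then forget it

    def _freeze(self):
        # Step 2: fix the bins with Scott's rule and bin the stored data.
        xs = [x for x, _ in self.buffer]
        self.x0 = min(xs)
        self.width = 3.5 * statistics.stdev(xs) * len(xs) ** (-1.0 / 3.0)
        if self.width <= 0.0:
            self.width = 1.0     # degenerate case: all warm-up values identical
        # floor + 1 so that the largest warm-up value lands in the last bin
        k = math.floor((max(xs) - self.x0) / self.width) + 1
        self.counts = [0.0] * k
        for x, w in self.buffer:
            self._fill(x, w)
        self.buffer = []

    def _fill(self, x, w):
        i = math.floor((x - self.x0) / self.width)
        if i < 0:
            self.underflow += w
        elif i >= len(self.counts):
            self.overflow += w
        else:
            self.counts[i] += w
```

Until the warm-up is complete, counts is None, so a caller who needs an early look at the histogram would have to build a temporary one from buffer.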

Comments

The Freedman–Diaconis rule is better for choosing the number of bins, but it involves the interquartile range, which is a bit trickier to calculate online.
Technically, the CI isn't correct when the data is sequential. But if you set a reasonable minimum number of data points to observe, say ~100 or 1000, you should be OK.
This assumes the data all follows the same distribution.
The bin width scales as n^{-1/3}. If you know roughly how many points to expect, e.g. 10^5, 10^6 or 10^7, then you could create smaller bins from the start, with the expectation of changing the bin width in the future.

It sounds as if you want an implementation of the following abstract data type.

insert(x, w): add item x to the collection with weight w
select(p): return the item greater than a p weighted fraction of the items

For example, select(0) returns the minimum, select(0.5) returns the weighted median, and select(1) returns the maximum.

I would implement this ADT in one of two ways. If selection is infrequent, I'd put the data in an array and use a linear-time selection algorithm, for O(1)-time inserts and O(n)-time selects. If selection is frequent, I'd use a binary search tree where each node stores the total weight in its subtree. For example, after

insert(2, 10)
insert(1, 5)
insert(3, 100)
insert(4, 20)

the tree might look like

   2 (135)
  / \
 /   \
1 (5) 4 (120)
     /
    /
   3 (100)

Now, to find the weighted median, multiply 135 by 0.5 and get 67.5 as the desired "index". Starting at the root 2, we find that the left subtree's weight, 5, is less than 67.5, so the item is not in the left subtree and we subtract 5 to obtain 62.5, the index into the remainder of the tree. The root's own weight is 135 - 5 - 120 = 10, which is less than 62.5, so the median isn't 2. We subtract 10 from 62.5 to obtain 52.5 and descend to 4. At 4, we find that the left subtree's weight, 100, is greater than 52.5, so we descend to 3, which is the median.

Assuming a balanced tree, the running time of both insert and select is O(log n). If I were implementing from scratch, I'd probably opt for a splay tree.
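
Here is a minimal sketch of that tree, written as a plain unbalanced binary search tree to keep it short. That is a deliberate simplification: without the balancing (or splaying) step, the O(log n) bound only holds on average for randomly ordered input. The class and attribute names are made up for illustration.

```python
class Node:
    def __init__(self, key, weight):
        self.key = key
        self.weight = weight     # weight of this item alone
        self.total = weight      # total weight of the subtree rooted here
        self.left = None
        self.right = None

class WeightedTree:
    def __init__(self):
        self.root = None

    def insert(self, x, w):
        if self.root is None:
            self.root = Node(x, w)
            return
        node = self.root
        while True:
            node.total += w                  # x ends up somewhere below node
            if x == node.key:
                node.weight += w             # repeated value: merge the weights
                return
            side = "left" if x < node.key else "right"
            child = getattr(node, side)
            if child is None:
                setattr(node, side, Node(x, w))
                return
            node = child

    def select(self, p):
        """Return the item at weighted fraction p, with 0 <= p <= 1."""
        if self.root is None:
            raise ValueError("empty collection")
        node = self.root
        target = p * node.total              # the "index" from the walk-through
        while True:
            left_total = node.left.total if node.left else 0.0
            if node.left is not None and target < left_total:
                node = node.left
            elif target <= left_total + node.weight or node.right is None:
                return node.key
            else:
                target -= left_total + node.weight
                node = node.right
```

With the four inserts from the example, select(0.5) returns 3. For the original question, [tree.select(i / numbins) for i in range(numbins + 1)] would give bin edges that each enclose roughly the same total weight.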

ROOT is the tool used by particle physicists for this kind of work, and it comes with Python bindings. Mind you, it is not a lightweight piece of software.

In C++ you would do something like

TH1D hist("hist","longer title for hist",numbins,lowlimit,highlimit);

...

for (int i=0; i<num; ++i){
   hist.Fill(x[i],w[i]);
}

...

hist.Draw();

ROOT provides no built-in solution to the binning problem; inputs below or above the binned range are added to the underflow and overflow bins.

You can initially set the binning over a wide range and convert to a shorter range at a later time. I think the method is Rebin. All the obvious limitations apply.