inferential statistics in Ruby

This is part mathematics, part Ruby, part statistics question. I'm not sure where to start with something like this, and it may be a much larger topic than I'm prepared for at the moment, but maybe someone can shed some light on how to implement a solution.

Basically, I have a set of integers over time. Say my data looks something like:

{ :count => 20, :timestamp => 1304566372 }
{ :count => 23, :timestamp => 1304566382 }
{ :count => 23, :timestamp => 1304566392 }
{ :count => 24, :timestamp => 1304566402 }
{ :count => 25, :timestamp => 1304566412 }
{ :count => 22, :timestamp => 1304566422 }
{ :count => 12, :timestamp => 1304566432 } # <= outlier 
{ :count => 21, :timestamp => 1304566442 }
{ :count => 20, :timestamp => 1304566452 }

The real data set would be much larger, but this can serve as an example. What I want to do is find the results that differ most from the average. However, the integers will follow a sort of curve, so you can't just average the whole set. Picture something like visitor analytics for a site.

I suppose my question is: using Ruby, can I use math to generalize a curve and find out which items vary furthest from the mean on that segment of the curve?

I'm not the best math guy so I may totally be using the wrong terms to describe this. Thanks a lot for any help or tips at all guys!

Assuming the integer values follow a normal distribution, you might be able to apply the 3-sigma rule (standard deviation) to find the outliers.

Let's say you want to quickly calculate the average and standard deviation of a list of integers. You could enhance Enumerable like so:

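  module Enumerable

    # Note: Ruby 2.4+ already ships Enumerable#sum and Array#sum.
    def sum
      inject(0) { |accum, i| accum + i }
    end

    def mean
      sum / count.to_f
    end

    # Despite the name, this divides by n (population variance);
    # divide by (count - 1) instead for the unbiased sample variance.
    def sample_variance
      m = mean
      squares = inject(0) { |accum, i| accum + (i - m)**2 }
      1 / count.to_f * squares
    end

    def standard_deviation
      Math.sqrt(sample_variance)
    end

  end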

Then you have to decide what the criterion for an outlier is. Under the 3-sigma (68-95-99.7) rule, about 95% of normally distributed values fall within two standard deviations (2 sigma) of the mean, and about 99.7% fall within three. So you could say that any value whose difference from the mean is greater than 2 standard deviations is an outlier.

For example, assuming you collected your count values into an array called a:

a = [ 20, 23, 23, 24, 25, 22, 12, 21, 29 ]
m = a.mean  
# => 22.11111111111111
sd = a.standard_deviation  
# => 4.331908597692872

# select leaves a untouched; keep_if (Ruby 1.9.2+) would filter it in place
a.select { |n| (m - n).abs > (2 * sd) }
# => [12]
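
To get from the hashes in the question to the flagged rows, something like the following might work. It is only a sketch: outlier_rows is a made-up helper name, it computes the mean and standard deviation inline rather than relying on the Enumerable additions above, and it still assumes the counts are roughly normally distributed around a single mean.

```ruby
# Hypothetical helper: returns the rows whose :count lies more than
# `threshold` standard deviations from the mean of all counts.
def outlier_rows(rows, threshold = 2.0)
  counts = rows.map { |row| row[:count] }
  n = counts.length.to_f
  mean = counts.inject(0) { |acc, c| acc + c } / n
  # Population variance (divides by n), matching the code above.
  variance = counts.inject(0.0) { |acc, c| acc + (c - mean)**2 } / n
  sd = Math.sqrt(variance)
  rows.select { |row| (row[:count] - mean).abs > threshold * sd }
end

rows = [
  { :count => 20, :timestamp => 1304566372 },
  { :count => 23, :timestamp => 1304566382 },
  { :count => 23, :timestamp => 1304566392 },
  { :count => 24, :timestamp => 1304566402 },
  { :count => 25, :timestamp => 1304566412 },
  { :count => 22, :timestamp => 1304566422 },
  { :count => 12, :timestamp => 1304566432 },
  { :count => 21, :timestamp => 1304566442 },
  { :count => 20, :timestamp => 1304566452 }
]

flagged = outlier_rows(rows)
# => [{ :count => 12, :timestamp => 1304566432 }]
```

Keeping the whole row rather than just the count means you still know when the outlier happened.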

If you're just looking for a starting point, I'd suggest conducting a literature search [1] for "detecting outliers in time series data". If you can fit an equation of some sort to the data, you can look at how far points lie from the curve. If the system is more complicated and can't be easily modeled, there are a number of strategies you can follow, for example:

1. Just look at the delta in count between data points. In your series the list of deltas is [3,0,1,1,-3,-10,9,-1]. You can look for values that differ from the average of this list by more than a few standard deviations. In effect, you're looking for spikes by looking for large changes in the slope of the line.
2. Look at smaller windows of 3 to 5 or so points, e.g. first look at points 1,2,3, then points 2,3,4, then 3,4,5, etc. This is similar to the first approach, but the algorithm would be a bit different.
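
Both strategies can be sketched in a few lines of Ruby. The helper names (unusual_deltas, window_outliers) and the cut-offs are assumptions made for illustration, not anything prescribed above. In particular, the window version compares the middle point of each window with the window's median, which is only one of several reasonable tests. Note that in a series this short the spike itself inflates the standard deviation of the deltas (about 5.02 here), so a cut-off of 2 standard deviations would flag nothing and the example uses 1.5.

```ruby
counts = [20, 23, 23, 24, 25, 22, 12, 21, 20]

# Strategy 1: differences between consecutive points.
deltas = counts.each_cons(2).map { |a, b| b - a }
# => [3, 0, 1, 1, -3, -10, 9, -1]

# Hypothetical helper: indices of deltas lying more than `threshold`
# standard deviations from the mean delta.
def unusual_deltas(deltas, threshold)
  n = deltas.length.to_f
  mean = deltas.inject(0) { |acc, d| acc + d } / n
  sd = Math.sqrt(deltas.inject(0.0) { |acc, d| acc + (d - mean)**2 } / n)
  (0...deltas.length).select { |i| (deltas[i] - mean).abs > threshold * sd }
end

spikes = unusual_deltas(deltas, 1.5)
# => [5, 6] (the drop to 12 and the jump back up)

# Strategy 2: slide a window over the series and compare the middle
# point with the window's median. `size` must be odd; `tolerance` is an
# arbitrary cut-off in the same units as the counts.
def window_outliers(values, size = 5, tolerance = 5)
  half = size / 2
  flagged = []
  values.each_cons(size).each_with_index do |window, start|
    median = window.sort[half]
    center = window[half]
    flagged << (start + half) if (center - median).abs > tolerance
  end
  flagged
end

dips = window_outliers(counts)
# => [6] (the index of the value 12)
```

The window version never tests the first and last size / 2 points, since they are never in the middle of a window.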

With more information about the nature of the data one could probably pick some sort of optimal algorithm, but quick-and-dirty might be close enough.
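
For the curve-fitting idea, a quick-and-dirty sketch could fit a straight line by ordinary least squares and flag the points that lie far from it. A straight line is just the simplest possible "equation" and is an assumption here. Real visitor data with daily cycles would need a different model, and fit_line and residual_outliers are illustrative names only.

```ruby
# Fit y = slope * x + intercept by ordinary least squares.
def fit_line(xs, ys)
  n = xs.length.to_f
  mean_x = xs.inject(0) { |acc, x| acc + x } / n
  mean_y = ys.inject(0) { |acc, y| acc + y } / n
  sxy = xs.zip(ys).inject(0.0) { |acc, (x, y)| acc + (x - mean_x) * (y - mean_y) }
  sxx = xs.inject(0.0) { |acc, x| acc + (x - mean_x)**2 }
  slope = sxy / sxx
  [slope, mean_y - slope * mean_x]
end

# Indices of points whose distance from the fitted line exceeds
# `threshold` standard deviations of all the residuals.
# Uses the position in the series (0, 1, 2, ...) as x.
def residual_outliers(ys, threshold = 2.0)
  xs = (0...ys.length).to_a
  slope, intercept = fit_line(xs, ys)
  residuals = xs.zip(ys).map { |x, y| y - (slope * x + intercept) }
  # Least-squares residuals average to zero, so their standard
  # deviation is simply their root mean square.
  sd = Math.sqrt(residuals.inject(0.0) { |acc, r| acc + r**2 } / residuals.length)
  xs.select { |i| residuals[i].abs > threshold * sd }
end

counts = [20, 23, 23, 24, 25, 22, 12, 21, 20]
flagged = residual_outliers(counts)
# => [6] (the index of the value 12)
```

Because the residuals are measured against the fitted line rather than a single overall mean, a steady upward or downward trend in the counts does not by itself get flagged.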

[1] This is an old-school term that's just a fancy way of saying "google"