This is part math, part Ruby, part statistics question. I'm not sure where to start, and it may be a much larger problem than I'm prepared for at the moment, but maybe someone can shed some light on how to implement a solution.
Basically, I have a set of integers over time. Say my hashes look something like:
{ :count => 20, :timestamp => 1304566372 }
{ :count => 23, :timestamp => 1304566382 }
{ :count => 23, :timestamp => 1304566392 }
{ :count => 24, :timestamp => 1304566402 }
{ :count => 25, :timestamp => 1304566412 }
{ :count => 22, :timestamp => 1304566422 }
{ :count => 12, :timestamp => 1304566432 } # <= outlier
{ :count => 21, :timestamp => 1304566442 }
{ :count => 20, :timestamp => 1304566452 }
The real data set would be much larger, but this can serve as an example. What I want to do is find the results that differ most from the average. However, the integers follow a sort of curve, so you can't just average the whole set. Picture something like visitor analytics for a site.
I suppose my question is: using Ruby, can I use math to generalize a curve and find out which items vary furthest from the mean on that segment of the curve?
I'm not the best math guy so I may totally be using the wrong terms to describe this. Thanks a lot for any help or tips at all guys!
Assuming the integer values follow a normal distribution, you might be able to apply the 3-sigma rule (standard deviation) to find the outliers.
Let's say you want to quickly calculate the average and standard deviation of a list of integers. You could enhance Enumerable like so:
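module Enumerable
  # Ruby 2.4+ already provides sum; this definition is for older versions.
  def sum
    inject(0) { |accum, i| accum + i }
  end

  # Arithmetic mean; count is used because not every Enumerable has length.
  def mean
    sum / count.to_f
  end

  # Note: this divides by n, not n - 1, so strictly speaking it is the
  # population variance.
  def sample_variance
    m = mean
    total = inject(0) { |accum, i| accum + (i - m)**2 }
    total / count.to_f
  end

  def standard_deviation
    Math.sqrt(sample_variance)
  end
end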
Then you would have to decide what the criterion for an outlier is. Under the 3-sigma rule (also known as the 68-95-99.7 rule), about 95% of normally distributed values fall within two standard deviations (2 sigma) of the mean. So you could say that any value whose difference from the mean is greater than 2 standard deviations is an outlier.
For example, assuming you collected your count values into an array called a:
a = [20, 23, 23, 24, 25, 22, 12, 21, 29]
m = a.mean
# => 22.11111111111111
sd = a.standard_deviation
# => 4.331908597692872
# keep_if requires Ruby 1.9.2 or later and modifies a in place
a.keep_if { |n| (m - n).abs > (2 * sd) }
# => [12]
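If you would rather keep the timestamps and avoid patching Enumerable, the same 2-sigma check could be sketched as a standalone method. This is only a rough sketch: the name sigma_outliers and its threshold parameter are made up for illustration, it assumes Ruby 2.4+ for the built-in sum, and like the code above it divides by n rather than n - 1.

```ruby
# Hypothetical helper (not from any library): returns the records whose
# :count lies more than `threshold` standard deviations from the mean.
# Assumes Ruby 2.4+ (built-in sum) and a non-empty array of hashes.
def sigma_outliers(records, threshold = 2.0)
  counts = records.map { |r| r[:count] }
  mean = counts.sum / counts.size.to_f
  # Variance computed by dividing by n, as in the Enumerable version above.
  variance = counts.sum { |c| (c - mean)**2 } / counts.size
  sd = Math.sqrt(variance)
  records.select { |r| (r[:count] - mean).abs > threshold * sd }
end

readings = [
  { :count => 20, :timestamp => 1304566372 },
  { :count => 23, :timestamp => 1304566382 },
  { :count => 12, :timestamp => 1304566392 }
]
sigma_outliers(readings, 1.0).each do |r|
  puts "count #{r[:count]} at #{r[:timestamp]} looks unusual"
end
```

Because select returns a new array, the original records are left untouched, unlike keep_if.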
If you're just looking for a starting point, I'd suggest conducting a literature search [1] for "detecting outliers in time series data". If you can fit an equation of some sort to the data, you can look at how far points lie from the curve. If the system is more complicated and can't be easily modeled, there are a number of strategies you can follow, for example...
1. Just look at the delta in count between data points. In your series the list of deltas is [3,0,1,1,-3,-10,9,-1]. You can look for values that exceed the average of this list by more than a few standard deviations. In effect, you're looking for spikes by looking for large changes in the slope of the line.
2. Look at smaller windows of 3 to 5 or so points, e.g. first look at points 1,2,3 then points 2,3,4, then 3,4,5, etc. This is similar to the first approach, but the algorithm would be a bit different.
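The delta idea might look roughly like this in Ruby. Treat it as a sketch under assumptions: delta_spikes is a made-up name, the default threshold of 1.5 standard deviations is an arbitrary starting point you would need to tune, and Ruby 2.4+ is assumed for the built-in sum. On a series this short, the spike itself inflates the standard deviation, so a stricter threshold can easily miss it.

```ruby
# Hypothetical helper: flags jumps between consecutive counts that lie more
# than `threshold` standard deviations from the mean jump.
# Returns indices i, where i refers to the jump from counts[i] to counts[i + 1].
def delta_spikes(counts, threshold = 1.5)
  deltas = counts.each_cons(2).map { |a, b| b - a }
  return [] if deltas.empty?

  mean = deltas.sum / deltas.size.to_f
  sd = Math.sqrt(deltas.sum { |d| (d - mean)**2 } / deltas.size)
  deltas.each_index.select { |i| (deltas[i] - mean).abs > threshold * sd }
end

counts = [20, 23, 23, 24, 25, 22, 12, 21, 20]
delta_spikes(counts).each do |i|
  puts "jump of #{counts[i + 1] - counts[i]} between positions #{i} and #{i + 1}"
end
```

A single bad reading tends to show up as two large jumps in opposite directions, one into the point and one out of it, so the point shared by both flagged jumps is the likely outlier.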
With more information about the nature of the data one could probably pick some sort of optimal algorithm, but quick-and-dirty might be close enough.
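As one quick-and-dirty take on the sliding-window idea, here is a sketch that compares the middle point of each window against its neighbors. The answer doesn't spell out the exact algorithm, so everything here is an assumption: the name window_outliers, the odd window size of 5, the threshold of 3 standard deviations, and the choice to test only the center point. It also assumes Ruby 2.4+ for sum.

```ruby
# Hypothetical helper: slides a window of `size` points (odd, at least 3)
# over the counts and flags the center point when it sits more than
# `threshold` standard deviations from the mean of the other points in
# that window. Returns the indices of the flagged points.
def window_outliers(counts, size = 5, threshold = 3.0)
  half = size / 2
  flagged = []
  counts.each_cons(size).with_index do |window, start|
    center = window[half]
    # Neighbors only: the center is excluded so it cannot mask itself.
    others = window[0...half] + window[(half + 1)..-1]
    mean = others.sum / others.size.to_f
    sd = Math.sqrt(others.sum { |c| (c - mean)**2 } / others.size)
    flagged << start + half if (center - mean).abs > threshold * sd
  end
  flagged
end

counts = [20, 23, 23, 24, 25, 22, 12, 21, 20]
window_outliers(counts).each do |i|
  puts "position #{i} (count #{counts[i]}) stands out from its neighbors"
end
```

Two caveats with this sketch: the first and last few points never sit in the center of a window, so they are not checked, and when all neighbors are identical any difference at all gets flagged, so you may want a minimum absolute difference as well.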
[1] This is an old-school term that's just a fancy way of saying "google"