I have an iterator of numbers, for example a file object:
f = open("datafile.dat")
now I want to compute:
mean = get_mean(f)
sigma = get_sigma(f, mean)
What is the best imp开发者_如何学JAVAlementation? Suppose that the file is big and I would like to avoid to read it twice.
If you want to iterate once, you can write your sum function:
def mysum(l):
s2 = 0
s = 0
for e in l:
s += e
s2 += e * e
return (s, s2)
and use the result in your sigma
function.
Edit: now you can calculate the variance like this: (s2 - (s*s) / N) / N
By taking account of @Adam Bowen's comment,
keep in mind that if we use mathematical tricks and transform the original formulas
we may degrade the results.
I think Nick D has the correct answer.
Assuming you want to compute both mean and variance in one sweep of the file (and you don't really want two functions that have to be called one after the other), you can collect the sum of the values and of their squares and them use such sums (toghether with the number of read elements) to compute at the same time mean and variance.
There are some numerical stability issues, but the idea in
http://en.wikipedia.org/wiki/Computational_formula_for_the_variance
is the basic ingredient you need. Some more details are at
http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance
where I suggest you to read the "Naïve algorithm".
Hope this helps,
Massimo
You can compute both in one pass. See:
http://www.johndcook.com/standard_deviation.html
Make a list from the iterable, or use itertools.tee()
.
I am not sure there is much choice.
You will have to iterate your numbers twice in any case as the standard deviation will require the mean information on each value.
If you have enough memory, you can gain on the I/O access by loading your file in memory during the first iteration but that is about it IMO.
As I feel that there are good elements scattered in multiple answers, I would like to summarize:
If your file is too big to conveniently fit in memory, and if you want a good precision in the variance, you do need to read the file twice (with one pass, the variance is the difference between two large numbers, which is not precise because of floating point limitations). Note that your operating system is likely to provide some automatic speed-up for the second file reading, as it may still be in RAM during the second pass.
If you do not care for the precision of the variance, you can simply iterate once over the file and calculate the quantities suggested by Nick D, with the details provided in the comment by Adam Bowen.
You have two solutions
Make a list out of your iterator and loop it as many time as you wish. Drawback is everything will be in memory, so not suitable if your file is big. Simple use of itertools.tee also will not save you
There is no other solution , unless , you do not need to pass output of get_mean to get_sigma, because in that case they can only be in series, but if you remove this restriction then you can run both functions in parallel using threads, and use itertools.tee to have two iterators from one
You can use map reduce in an elegant fashion way
sample is the list you want to get its variance
sample = [a,b,c, ...]
mean = float(reduce(lambda x,y : x+y, sample)) / len(sample)
variance = reduce(lambda x,y: x+y, map(lambda xi: (xi-mean)**2, sample))/ len(sample)
In a succinct line of code:
variance = reduce(lambda x,y: x+y, map(lambda xi: (xi-(float(reduce(lambda x,y : x+y, sample)) / len(sample)))**2, sample))/ len(sample)
精彩评论