I am using Perl to model a random variable (Y
) which is the sum of some ~15-40k independent Bernoulli rand开发者_如何学Goom variables (X_i
), each with a different success probability (p_i
). Formally, Y=Sum{X_i}
where Pr(X_i=1)=p_i
and Pr(X_i=0)=1-p_i
.
I am interested in quickly answering queries such as Pr(Y<=k)
(where k
is given).
Currently, I use random simulations to answer such queries. I randomly draw each X_i
according to its p_i
, then sum all X_i
values to get Y'
. I repeat this process a few thousand times and return the fraction of times Pr(Y'<=k)
.
Obviously, this is not totally accurate, although accuracy greatly increases as the number of simulations I use increases.
Can you think of a reasonable way to get the exact probability?
First, I would avoid using the rand
built-in for this purpose which is too dependent on the underlying C library implementation to be reliable (see, for example, my blog post pointing out that the range of rand
on Windows has cardinality 32,768).
To use the Monte-Carlo approach, I would start with a known good random generator, such as Rand::MersenneTwister or just use one of Random.org's services and pre-compute a CDF for Y
assuming Y
is pretty stable. If each Y
is only used once, pre-computing the CDF is obviously pointless.
To quote Wikipedia:
In probability theory and statistics, the Poisson binomial distribution is the discrete probability distribution of a sum of independent Bernoulli trials.
In other words, it is the probability distribution of the number of successes in a sequence of n independent yes/no experiments with success probabilities p1, …, pn. (emphasis mine)
Closed-Form Expression for the Poisson-Binomial Probability Density Function might be of interest. The article is behind a paywall:
and we discuss several of its advantages regarding computing speed and implementation and in simplifying analysis, with examples of the latter including the computation of moments and the development of new trigonometric identities for the binomial coefficient and the binomial cumulative distribution function (cdf).
As far as I recall, shouldn't this end up asymptotically as a normal distribution? See also this newsgroup thread: http://newsgroups.derkeiler.com/Archive/Sci/sci.stat.consult/2008-05/msg00146.html
If so, you can use Statistics::Distrib::Normal.
To obtain the exact solution you can exploit the fact that the probability distribution of the sum of two or more independent random variables is the convolution of their individual distributions. Convolution is a bit expensive but must be calculated only if the p_i
change.
Once you have the probability distribution, you can easily obtain the CDF by calculating the cumulative sum of the probabilities.
精彩评论