paqmo - 1 month ago 7
R Question

# Enumerate all possible combined probabilities of a series of Bernoulli trials with different probabilities

Suppose I have a series of n probabilities for success of independent Bernoulli trials, p1 to pn such that p1 != p2 != ... != pn. Give each trial a unique name.

``````    p <- c(0.5, 0.12, 0.7, 0.8, .02)
a <- c("A","B","C","D","E")
``````

I know from searching stack exchange (e.g., here and here) that I can find the cdf, pmf, etc. using the Poisson Binomial distribution function.

What I'm interested in is the exact probability of every possible combination of success and failures. (E.g. If I drew a probability tree, I want to know the probability at the end of each branch.)

``````    all <- prod(p)
all
[1] 0.000672
o1 <- (0.5 * (1-0.12) * 0.7 * 0.8 * .02)
o1
[1] 0.004928
o2 <- (0.5 * 0.12 * (1-0.7) * 0.8 * .02)
o2
[1] 0.000288
``````

...for all 2^5 possible combinations of success/failure.

In the case of my actual data set, the number of trials is 19, so we're talking about 2^19 total paths on the probability tree.

The key to making this computation fast is to do it in log-probability space so that the product for each branch of the tree is a sum that can be computed as the inner sum of a matrix multiply. In this manner, all the branches can be computed together in vectorized fashion.

First, we construct an enumeration of all branches. For this, we use the `intToBin` function from `R.utils` package:

``````library(R.utils)
enum.branches <- unlist(strsplit(intToBin(seq_len(2^n)-1),split=""))
``````

where `n` is the number of Bernoulli variables. For your example, `n=5`:

``````matrix(enum.branches, nrow=n)
##     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
##[1,] "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0"   "0"   "0"   "0"   "0"   "0"   "0"   "1"
##[2,] "0"  "0"  "0"  "0"  "0"  "0"  "0"  "0"  "1"  "1"   "1"   "1"   "1"   "1"   "1"   "1"   "0"
##[3,] "0"  "0"  "0"  "0"  "1"  "1"  "1"  "1"  "0"  "0"   "0"   "0"   "1"   "1"   "1"   "1"   "0"
##[4,] "0"  "0"  "1"  "1"  "0"  "0"  "1"  "1"  "0"  "0"   "1"   "1"   "0"   "0"   "1"   "1"   "0"
##[5,] "0"  "1"  "0"  "1"  "0"  "1"  "0"  "1"  "0"  "1"   "0"   "1"   "0"   "1"   "0"   "1"   "0"
##     [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32]
##[1,] "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"
##[2,] "0"   "0"   "0"   "0"   "0"   "0"   "0"   "1"   "1"   "1"   "1"   "1"   "1"   "1"   "1"
##[3,] "0"   "0"   "0"   "1"   "1"   "1"   "1"   "0"   "0"   "0"   "0"   "1"   "1"   "1"   "1"
##[4,] "0"   "1"   "1"   "0"   "0"   "1"   "1"   "0"   "0"   "1"   "1"   "0"   "0"   "1"   "1"
##[5,] "1"   "0"   "1"   "0"   "1"   "0"   "1"   "0"   "1"   "0"   "1"   "0"   "1"   "0"   "1"
``````

results in a matrix where each column is the outcomes from a branch of the probability tree.

Now, use that to construct a matrix of log probabilities of the same size as `enum.branches` where the value is `log(p)` if `enum.branches=="1"` and `log(1-p)` otherwise. For your data, with `p <- c(0.5, 0.12, 0.7, 0.8, .02)`, this is:

``````logp <- matrix(ifelse(enum.branches == "1", rep(log(p), 2^n), rep(log(1-p), 2^n)), nrow=n)
``````

Then, sum the log-probabilities and take the exponential to get the product of the probabilities:

``````result <- exp(rep(1,n) %*% logp)
##         [,1]     [,2]     [,3]     [,4]     [,5]     [,6]     [,7]     [,8]     [,9]   [,10]
##[1,] 0.025872 0.000528 0.103488 0.002112 0.060368 0.001232 0.241472 0.004928 0.003528 7.2e-05
[,11]    [,12]    [,13]    [,14]    [,15]    [,16]    [,17]    [,18]    [,19]    [,20]
##[1,] 0.014112 0.000288 0.008232 0.000168 0.032928 0.000672 0.025872 0.000528 0.103488 0.002112
[,21]    [,22]    [,23]    [,24]    [,25]   [,26]    [,27]    [,28]    [,29]    [,30]
##[1,] 0.060368 0.001232 0.241472 0.004928 0.003528 7.2e-05 0.014112 0.000288 0.008232 0.000168
[,31]    [,32]
##[1,] 0.032928 0.000672
``````

The `result` will be in the same order as the numeration of branches in `enum.branches`.

We can encapsulate the computation into a function:

``````enum.prob.product <- function(n, p) {
enum.branches <- unlist(strsplit(intToBin(seq_len(2^n)-1),split=""))
exp(rep(1,n) %*% matrix(ifelse(enum.branches == "1", rep(log(p), 2^n), rep(log(1-p), 2^n)), nrow=n))
}
``````

Timing this with `19` independent Bernoulli variables:

``````n <- 19
p <- runif(n)
system.time(enum.prob.product(n,p))
##   user  system elapsed
## 24.064   1.470  26.082
``````

This is on my 2 GHz MacBook (circa 2009). It should be noted that the computation itself is quite fast; it is the enumeration of the branches of the probability tree (I would guess the `unlist` within that) that is taking the bulk of the time. Any suggestions from the community on another approach to do that will be appreciated.