paqmo - 1 year ago 57

R Question

Suppose I have a series of n probabilities for success of independent Bernoulli trials, p1 to pn such that p1 != p2 != ... != pn. Give each trial a unique name.

`p <- c(0.5, 0.12, 0.7, 0.8, .02)`

a <- c("A","B","C","D","E")

I know from searching stack exchange (e.g., here and here) that I can find the cdf, pmf, etc. using the Poisson Binomial distribution function.

What I'm interested in is the exact probability of every possible combination of success and failures. (E.g. If I drew a probability tree, I want to know the probability at the end of each branch.)

`all <- prod(p)`

all

[1] 0.000672

o1 <- (0.5 * (1-0.12) * 0.7 * 0.8 * .02)

o1

[1] 0.004928

o2 <- (0.5 * 0.12 * (1-0.7) * 0.8 * .02)

o2

[1] 0.000288

...for all 2^5 possible combinations of success/failure.

What's an efficient way to go about this in R?

In the case of my actual data set, the number of trials is 19, so we're talking about 2^19 total paths on the probability tree.

Answer Source

The key to making this computation fast is to do it in log-probability space so that the product for each branch of the tree is a sum that can be computed as the inner sum of a matrix multiply. In this manner, all the branches can be computed together in vectorized fashion.

First, we construct an enumeration of all branches. For this, we use the `intToBin`

function from `R.utils`

package:

```
library(R.utils)
enum.branches <- unlist(strsplit(intToBin(seq_len(2^n)-1),split=""))
```

where `n`

is the number of Bernoulli variables. For your example, `n=5`

:

```
matrix(enum.branches, nrow=n)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17]
##[1,] "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "0" "1"
##[2,] "0" "0" "0" "0" "0" "0" "0" "0" "1" "1" "1" "1" "1" "1" "1" "1" "0"
##[3,] "0" "0" "0" "0" "1" "1" "1" "1" "0" "0" "0" "0" "1" "1" "1" "1" "0"
##[4,] "0" "0" "1" "1" "0" "0" "1" "1" "0" "0" "1" "1" "0" "0" "1" "1" "0"
##[5,] "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0"
## [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32]
##[1,] "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1" "1"
##[2,] "0" "0" "0" "0" "0" "0" "0" "1" "1" "1" "1" "1" "1" "1" "1"
##[3,] "0" "0" "0" "1" "1" "1" "1" "0" "0" "0" "0" "1" "1" "1" "1"
##[4,] "0" "1" "1" "0" "0" "1" "1" "0" "0" "1" "1" "0" "0" "1" "1"
##[5,] "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1" "0" "1"
```

results in a matrix where each column is the outcomes from a branch of the probability tree.

Now, use that to construct a matrix of log probabilities of the same size as `enum.branches`

where the value is `log(p)`

if `enum.branches=="1"`

and `log(1-p)`

otherwise. For your data, with `p <- c(0.5, 0.12, 0.7, 0.8, .02)`

, this is:

```
logp <- matrix(ifelse(enum.branches == "1", rep(log(p), 2^n), rep(log(1-p), 2^n)), nrow=n)
```

Then, sum the log-probabilities and take the exponential to get the product of the probabilities:

```
result <- exp(rep(1,n) %*% logp)
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
##[1,] 0.025872 0.000528 0.103488 0.002112 0.060368 0.001232 0.241472 0.004928 0.003528 7.2e-05
[,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
##[1,] 0.014112 0.000288 0.008232 0.000168 0.032928 0.000672 0.025872 0.000528 0.103488 0.002112
[,21] [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30]
##[1,] 0.060368 0.001232 0.241472 0.004928 0.003528 7.2e-05 0.014112 0.000288 0.008232 0.000168
[,31] [,32]
##[1,] 0.032928 0.000672
```

The `result`

will be in the same order as the numeration of branches in `enum.branches`

.

We can encapsulate the computation into a function:

```
enum.prob.product <- function(n, p) {
enum.branches <- unlist(strsplit(intToBin(seq_len(2^n)-1),split=""))
exp(rep(1,n) %*% matrix(ifelse(enum.branches == "1", rep(log(p), 2^n), rep(log(1-p), 2^n)), nrow=n))
}
```

Timing this with `19`

independent Bernoulli variables:

```
n <- 19
p <- runif(n)
system.time(enum.prob.product(n,p))
## user system elapsed
## 24.064 1.470 26.082
```

This is on my 2 GHz MacBook (circa 2009). It should be noted that the computation itself is quite fast; it is the enumeration of the branches of the probability tree (I would guess the `unlist`

within that) that is taking the bulk of the time. Any suggestions from the community on another approach to do that will be appreciated.