Iceberg Slim Iceberg Slim - 7 months ago 122
R Question

Probabilities of events grouped by their outcome

Say there are $n$ independent events. Each has a probability $p_n$ and an associated loss $l_n$. My goal is to produce a list of all possible loss amounts and their associated probabilities.

Eventually I would like to extend this to sets of 10-20 events with variable probabilities and loss amounts. This will all be done in R.

The various outcomes are given by the power set, e.g. for three events: (null), (A), (B), (C), (A and B), (A and C), (B and C), (A and B and C). The probability of each of these outcomes can be found by taking the product of the probabilities in each subset, and the total loss by taking the sum of losses in each subset.

My problem is how to aggregate by the loss amounts, i.e. to find all unique loss amounts in the power set and produce their probabilities.

I feel like I'm halfway there with the inclusion/exclusion principle, but I can't quite get my head around how to apply it to my particular problem, especially as the number of events goes above 3, or in the case of the sets of intermediate size, e.g. how to group all the 2 element sets above.

Answer Source

For a problem this small--there are at most 2^20 (around a million) possibilities--brute force works fine.

To illustrate, let's generate some data of moderate size:

n <- 15
set.seed(17)
p <- runif(n)
loss <- ceiling(rgamma(n, 3, 1/2))
signif(rbind(Probability=p, Loss=loss), 2)

Here are the input values for this example:

Probability  0.16 0.97  0.47 0.78  0.41 0.54  0.21 0.19 0.78  0.19  0.43 0.0023  0.83  0.83  0.96
Loss        12.00 4.00 10.00 8.00 10.00 6.00 12.00 5.00 4.00  8.00  8.00 8.0000  4.00  4.00  4.00

Generate a binary indicator of the power set with expand.grid and then use array operations for relatively fast calculation of the losses and the probabilities of all the possible outcomes:

powerset <- t(expand.grid(lapply(p, function(x) 0:1)))
probability <- apply(powerset * (2*p - 1) + (1-p), 2, prod)
losses <- colSums(powerset * loss)

(On this aging Xeon workstation, this takes up to 5 seconds when n is 20.)

Summarize by loss using tapply:

x <- tapply(probability, losses, sum)

(This takes another 1 to 2 seconds when n is 20.)

We can check for consistency by (a) verifying the probabilities sum to unity and (b) checking that the expected loss is the sum of the expected losses of the individual events:

if(sum(probability) - 1 != 0) warning("Unnormalized probability.")
if(sum(probability * losses) - sum(p*loss) != 0) warning("Inconsistent result.")

Let's plot the resulting loss distribution.

library(ggplot2)
ggplot(data.frame(Loss=as.numeric(names(x)), Probability=x), 
       aes(Loss, Probability)) + 
  geom_col(color="White")

Figure

Recommended from our users: Dynamic Network Monitoring from WhatsUp Gold from IPSwitch. Free Download