Eloff - 9 months ago
R Question

Predict binary outcome based on past data

A program takes an input text and randomly applies a variety of changes to the wording based on a custom thesaurus. A human reviews the changes and may revert those that don't make sense in the given text. Whether each substitution was accepted or reverted is saved to a database.

Now I want to use that data to give a probability between 0 and 1 that a given change should be applied based on the past history of how often it was accepted or reverted. The idea is that changes that are known to be problematic are made less frequently when the algorithm is randomly selecting changes to make to the input text.

So there's accept_count and revert_count. How should I derive a probability from them? It should be some kind of asymptotic function that starts at around 0.5 (for lack of a better default) when there is no data, and then moves up or down based on the values of the two counters. A simple ratio won't do: doubling both counts leaves the ratio unchanged, yet statistically we are more confident in the prediction, and the calculated probability should reflect that.

Answer Source

I propose the confidence interval for the acceptance probability as a possible answer. In R you can get it easily with the function binom.test, which returns a 95% interval by default:

prob <- function (yes, no) binom.test(c(yes, no))$conf.int[1:2]

If you have no counts yet, the confidence interval ranges from 0 to 1, which is sensible:

> prob(0,0)
[1] 0 1

If you have 4 accepts and 5 rejects, or you double both and have 8 versus 10, then you get confidence intervals around the same mean, but the higher confidence shows up as a narrower interval:

> prob(4,5)
[1] 0.1369957 0.7879915
> prob(8,10)
[1] 0.2153015 0.6924283
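If a single number is needed rather than an interval, one possible way to collapse it is the midpoint of the interval. This is only a sketch and not part of the approach above as stated: prob_mid is a hypothetical helper name, and choosing the midpoint (rather than, say, the lower bound) is an assumption. It does start at 0.5 with no data and drifts toward the plain ratio as the counts grow.

```r
# Hypothetical helper: midpoint of the 95% confidence interval from binom.test.
# The midpoint is an assumed summary, chosen because it equals 0.5 when
# there is no data yet (the interval is then 0 to 1).
prob_mid <- function(yes, no) {
  ci <- binom.test(c(yes, no))$conf.int[1:2]
  mean(ci)
}

prob_mid(0, 0)   # 0.5, no data yet
prob_mid(4, 5)   # about 0.462
prob_mid(8, 10)  # about 0.454, closer to the raw ratio 4/9 = 0.444
```

Doubling the counts no longer leaves the result unchanged: the value moves from about 0.462 to about 0.454, toward the observed ratio.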

You asked for some kind of asymptotic behaviour. Let's assume the first 100 occurrences are rejects and from then on only accepts appear. You can look at the confidence intervals over time in the plot that the following code draws:

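prob <- function (yes, no)
  binom.test(c(yes, no))$conf.int[1:2]

n <- 1000
acc.low <- numeric(n)
acc.high <- numeric(n)
for (i in 1:n) {
  # 100 rejects are fixed; i is the number of accepts that follow them
  interval <- prob(i, 100)
  acc.low[i] <- interval[1]
  acc.high[i] <- interval[2]
}

# lower and upper bound of the 95% confidence interval
plot(acc.low, ylim=c(0, max(acc.high)), ylab="95% CI", type="l")
lines(acc.high)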
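To tie this back to the question's goal of making problematic changes less often, the interval can drive a random decision for each change. The following is a minimal sketch under assumptions: apply_change is a hypothetical helper, the counts are made up, and using the interval midpoint as the acceptance probability is one option among several, not something binom.test prescribes.

```r
# Hypothetical helper: decide at random whether to apply a change.
# Assumption: the midpoint of the 95% interval is used as the probability.
apply_change <- function(accept_count, revert_count) {
  ci <- binom.test(c(accept_count, revert_count))$conf.int[1:2]
  runif(1) < mean(ci)
}

set.seed(42)  # only to make the sketch reproducible

# Made-up counts: a change accepted 40 times and reverted twice ...
freq_good <- mean(replicate(1000, apply_change(40, 2)))
# ... versus a change accepted twice and reverted 40 times
freq_bad <- mean(replicate(1000, apply_change(2, 40)))

freq_good  # fraction of draws in which the well-accepted change is applied
freq_bad   # fraction of draws in which the often-reverted change is applied
```

The lower bound of the interval would be a more cautious choice, but it is 0 when there are no counts, so a new change would never be tried and would never collect any data.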