Question

Set contrasts in glm

I have binomial count data, coming from a set of conditions, that are overdispersed. To simulate them I use the beta-binomial distribution implemented by the rbetabinom function of the emdbook R package:

library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
                 n = as.integer(runif(30,100,200)),
                 theta = rep(runif(3,1,5)),
                 cond = rep(LETTERS[1:3],10),
                 stringsAsFactors = FALSE)
df$k <- sapply(1:nrow(df), function(x)
    rbetabinom(n=1, prob=df$p[x], size=df$n[x],
               theta=df$theta[x], shape1=1, shape2=1))


I want to find the effect of each condition (cond) on the counts (k).
I think the glm.nb function of the MASS R package allows modelling that:

library(MASS)
fit <- glm.nb(k ~ cond + offset(log(n)), data = df)


My question is how to set the contrasts such that I get the effect of each condition relative to the mean effect over all conditions, rather than relative to the reference (dummy) condition A?

Answer

Two things: (1) if you want contrasts relative to the mean, use contr.sum rather than the default contr.treatment; (2) you probably shouldn't fit beta-binomial data with a negative binomial model; use a beta-binomial model instead (e.g. via VGAM or bbmle)!

library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
             n = as.integer(runif(30,100,200)),
             theta = rep(runif(3,1,5)),
             cond = rep(LETTERS[1:3],10),
             stringsAsFactors=FALSE)
## slightly abbreviated
df$k <- rbetabinom(n=nrow(df), prob=df$p,
                   size=df$n, theta=df$theta, shape1=1, shape2=1)
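If you do want to stay with the glm.nb fit from the question, you can also set sum-to-zero contrasts on cond alone rather than changing the global options. A minimal sketch, continuing with the df simulated above:

```r
library(MASS)
## sketch: attach sum-to-zero contrasts to this factor only,
## so other factors in the model keep their defaults
df$cond <- factor(df$cond)
contrasts(df$cond) <- contr.sum(nlevels(df$cond))
fit <- glm.nb(k ~ cond + offset(log(n)), data = df)
## cond1, cond2 are the deviations of levels A and B from the mean level;
## the deviation of C is -(cond1 + cond2)
coef(fit)
```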

With VGAM:

library(VGAM)
## note dbetabinom/rbetabinom from emdbook are masked
options(contrasts=c("contr.sum","contr.poly"))
vglm(cbind(k,n-k)~cond, data=df,
     family=betabinomialff(zero=2)  ## hold shape parameter 2 constant
)
## Coefficients:
## (Intercept):1 (Intercept):2         cond1         cond2 
##     0.4312181     0.5197579    -0.3121925     0.3011559 
## Log-likelihood: -147.7304 

Here the intercepts are the mean shape parameters across the levels; cond1 and cond2 are the differences of levels 1 and 2 from the mean. (This doesn't directly give you the difference of level 3 from the mean, but because sum contrasts force the deviations to sum to zero, it should be -(cond1+cond2).)
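For example, plugging in the coefficients printed above, the deviation of level C from the mean works out as:

```r
## deviations of levels A and B from the mean (from the vglm output above)
cond1 <- -0.3121925
cond2 <-  0.3011559
## sum contrasts constrain the deviations to sum to zero, so level C is:
cond3 <- -(cond1 + cond2)
cond3  ## 0.0110366
```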

I find the parameterization with bbmle (with logit-probability and dispersion parameter) a little easier:

detach("package:VGAM")
library(bbmle)
mle2(k~dbetabinom(k, prob=plogis(lprob),
                  size=n, theta=exp(ltheta)),
     parameters=list(lprob~cond),
     data=df,
     start=list(lprob=0,ltheta=0))
## Coefficients:
## lprob.(Intercept)       lprob.cond1       lprob.cond2            ltheta 
##       -0.09606536       -0.31615236        0.17353311        1.15201809 
## 
## Log-likelihood: -148.09 

The log-likelihoods are about the same (the VGAM parameterization is a bit better); in theory, if we allowed both shape1 and shape2 (VGAM) or lprob and ltheta (bbmle) to vary across conditions, we'd get the same log-likelihoods for both parameterizations.
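That equivalence can be checked directly; a sketch (continuing with the df from above, and not a fit I've verified numerically) that lets both bbmle parameters vary across conditions:

```r
library(emdbook)  ## for dbetabinom
library(bbmle)
## sketch: let both the logit-probability and the log-dispersion
## vary by cond; logLik should then match the corresponding
## fully condition-varying VGAM parameterization
fit2 <- mle2(k ~ dbetabinom(k, prob = plogis(lprob),
                            size = n, theta = exp(ltheta)),
             parameters = list(lprob ~ cond, ltheta ~ cond),
             data = df,
             start = list(lprob = 0, ltheta = 0))
logLik(fit2)
```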