dan - 1 year ago
R Question

# Set contrasts in glm

I have binomial count data, coming from a set of conditions, that are overdispersed. To simulate them I use the `rbetabinom` function of the `emdbook` `R` package:

```r
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
                 n = as.integer(runif(30,100,200)),
                 theta = rep(runif(3,1,5)),
                 cond = rep(LETTERS[1:3],10),
                 stringsAsFactors=FALSE)
df$k <- sapply(1:nrow(df), function(x)
    rbetabinom(n=1, prob=df$p[x], size=df$n[x], theta=df$theta[x], shape1=1, shape2=1))
```

I want to find the effect of each condition (`cond`) on the counts (`k`).
I think the `glm.nb` function of the `MASS` `R` package allows modelling that:

```r
library(MASS)
fit <- glm.nb(k ~ cond + offset(log(n)), data = df)
```

My question is how to set the contrasts such that I get the effect of each condition relative to the mean effect over all conditions, rather than relative to the reference condition `A`?

Two things: (1) if you want contrasts relative to the mean, use `contr.sum` rather than the default `contr.treatment`; (2) you probably shouldn't fit beta-binomial data with a negative binomial model; use a beta-binomial model instead (e.g. via `VGAM` or `bbmle`)!
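To see what switching to `contr.sum` buys you, here is a minimal base-R sketch (using `lm` on hypothetical toy data for illustration; the same `contrasts` argument works with `glm.nb`). Under sum-to-zero contrasts the intercept is the mean of the level means, and each `cond` coefficient is that level's deviation from it:

```r
## toy data: three conditions with different true means (illustrative only)
set.seed(1)
d <- data.frame(cond = rep(LETTERS[1:3], each = 10))
d$y <- rnorm(30, mean = c(A = 1, B = 2, C = 6)[d$cond])

## sum-to-zero contrasts instead of the default treatment contrasts
fit <- lm(y ~ cond, data = d, contrasts = list(cond = "contr.sum"))
gm  <- tapply(d$y, d$cond, mean)   # per-level means

## intercept = mean of the level means
all.equal(unname(coef(fit)[1]), mean(gm))                     # TRUE
## cond1 = deviation of level A from that mean
all.equal(unname(coef(fit)[2]), unname(gm["A"] - mean(gm)))   # TRUE
## the deviation of the last level is implied: -(cond1 + cond2)
all.equal(unname(gm["C"] - mean(gm)), -sum(coef(fit)[2:3]))   # TRUE
```

Equivalently, `options(contrasts = c("contr.sum", "contr.poly"))` sets this globally for all subsequent model fits, which is what the `VGAM` example below relies on.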

```r
library(emdbook)
set.seed(1)
df <- data.frame(p = rep(runif(3,0,1)),
                 n = as.integer(runif(30,100,200)),
                 theta = rep(runif(3,1,5)),
                 cond = rep(LETTERS[1:3],10),
                 stringsAsFactors=FALSE)
## slightly abbreviated: rbetabinom is vectorized
df$k <- rbetabinom(n=nrow(df), prob=df$p,
                   size=df$n, theta=df$theta, shape1=1, shape2=1)
```

With `VGAM`:

```r
library(VGAM)
## note: dbetabinom/rbetabinom from emdbook are masked
options(contrasts=c("contr.sum","contr.poly"))
vglm(cbind(k,n-k) ~ cond, data=df,
     family=betabinomialff(zero=2)  ## hold shape parameter 2 constant
)
## Coefficients:
## (Intercept):1 (Intercept):2         cond1         cond2
##     0.4312181     0.5197579    -0.3121925     0.3011559
## Log-likelihood: -147.7304
```

Here the intercept is the mean shape parameter across the levels; `cond1` and `cond2` are the deviations of levels 1 and 2 from that mean. This doesn't directly report the deviation of level 3 from the mean, but by the sum-to-zero construction it must be `-(cond1 + cond2)`.
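To make that concrete, the level-3 deviation can be recovered from the two printed coefficients via the sum-to-zero constraint:

```r
## coefficients as printed in the VGAM output above
cond1 <- -0.3121925
cond2 <-  0.3011559
## contr.sum constrains cond1 + cond2 + cond3 = 0
cond3 <- -(cond1 + cond2)
cond3
## [1] 0.0110366
```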

I find the parameterization with `bbmle` (with logit-probability and dispersion parameter) a little easier:

```r
detach("package:VGAM")
library(bbmle)
mle2(k ~ dbetabinom(k, prob=plogis(lprob),
                    size=n, theta=exp(ltheta)),
     parameters=list(lprob~cond),
     data=df,
     start=list(lprob=0, ltheta=0))
## Coefficients:
## lprob.(Intercept)       lprob.cond1       lprob.cond2            ltheta
##       -0.09606536       -0.31615236        0.17353311        1.15201809
##
## Log-likelihood: -148.09
```

The log-likelihoods are about the same (the `VGAM` parameterization fits slightly better); in theory, if we allowed both `shape1` and `shape2` (`VGAM`) or both `lprob` and `ltheta` (`bbmle`) to vary across conditions, the two parameterizations would give identical log-likelihoods.
