I'm trying to create a more parsimonious version of this solution, which entails specifying the RHS of a formula in the form d1 + d1:d2.
*
d1 * d2
d1 + d2 + d1:d2
%+:%
"%+:%" <- function(d1,d2) d1 + d2 + d1:d2
set.seed(1029)
v1 <- runif(1000)
v2 <- runif(1000)
y <- .8*(v1 < .3) + .2 * (v2 > .25 & v2 < .8) -
.4 * (v2 > .8) + .1 * (v1 > .3 & v2 > .8)
y ~ cut(v2, breaks = c(0, .25, .8, 1)) +
cut(v2, breaks = c(0, .25, .8, 1)):I(v1 < .3)
plus.times <- function(outvar, d1, d2){
as.formula(paste0(quote(outvar), "~", quote(d1),
"+", quote(d1), ":", quote(d2)))
}
lm
d1
d2
out1 <- lm(y ~ cut(v2, breaks = c(0, .25, .8, 1)) +
cut(v2, breaks = c(0, .25, .8, 1)):I(v1 < .3))
out2 <- lm(plus.times(y, cut(v2, breaks = c(0, .25, .8, 1)), I(v1 < .3)))
any(out1$coefficients != out2$coefficients)
# [1] FALSE
names(out2$coefficients)
# [1] "(Intercept)" "d1(0.25,0.8]" "d1(0.8,1]" "d1(0,0.25]:d2TRUE"
# [5] "d1(0.25,0.8]:d2TRUE" "d1(0.8,1]:d2TRUE"
plus.times
?formula
?"~"
?":"
getAnywhere(formula.default)
*
You do not need to define a new operator in this case: in a formula, d1/d2 expands to d1 + d1:d2. In other words, d1/d2 specifies that d2 is nested within d1. Continuing your example:
out3 <- lm(y ~ cut(v2,breaks=c(0,.25,.8,1))/I(v1 < .3))
all.equal(coef(out1), coef(out3))
# [1] TRUE
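To check the expansion without fitting anything, one option is to inspect the term labels that terms() produces. This is only a small sketch: d1 and d2 are the placeholder names from the question, and terms() does not need them to exist as variables.

```r
# Compare the terms generated by the nesting operator with the
# explicit form. No data are needed; only the formula is parsed.
nested   <- attr(terms(y ~ d1/d2), "term.labels")
explicit <- attr(terms(y ~ d1 + d1:d2), "term.labels")

print(nested)
print(identical(nested, explicit))
```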
Further comments
Factors may be crossed or nested. Two factors are crossed if it is possible to observe every combination of levels of the two factors, e.g. sex and treatment, temperature and pH, etc. A factor is nested within another if each level of that factor can only be observed within one of the levels of the other factor, e.g. town and country, staff member and store, etc.
These relationships are reflected in the parametrization of the model. For crossed factors we use d1*d2, or d1 + d2 + d1:d2, to give the main effect of each factor, plus the interaction. For nested factors we use d1/d2, or d1 + d1:d2, to give a separate submodel of the form 1 + d2 for each level of d1.
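The difference between the two parametrizations can be seen in the columns of the design matrix. The sketch below uses a made-up 2 x 3 layout (the data frame dat and its level names are assumptions for illustration only, not part of the original example).

```r
# Hypothetical balanced design: d1 has 2 levels, d2 has 3 levels.
dat <- expand.grid(d1 = factor(c("a", "b")),
                   d2 = factor(c("p", "q", "r")))

# Crossed: main effects of d1 and d2, plus their interaction.
crossed <- colnames(model.matrix(~ d1 * d2, data = dat))

# Nested: main effect of d1, plus an effect of d2 within each level of d1.
nested <- colnames(model.matrix(~ d1 / d2, data = dat))

print(crossed)
print(nested)
```

Both matrices have the same number of columns here, but the nested version has no stand-alone d2 column; instead, each d2 effect appears once per level of d1.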
The idea of nesting is not restricted to factors; for example, we may use sex/x to fit a separate linear regression on x for males and females.
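As a rough illustration of the sex/x case, the sketch below simulates some data (the variables sex, x and y, the sample size and the true coefficients are all invented for this example) and checks that the slope for one group from the nested model matches the slope from a regression fitted to that group alone.

```r
set.seed(1)
n   <- 200
sex <- factor(sample(c("F", "M"), n, replace = TRUE))
x   <- runif(n)
# Simulated response with a different intercept and slope per sex.
y   <- ifelse(sex == "F", 1 + 2 * x, -1 + 0.5 * x) + rnorm(n, sd = 0.1)

# Nested model: sex + sex:x, i.e. one intercept shift and one slope per sex.
fit <- lm(y ~ sex/x)
print(names(coef(fit)))

# Regression on x using the female observations only.
fit_f <- lm(y ~ x, subset = sex == "F")

# The within-F slope from the nested fit should agree with the subset fit.
same_slope <- all.equal(unname(coef(fit)["sexF:x"]),
                        unname(coef(fit_f)["x"]))
print(same_slope)
```

The point estimates agree because the nested model gives each sex its own intercept and slope; the standard errors can differ, since the nested fit pools the residual variance across both groups.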
In a formula, %in% is equivalent to :, but it may be used to emphasize the nested, or hierarchical, structure of the data/model. For example, a + b %in% a is the same as a + a:b, but reading it as "a plus b within a" gives a better description of the model being fitted. Even so, using / has the advantage of simplifying the model formula at the same time as emphasizing the structure.