MichaelChirico - 2 months ago 13
R Question

Defining an infix operator for use within a formula

I'm trying to create a more parsimonious version of this solution, which entails specifying the RHS of a formula in the form `d1 + d1:d2`.

Given that `*` in the context of a formula is a pithy stand-in for full interaction (i.e. `d1 * d2` gives `d1 + d2 + d1:d2`), my approach has been to try to define an alternative operator, say `%+:%`, using the infix approach I've grown accustomed to in other applications, a la:

```r
"%+:%" <- function(d1, d2) d1 + d2 + d1:d2
```

However, this predictably fails because I haven't been careful about evaluation; let's introduce an example to illustrate my progress:

```r
set.seed(1029)
v1 <- runif(1000)
v2 <- runif(1000)
y <- .8 * (v1 < .3) + .2 * (v2 > .25 & v2 < .8) -
  .4 * (v2 > .8) + .1 * (v1 > .3 & v2 > .8)
```

With this example, hopefully it's clear why simply writing out the two terms might be undesirable:

```r
y ~ cut(v2, breaks = c(0, .25, .8, 1)) +
  cut(v2, breaks = c(0, .25, .8, 1)):I(v1 < .3)
```

One workaround which is close to my desired output is to define the whole formula as a function:

```r
plus.times <- function(outvar, d1, d2){
  as.formula(paste0(quote(outvar), "~", quote(d1),
                    "+", quote(d1), ":", quote(d2)))
}
```

This gives the expected coefficients when passed to `lm`, but with names that are harder to interpret directly (especially in the real data where we take care to give `d1` and `d2` descriptive names, in contrast to this generic example):

```r
out1 <- lm(y ~ cut(v2, breaks = c(0, .25, .8, 1)) +
             cut(v2, breaks = c(0, .25, .8, 1)):I(v1 < .3))
out2 <- lm(plus.times(y, cut(v2, breaks = c(0, .25, .8, 1)), I(v1 < .3)))
any(out1$coefficients != out2$coefficients)
# [1] FALSE
names(out2$coefficients)
# [1] "(Intercept)"         "d1(0.25,0.8]"        "d1(0.8,1]"           "d1(0,0.25]:d2TRUE"
# [5] "d1(0.25,0.8]:d2TRUE" "d1(0.8,1]:d2TRUE"
```

So this is less than optimal.

Is there any way to adjust the code so that the infix operator I mentioned above works as expected? How about altering the form of `plus.times` so that the variables are not renamed?

I've been poking around (`?formula`, `?"~"`, `?":"`, `getAnywhere(formula.default)`, this answer, etc.) but haven't seen how exactly R interprets `*` when it's encountered in a formula so that I can make my desired minor adjustments.

You do not need to define a new operator in this case: in a formula `d1/d2` expands to `d1 + d1:d2`. In other words `d1/d2` specifies that `d2` is nested within `d1`. Continuing your example:

```r
out3 <- lm(y ~ cut(v2, breaks = c(0, .25, .8, 1)) / I(v1 < .3))
all.equal(coef(out1), coef(out3))
# [1] TRUE
```
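To see how R expands these operators, and why a user-defined infix operator does not take part in that expansion, it can help to look at the term labels that `terms()` derives from a formula. The following is a minimal sketch: `term_labels` is a hypothetical helper introduced only for this illustration, and `%+:%` mirrors the operator from the question. `terms()` works symbolically, so no data are needed.

```r
# Hypothetical operator, mirroring the attempt in the question.
`%+:%` <- function(d1, d2) d1 + d2 + d1:d2

# Hypothetical helper: return the term labels R derives from a formula.
term_labels <- function(f) attr(terms(f), "term.labels")

term_labels(y ~ d1 / d2)     # nesting:  "d1" "d1:d2"
term_labels(y ~ d1 * d2)     # crossing: "d1" "d2" "d1:d2"
term_labels(y ~ d1 %+:% d2)  # not expanded: "d1 %+:% d2"
```

Only the operators that the formula parser recognizes are expanded. Any other call, including a custom `%op%`, is kept as a single term and is evaluated as ordinary R code when the model frame is built, which is why the operator in the question does not behave like `*` or `/`.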

Whether factors are crossed or nested is reflected in the parametrization of the model. For crossed factors we use `d1*d2` or `d1 + d2 + d1:d2` to give the main effect of each factor plus the interaction. For nested factors we use `d1/d2` or `d1 + d1:d2` to give a separate submodel of the form `1 + d2` for each level of `d1`.
The idea of nesting is not restricted to factors; for example, we may use `sex/x` to fit a separate linear regression on `x` for males and females.
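As a rough illustration of the `sex/x` case, the sketch below fits the nested model on simulated data and checks that its per-sex slopes match those from two separate regressions. The data, the seed, and the names `fit_nested`, `fit_f` and `fit_m` are all made up for this example.

```r
set.seed(1)
n <- 200
sex <- factor(rep(c("F", "M"), each = n / 2))
x <- runif(n)
# Simulated response with a different intercept and slope for each sex.
y <- ifelse(sex == "F", 1 + 2 * x, 3 - x) + rnorm(n, sd = 0.1)

# Nested model: one slope on x within each level of sex.
fit_nested <- lm(y ~ sex / x)
names(coef(fit_nested))
# [1] "(Intercept)" "sexM"        "sexF:x"      "sexM:x"

# Separate regressions for each sex, for comparison.
fit_f <- lm(y ~ x, subset = sex == "F")
fit_m <- lm(y ~ x, subset = sex == "M")

all.equal(unname(coef(fit_nested)["sexF:x"]), unname(coef(fit_f)["x"]))
all.equal(unname(coef(fit_nested)["sexM:x"]), unname(coef(fit_m)["x"]))
```

The coefficients `sexF:x` and `sexM:x` are the slopes within each group, so they can be read off directly instead of being reconstructed from a main effect plus an interaction.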
In a formula, `%in%` is equivalent to `:`, but it may be used to emphasize the nested, or hierarchical structure of the data/model. For example, `a + b %in% a` is the same as `a + a:b`, but reading it as "a plus b within a" gives a better description of the model being fitted. Even so, using `/` has the advantage of simplifying the model formula at the same time as emphasizing the structure.
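The equivalence of the three spellings can be checked on a small example. This is only a sketch on simulated data; the factors `a` and `b`, the response, and the object names are invented for the illustration. It compares fitted values rather than coefficient names, since the names may be labelled differently across the three forms.

```r
set.seed(2)
a <- factor(rep(c("a1", "a2"), each = 30))
b <- factor(rep(c("b1", "b2", "b3"), times = 20))
y <- rnorm(60)

fit_in    <- lm(y ~ a + b %in% a)  # "a plus b within a"
fit_colon <- lm(y ~ a + a:b)       # explicit interaction term
fit_slash <- lm(y ~ a / b)         # nesting shorthand

# All three formulas should describe the same fitted model.
all.equal(fitted(fit_in), fitted(fit_colon))
all.equal(fitted(fit_colon), fitted(fit_slash))
```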