Rymatt830 Rymatt830 - 28 days ago 11
R Question

lmPerm P-Values Different depending on Order of Coefficients

I am getting different results from lmPerm based on the order in which I enter the variables in the function call.

For example, placing NCF.pf before TotalProperties yields the following:

pfit <- lmp(NetCashOps ~ NCF.pf + TotalProperties, data = sub.pm, subset = Presence == 1)

summary(pfit)
...
Coefficients:
Estimate Iter Pr(Prob)
NCF.pf 4.581e-01 51 1
TotalProperties 5.246e+04 5000 <2e-16 ***


but, when I switch the order of the coefficients in the formula andplace TotalProperties before NCF.pf, the p-value on NCF.pf becomes significant

pfit2 <- lmp(NetCashOps ~ TotalProperties + NCF.pf, data = sub.pm, subset = Presence == 1)

summary(pfit2)
...
Coefficients:
Estimate Iter Pr(Prob)
TotalProperties 5.246e+04 5000 <2e-16 ***
NCF.pf 4.581e-01 5000 <2e-16 ***


Am I missing something? Why would the p-values be different just because I switched the order of the variables in the function call?

Update - Data Source and lm Output (11/11/2016)

The data can be found on GitHub at this link.

When calling the standard lm function twice (reversing the order of the variables on the second call), the p-values are identical (see below). Hence, unlike when using the lmPerm function, the order of the variables doesn't matter with lm.

fit1 <- lm(NetCashOps ~ NCF.pf + TotalProperties, data = sub.pm, subset = Presence == 1)

summary(fit1)
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.088e+05 2.258e+05 3.138 0.0019 **
NCF.pf 4.581e-01 1.112e-01 4.121 5.11e-05 ***
TotalProperties 5.246e+04 9.519e+03 5.511 8.76e-08 ***


fit2 <- lm(NetCashOps ~ TotalProperties + NCF.pf, data = sub.pm, subset = Presence == 1)

summary(fit2)
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.088e+05 2.258e+05 3.138 0.0019 **
TotalProperties 5.246e+04 9.519e+03 5.511 8.76e-08 ***
NCF.pf 4.581e-01 1.112e-01 4.121 5.11e-05 ***


Thanks!

Answer

I already saw 2 close votes to migrate this to Cross Validated, but in my humble opinion this should stay on Stack Overflow. It is true, that t-statistic and p-value are not invariant to the specification order of terms, under the non-pivoted QR factorization strategy used by lm and lmp, but as shown in the new edit, for OP's data, these statistic should be invariant. So there must be something sensitive at programming level.

My quick diagnose suggests, that if we set seqs = TRUE, rather than using the default FALSE, we would get consistent result:

## I have subsetted data with `Presence == 1` into a new dataset `dat`
## I have also renamed variable name for simplicity

coef(summary(lmp(y ~ x1 + x2, dat, seqs = TRUE)))

#                Estimate Iter Pr(Prob)
#(Intercept) 2.019959e+06 5000        0
#x1          4.580840e-01 5000        0
#x2          5.245619e+04 5000        0

coef(summary(lmp(y ~ x2 + x1, dat, seqs = TRUE)))
#                Estimate Iter Pr(Prob)
#(Intercept) 2.019959e+06 5000        0
#x2          5.245619e+04 5000        0
#x1          4.580840e-01 5000        0

Note, the Pr(Prob) should be "< 2e-16" when printed by summary, but when using coef to obtain a matrix, those tiny values are 0.

The documentation of ?lmp mentions a little on this part:

The SS will be calculated _sequentially_, just as ‘lm()’ does; or
they may be calculated _uniquely_, which means that the SS for
each source is calculated conditionally on all other sources.

I am at the moment not sure what SS is (as I am not a user of lmPerm), but this sounds like that for consistent result, we should set seqs = TRUE.