Rymatt830 - 6 months ago 32

R Question

I am getting different results from lmPerm based on the order in which I enter the variables in the function call.

For example, placing NCF.pf before TotalProperties yields the following:

`pfit <- lmp(NetCashOps ~ NCF.pf + TotalProperties, data = sub.pm, subset = Presence == 1)`

summary(pfit)

...

Coefficients:

Estimate Iter Pr(Prob)

NCF.pf 4.581e-01 51 1

TotalProperties 5.246e+04 5000 <2e-16 ***

but, when I switch the order of the coefficients in the formula andplace TotalProperties before NCF.pf, the p-value on NCF.pf becomes significant

`pfit2 <- lmp(NetCashOps ~ TotalProperties + NCF.pf, data = sub.pm, subset = Presence == 1)`

summary(pfit2)

...

Coefficients:

Estimate Iter Pr(Prob)

TotalProperties 5.246e+04 5000 <2e-16 ***

NCF.pf 4.581e-01 5000 <2e-16 ***

Am I missing something? Why would the p-values be different just because I switched the order of the variables in the function call?

The data can be found on GitHub at this link.

When calling the standard lm function twice (reversing the order of the variables on the second call), the p-values are identical (see below). Hence, unlike when using the lmPerm function, the order of the variables doesn't matter with lm.

`fit1 <- lm(NetCashOps ~ NCF.pf + TotalProperties, data = sub.pm, subset = Presence == 1)`

summary(fit1)

...

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.088e+05 2.258e+05 3.138 0.0019 **

NCF.pf 4.581e-01 1.112e-01 4.121 5.11e-05 ***

TotalProperties 5.246e+04 9.519e+03 5.511 8.76e-08 ***

fit2 <- lm(NetCashOps ~ TotalProperties + NCF.pf, data = sub.pm, subset = Presence == 1)

summary(fit2)

...

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 7.088e+05 2.258e+05 3.138 0.0019 **

TotalProperties 5.246e+04 9.519e+03 5.511 8.76e-08 ***

NCF.pf 4.581e-01 1.112e-01 4.121 5.11e-05 ***

Thanks!

Answer

I already saw 2 close votes to migrate this to Cross Validated, but in my humble opinion this should stay on Stack Overflow. It is true, that t-statistic and p-value are not invariant to the specification order of terms, under the non-pivoted QR factorization strategy used by `lm`

and `lmp`

, but as shown in the new edit, for OP's data, these statistic should be invariant. So there must be something sensitive at programming level.

My quick diagnose suggests, that if we set `seqs = TRUE`

, rather than using the default `FALSE`

, we would get consistent result:

```
## I have subsetted data with `Presence == 1` into a new dataset `dat`
## I have also renamed variable name for simplicity
coef(summary(lmp(y ~ x1 + x2, dat, seqs = TRUE)))
# Estimate Iter Pr(Prob)
#(Intercept) 2.019959e+06 5000 0
#x1 4.580840e-01 5000 0
#x2 5.245619e+04 5000 0
coef(summary(lmp(y ~ x2 + x1, dat, seqs = TRUE)))
# Estimate Iter Pr(Prob)
#(Intercept) 2.019959e+06 5000 0
#x2 5.245619e+04 5000 0
#x1 4.580840e-01 5000 0
```

Note, the `Pr(Prob)`

should be "< 2e-16" when printed by `summary`

, but when using `coef`

to obtain a matrix, those tiny values are 0.

The documentation of `?lmp`

mentions a little on this part:

```
The SS will be calculated _sequentially_, just as ‘lm()’ does; or
they may be calculated _uniquely_, which means that the SS for
each source is calculated conditionally on all other sources.
```

I am at the moment not sure what `SS`

is (as I am not a user of `lmPerm`

), but this sounds like that for consistent result, we should set `seqs = TRUE`

.