RTrain3k - 8 months ago 53

R Question

I am using

`plyr::ddply`

`model <- rating ~ A + B + C + D + E + F`

by the factor

`resp.id`

`indiv.betas <- ddply(data.coded, "resp.id",`

function(df) coef(lm(model, data=df)))

I am now trying to extract the p-values for the variables by the factor using:

`indiv.pvalues <- ddply(data.coded, "resp.id",`

function(df) coef(summary(lm(model, data=df)))[, "Pr(>|t|)"])

Unfortunately, it just gives me a data frame with

`NaN`

Although, if I run a model across the entire data set, I can extract p-values from this one model as a data frame successfully with:

`pvalue <- as.data.frame(coef(summary(lm(model, data=df)))[, "Pr(>|t|)"])`

How can I create a data frame of the p-values by the factor?

Thanks.

Answer

When you fit a single model

```
rating ~ A + B + C + D + E + F
```

you get meaningful, non-NA result. While when you fit the same model for each subset / factor level by `resp.id`

, you get `NaN`

result. I am 100% sure that for some factor level, you don't have enough data to fit the above model. It would be a good idea, to first check how many data there are for each group. You can use:

```
N <- with(data.coded, tapply(rating, resp.id, FUN = length))
```

Your model has 7 coefficients (1 for intercept and 1 each for A, B, ..., F). So `which(N < 7)`

will tell you which factor levels are producing `NaN`

.

In this part, I will show that I am not able to reproduce your problem with `iris`

dataset.

```
library(plyr)
model <- Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width
ddply(iris, "Species", function(df) coefficients(lm(model, data=df)))
# Species (Intercept) Sepal.Width Petal.Length Petal.Width
#1 setosa 2.351890 0.6548350 0.2375602 0.2521257
#2 versicolor 1.895540 0.3868576 0.9083370 -0.6792238
#3 virginica 0.699883 0.3303370 0.9455356 -0.1697527
ddply(iris, "Species", function(df) coef(summary(lm(model, data=df)))[, 4])
# Species (Intercept) Sepal.Width Petal.Length Petal.Width
#1 setosa 3.034183e-07 6.834434e-09 2.593594e-01 0.470987
#2 versicolor 5.112246e-04 6.488965e-02 1.666695e-06 0.125599
#3 virginica 1.961563e-01 6.439972e-02 1.074269e-13 0.395875
```

In this part, I will show why `NaN`

could appear when there are more coefficients than data.

```
set.seed(0);
x1 <- rnorm(3); x2 <- rnorm(3); x3 <- rnorm(3)
y <- rnorm(3)
fit <- lm(y ~ x1 + x2 + x3) ## 3 data, 4 coefficients
coef(summary(fit))
# Estimate Std. Error t value Pr(>|t|)
#(Intercept) 0.4217653 NaN NaN NaN
#x1 0.4124869 NaN NaN NaN
#x2 1.1489330 NaN NaN NaN
```