forder - 6 months ago 31

R Question

I have three datasets:

response - matrix of 5(samples) x 10(dependent variables)

predictors - matrix of 5(samples) x 2(independent variables)

test_set - matrix of 10(samples) x 10(dependent variables defined in response)

`response <- matrix(sample.int(15, size = 5*10, replace = TRUE), nrow = 5, ncol = 10)`

colnames(response) <- c("1_DV","2_DV","3_DV","4_DV","5_DV","6_DV","7_DV","8_DV","9_DV","10_DV")

predictors <- matrix(sample.int(15, size = 7*2, replace = TRUE), nrow = 5, ncol = 2)

colnames(predictors) <- c("1_IV","2_IV")

test_set <- matrix(sample.int(15, size = 10*2, replace = TRUE), nrow = 10, ncol = 2)

colnames(test_set) <- c("1_IV","2_IV")

I'm doing a multivariate linear model using a training set defined as the combination of response and predictor sets, and I would like to use this model to make predictions for the test set:

`training_dataframe <- data.frame(predictors, response)`

fit <- lm(response ~ predictors, data = training_dataframe)

predictions <- predict(fit, data.frame(test_set))

However, the results for predictions are really odd:

`predictions`

First off the matrix dimensions are 5 x 10, which is the number of samples in the response variable by the number of DVs.

I'm not very skilled with this type of analysis in R, but shouldn't I be getting a 10 x 10 matrix, so that I have predictions for each row in my test_set?

Any help with this issue would be greatly appreciated,

Martin

Answer

You are stepping into a poorly supported part in R. The model class you have is "mlm", i.e., "multiple linear models", which is not the standard "lm" class. You get it when you have several (independent) response variables for a common set of covariates / predictors. Although `lm()`

function can fit such model, `predict`

method is poor for "mlm" class. If you look at `methods(predict)`

, you would see a `predict.mlm*`

. Normally for a linear model with "lm" class, `predict.lm`

is called when you call `predict`

; but for a "mlm" class the `predict.mlm*`

is called.

`predict.mlm*`

is too primitive. It does not allow `se.fit`

, i.e., it can not produce prediction errors, confidence / prediction intervals, etc, although this is possible in theory. It can only compute prediction mean. If so, why do we want to use `predict.mlm*`

at all?! The prediction mean can be obtained by a trivial matrix-matrix multiplication (in standard "lm" class this is a matrix-vector multiplication), so we can do it on our own.

Consider this small, reproduce example.

```
set.seed(0)
## 2 response of 10 observations each
response <- matrix(rnorm(20), 10, 2)
## 3 covariates with 10 observations each
predictors <- matrix(rnorm(30), 10, 3)
fit <- lm(response ~ predictors)
class(fit)
# [1] "mlm" "lm"
beta <- coef(fit)
# [,1] [,2]
#(Intercept) 0.5773235 -0.4752326
#predictors1 -0.9942677 0.6759778
#predictors2 -1.3306272 0.8322564
#predictors3 -0.5533336 0.6218942
```

When you have a prediction data set:

```
# 2 new observations for 3 covariats
test_set <- matrix(rnorm(6), 2, 3)
```

we first need to pad an intercept column

```
Xp <- cbind(1, test_set)
```

Then do this matrix multiplication

```
pred <- Xp %*% beta
# [,1] [,2]
#[1,] -2.905469 1.702384
#[2,] 1.871755 -1.236240
```

Perhaps you have noticed that I did not even use a data frame here. Yes it is unnecessary as you have everything in matrix form.

But as a complete answer, it is a must to demonstrate how to use `predict.mlm`

to get our desired result.

```
set.seed(0)
response <- matrix(rnorm(20), 10, 2)
predictors <- matrix(rnorm(30), 10, 3)
test_set <- matrix(rnorm(6), 2, 3)
training_dataframe <- data.frame(response = I(response), predictors = I(predictors))
fit <- lm(response ~ predictors, data = training_dataframe)
newdat <- data.frame(predictors = I(test_set))
pred <- predict(fit, newdat)
# [,1] [,2]
#[1,] -2.905469 1.702384
#[2,] 1.871755 -1.236240
```

Note the `I()`

when I use `data.frame()`

. This is a must when we want to obtain **a data frame of matrices**. You can compare the difference between:

```
str(data.frame(response = I(response), predictors = I(predictors)))
#'data.frame': 10 obs. of 2 variables:
# $ response : AsIs [1:10, 1:2] 1.262954.... -0.32623.... 1.329799.... 1.272429.... 0.414641.... ...
# $ predictors: AsIs [1:10, 1:3] -0.22426.... 0.377395.... 0.133336.... 0.804189.... -0.05710.... ...
str(data.frame(response = response, predictors = predictors))
#'data.frame': 10 obs. of 5 variables:
# $ response.1 : num 1.263 -0.326 1.33 1.272 0.415 ...
# $ response.2 : num 0.764 -0.799 -1.148 -0.289 -0.299 ...
# $ predictors.1: num -0.2243 0.3774 0.1333 0.8042 -0.0571 ...
# $ predictors.2: num -0.236 -0.543 -0.433 -0.649 0.727 ...
# $ predictors.3: num 1.758 0.561 -0.453 -0.832 -1.167 ...
```

Without `I()`

to protect the matrix input, data are messy. It is amazing that this will not cause problem to `lm`

, but `predict.mlm`

will have a hard time obtaining the correct matrix for prediction, if you don't use `I()`

.

**Well, I would recommend using a "list" instead of a data frame in this case.** `data`

argument in `lm`

as well `newdata`

argument in `predict`

allows list input. A "list" is a more general structure than a data frame, which can hold any data structure without difficulty. We can do:

```
set.seed(0)
response <- matrix(rnorm(20), 10, 2)
predictors <- matrix(rnorm(30), 10, 3)
test_set <- matrix(rnorm(6), 2, 3)
training_list <- list(response = response, predictors = predictors)
fit <- lm(response ~ predictors, data = training_list)
newdat <- list(predictors = test_set)
pred <- predict(fit, newdat)
```