Buechel - 8 months ago 51

R Question

I am doing regression on language data where I want to predict a numeric emotion value for a sentence. My data is 120x531. I'm using a so-called bag-of-words approach so my data is relatively sparse.

I want to start with a simple linear regression model, so my code is essentially this:

`ctrl = trainControl(method="cv", number=10)`

`model.valence.lm = train(data[,5:531], data[,2], model = "lm", trControl = ctrl)`

`model.valence.lm`

However, caret seems to confuse linear models and random forests so I get the following output (see in particular the first line):

`Random Forest`

120 samples

527 predictors

No pre-processing

Resampling: Cross-Validated (10 fold)

Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...

Resampling results across tuning parameters:

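| mtry | RMSE | Rsquared | RMSE SD | Rsquared SD |
|------|----------|-----------|-----------|-------------|
| 2 | 2.594079 | 0.2786009 | 0.1236510 | 0.1612251 |
| 32 | 2.459950 | 0.1920956 | 0.1886138 | 0.1484976 |
| 526 | 2.639718 | 0.1028518 | 0.2459268 | 0.1067835 |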

RMSE was used to select the optimal model using the smallest value.

The final value used for the model was mtry = 32.

What makes this even more confusing for me is that I basically copied and pasted this code from a previous project, where it worked. Does anyone have any idea why this happens? I checked my data object, and the features I use are apparently integers (not numerics/floats). Might this be a possible explanation?

Answer

Random forest (`"rf"`) is the default value of the `method` parameter of `train`. You have set the `model` parameter instead, which caret has accepted without complaint but ignored. Use `method="lm"`.
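As a minimal sketch of the fix, the call below passes `method = "lm"` instead of `model = "lm"`. The data here is a small synthetic stand-in (the names `x` and `y` and the Poisson-distributed counts are assumptions for illustration, not the asker's data); with the real data you would pass `data[,5:531]` and `data[,2]` as before.

```r
library(caret)

set.seed(1)

# Hypothetical stand-in data: 120 rows of integer count features,
# loosely mimicking a small bag-of-words matrix.
n <- 120
x <- as.data.frame(matrix(rpois(n * 5, lambda = 1), nrow = n))
y <- rnorm(n)

ctrl <- trainControl(method = "cv", number = 10)

# `method` (not `model`) selects the learner; "lm" is linear regression.
model.valence.lm <- train(x, y, method = "lm", trControl = ctrl)

# Check which learner was actually fitted.
model.valence.lm$method
class(model.valence.lm$finalModel)
```

Printing `model.valence.lm` should then start with "Linear Regression" rather than "Random Forest".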