Buechel - 1 year ago
R Question

R's Caret package confuses linear model (lm) and random forest

I am doing regression on language data where I want to predict a numeric emotion value for a sentence. My data is 120x531. I'm using a so-called bag-of-words approach so my data is relatively sparse.

I want to start with a simple linear regression model, so my code is essentially this:

ctrl = trainControl(method="cv", number=10)
model.valence.lm = train(data[,5:531], data[,2], model = "lm", trControl = ctrl)

However, caret seems to confuse linear models and random forests so I get the following output (see in particular the first line):

Random Forest

120 samples
527 predictors

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   RMSE SD    Rsquared SD
  2     2.594079  0.2786009  0.1236510  0.1612251
  32    2.459950  0.1920956  0.1886138  0.1484976
  526   2.639718  0.1028518  0.2459268  0.1067835

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 32.

What makes this even more confusing for me is that I basically copied and pasted this code from a previous project, where it worked. Does anyone have any idea why this happens? I checked my data object, and the features I use are apparently integers (not numerics/floats). Might this be a possible explanation?

Answer Source

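Random forest ("rf") is the default value of the method parameter of train(). You have set model = "lm" instead, but train() has no model parameter, so caret accepted the argument without complaint, ignored it, and fell back to the default. The integer features are not the cause. Use method = "lm".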
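A minimal sketch of the corrected call, to make the fix concrete. The data here is a small synthetic stand-in (the names x, y, and fit are placeholders, not the asker's objects), so only the method = "lm" argument carries over to the real code:

```r
library(caret)

set.seed(1)

# Synthetic stand-in data (an assumption for illustration only):
# 120 rows and 5 integer count features, loosely mimicking bag-of-words counts.
x <- as.data.frame(matrix(rpois(120 * 5, lambda = 1), nrow = 120))
y <- rnorm(120)

# Same resampling setup as in the question.
ctrl <- trainControl(method = "cv", number = 10)

# The fix: pass the model type via `method`, not `model`.
fit <- train(x, y, method = "lm", trControl = ctrl)

# Check which method caret actually used and what the fitted object is.
print(fit$method)
print(class(fit$finalModel))
```

Printing fit itself should now begin with "Linear Regression" rather than "Random Forest". With the original data the call would be train(data[,5:531], data[,2], method = "lm", trControl = ctrl).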