I have a train-test routine set up in R that takes a data set, holds out a portion of it (to keep the model from overfitting), trains a linear model on one part of the remaining data, and then tests the model on the other part.
I should note that the data set consists of PCA scores, which is why the linear model is set up to include the seven PCA components.
library(caret)  # createDataPartition() comes from the caret package

splitprob = 0.7
trainindex = createDataPartition(scores$y, p = splitprob, list = FALSE)
trainingset = scores[trainindex, ]
testset = scores[-trainindex, ]

# fit the linear model on the training portion; the predictors are the PCA components
model = glm(y ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7, data = trainingset)
summary(model)

# predict on the held-out portion
prediction = predict(model, newdata = testset, se.fit = TRUE)
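To actually score that prediction against the held-out responses, a minimal sketch (assuming the Metrics package, which the answer below also uses) would be:

# compare the test-set predictions with the held-out responses; lower rmse is better
Metrics::rmse(testset$y, prediction$fit)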
I updated the answer so that it returns sorted data.frames which include the iteration number and the corresponding 'aic' and 'rmse' (root-mean-squared error) values.
num_iters = 100 # 100 train-test splits
# build data
set.seed(1)
y = runif(100)
x = matrix(runif(1000), nrow = 100, ncol = 10)
# repeated train-test splits
list_models = list()
aic_models = matrix(0, nrow = 100, ncol = 2)
rmse_models = matrix(0, nrow = 100, ncol = 2)
for (i in 1:num_iters) {
  spl = sample(1:100, 50, replace = FALSE)
  train = data.frame(y = y[spl], x[spl, ])
  test = data.frame(y = y[-spl], x[-spl, ])
  fit = glm(y ~ ., data = train)
  pred_test = predict(fit, newdata = test)
  aic_models[i, ] = c(i, summary(fit)$aic)                    # saves the iteration and the corresponding aic
  rmse_models[i, ] = c(i, Metrics::rmse(y[-spl], pred_test))  # saves the iteration and the corresponding rmse (Metrics package) - the lower the better
  list_models[[i]] = fit                                      # saves the current model
}
# convert the resulting aic and rmse matrices to data.frames
sort_aic = as.data.frame(aic_models)
colnames(sort_aic) = c('iteration', 'aic_value')
sort_rmse = as.data.frame(rmse_models)
colnames(sort_rmse) = c('iteration', 'rmse_value')
# sort the aic and rmse data.frames (the lower the better in both cases)
sort_aic = sort_aic[order(sort_aic$aic_value, decreasing = FALSE), ]
sort_rmse = sort_rmse[order(sort_rmse$rmse_value, decreasing = FALSE), ]
print(sort_aic[1:10, ])
iteration aic_value
59 59 -3.2890503
90 90 -1.5475516
63 63 0.7166507
7 7 2.8596637
47 47 3.4051807
95 95 3.6488699
76 76 3.9099417
65 65 3.9244424
70 70 4.4830083
75 75 4.5077221
print(sort_rmse[1:10, ])
iteration rmse_value
28 28 0.2428743
69 69 0.2517444
96 96 0.2523283
44 44 0.2525145
10 10 0.2538310
8 8 0.2576595
64 64 0.2582306
36 36 0.2586123
6 6 0.2604191
51 51 0.2607045
# based on the sorted data.frames ('sort_aic', 'sort_rmse') you can select the 'n' best models (here I select n = 10 models)
n = 10
bst_aic = sort_aic$iteration[1:n] # best models using aic
bst_rmse = sort_rmse$iteration[1:n] # best models using rmse
# retrieve the best models using the 'bst_aic' or 'bst_rmse' indices
aic_bst_models = list_models[bst_aic]
rmse_bst_models = list_models[bst_rmse]
# make predictions based on one of the selected models
preds_new_data = predict(rmse_bst_models[[1]], newdata = ....)
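To make the last line concrete, here is a purely hypothetical example: 'new_data' below is invented for illustration, and its column names simply mirror the X1 ... X10 names that data.frame() assigned to the predictors inside the loop above.

# hypothetical new observations with the same predictor columns as 'train' (X1 ... X10)
new_data = data.frame(matrix(runif(50), nrow = 5, ncol = 10))
colnames(new_data) = paste0("X", 1:10)
preds_new_data = predict(rmse_bst_models[[1]], newdata = new_data)

# one option: average the predictions of all 'n' best rmse models instead of using a single one
preds_avg = rowMeans(sapply(rmse_bst_models, predict, newdata = new_data))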
You can use an evaluation metric that is appropriate for your data, and you can also have a look at the difference between 'aic' and 'bic' in another Stack Overflow question.
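If you do want to rank by BIC instead, a minimal sketch using base R's BIC() on the models saved above (no extra packages needed) could look like:

# rank the saved models by BIC instead of AIC - the lower the better
bic_values = sapply(list_models, BIC)
sort_bic = data.frame(iteration = 1:num_iters, bic_value = bic_values)
sort_bic = sort_bic[order(sort_bic$bic_value, decreasing = FALSE), ]
print(sort_bic[1:n, ])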