Bob McBobson - 6 months ago
R Question

How can I create multiple data models, then pick the model(s) that turn out to fit best?

I have a training-test function set up in R that takes a set of data, holds out a certain portion of it (to guard against the model overfitting the data), and then trains a linear model on about half of the remaining data before testing the model on the other half.

I should note that the data set consists of PCA scores, which is why the linear model is set up to include the seven PCA components.

library(caret)    # provides createDataPartition

splitprob = 0.7
trainindex = createDataPartition(scores$y, p = splitprob, list = FALSE)
trainingset = scores[trainindex, ]
testset = scores[-trainindex, ]
model = glm(y ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7, data = trainingset)
summary(model)
prediction = predict(model, newdata = testset, se.fit = TRUE)   # evaluate on the held-out half
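For context, a 'scores' object like the one used above could be built from prcomp output roughly as follows; the 'mydata' name and the separation of the response column are assumptions, since that part of the setup is not shown:

# hypothetical construction of the 'scores' data frame ('mydata' is a
# placeholder for a numeric data frame with a response column 'y')
pca = prcomp(mydata[, setdiff(names(mydata), 'y')], center = TRUE, scale. = TRUE)
scores = data.frame(y = mydata$y, pca$x[, 1:7])   # keep the first seven PCs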


Now, what I want to do is run this script multiple times, produce multiple models, and then pick one or more of those models to use for future predictions. Although I have set the function up to run a certain number of times, I do not know how to compare the different models against one another (probably by using the AIC), nor am I sure how to capture the parameters of the model(s) in order to export them to a text file or a .csv file.

I tried implementing the glmulti package, but due to various problems with Java, rJava, and Mac OS X, I have had massive trouble getting it to install properly. Could anyone recommend another approach to this problem?

Answer

I updated the answer to return sorted data frames that include the iteration number and the corresponding AIC and RMSE (root-mean-squared error) values.

num_iters = 100         # 100 train-test splits 

# build simulated data
set.seed(1)
y = runif(100)
x = matrix(runif(1000), nrow = 100, ncol = 10)

# repeated train-test splits
list_models = list()
aic_models = matrix(0, nrow = num_iters, ncol = 2)
rmse_models = matrix(0, nrow = num_iters, ncol = 2)


for (i in 1:num_iters) {

  spl = sample(1:100, 50, replace = FALSE)
  train = data.frame(y = y[spl], x[spl, ])
  test = data.frame(y = y[-spl], x[-spl, ])

  fit = glm(y ~ ., data = train)
  pred_test = predict(fit, newdata = test)

  aic_models[i, ] = c(i, summary(fit)$aic)                    # save the iteration and the corresponding aic
  rmse_models[i, ] = c(i, Metrics::rmse(y[-spl], pred_test))  # save the iteration and the corresponding rmse (Metrics package) - the lower the better

  list_models[[i]] = fit                                      # save the current model
}
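Note that Metrics::rmse simply computes the square root of the mean squared differences, so if the Metrics package is not available, an equivalent helper is easy to define:

# root-mean-squared error computed manually (equivalent to Metrics::rmse)
rmse_manual = function(actual, predicted) sqrt(mean((actual - predicted)^2))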

# convert the resulting aic / rmse matrices to data frames
sort_aic = as.data.frame(aic_models)
colnames(sort_aic) = c('iteration', 'aic_value')

sort_rmse = as.data.frame(rmse_models)
colnames(sort_rmse) = c('iteration', 'rmse_value')

# sort the aic / rmse data frames (the lower the better in both cases)
sort_aic = sort_aic[order(sort_aic$aic_value, decreasing = FALSE), ]
sort_rmse = sort_rmse[order(sort_rmse$rmse_value, decreasing = FALSE), ]


print(sort_aic[1:10, ])

   iteration  aic_value
59        59 -3.2890503
90        90 -1.5475516
63        63  0.7166507
7          7  2.8596637
47        47  3.4051807
95        95  3.6488699
76        76  3.9099417
65        65  3.9244424
70        70  4.4830083
75        75  4.5077221

print(sort_rmse[1:10, ])

   iteration rmse_value
28        28  0.2428743
69        69  0.2517444
96        96  0.2523283
44        44  0.2525145
10        10  0.2538310
8          8  0.2576595
64        64  0.2582306
36        36  0.2586123
6          6  0.2604191
51        51  0.2607045

Using the sorted data frames ('sort_aic', 'sort_rmse') you can select the 'n' best models (here I select n = 10 models):

n = 10
bst_aic = sort_aic$iteration[1:n]        # best models using aic
bst_rmse = sort_rmse$iteration[1:n]      # best models using rmse

# retrieve the best models using the 'bst_aic' or 'bst_rmse' indices
aic_bst_models = list_models[bst_aic]
rmse_bst_models = list_models[bst_rmse]
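To capture the parameters for export, as asked in the question, you can write the sorted tables and the coefficients of a selected model to .csv files with base R's write.csv; the file names below are just placeholders:

# export the sorted aic table and the coefficients of the top rmse model
# (file names are placeholder assumptions)
write.csv(sort_aic, 'sorted_aic.csv', row.names = FALSE)
best_fit = rmse_bst_models[[1]]
write.csv(data.frame(term = names(coef(best_fit)), estimate = coef(best_fit)),
          'best_model_coefficients.csv', row.names = FALSE)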

# do predictions based on one of the selected models
preds_new_data = predict(rmse_bst_models[[1]], newdata = ....)
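For example, assuming new observations with the same ten predictor columns as the training frame (the 'new_x' data below is made up purely for illustration):

# made-up new data with the same predictor names (X1..X10) that data.frame()
# assigned to the matrix columns during training
new_x = as.data.frame(matrix(runif(50), nrow = 5, ncol = 10))
colnames(new_x) = paste0('X', 1:10)
preds_new_data = predict(rmse_bst_models[[1]], newdata = new_x)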

You can use any evaluation metric that is appropriate for your data, and you can also have a look at the difference between AIC and BIC, which is discussed in another Stack Overflow question.
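Both criteria are built into base R, so, for instance, you could rank the stored models by BIC instead:

# rank the stored models by BIC (base R provides both AIC() and BIC())
bic_values = sapply(list_models, BIC)
sort_bic = data.frame(iteration = seq_along(bic_values), bic_value = bic_values)
sort_bic = sort_bic[order(sort_bic$bic_value), ]
print(sort_bic[1:10, ])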
