Bob McBobson - 1 year ago 36

R Question

I have a training-test function set up in R that takes a set of data, excludes a certain portion of it (to preclude the model overfitting the data), and then trains a linear model on about half of the remaining data before testing the model on the other half.

I should note that the data set consists of PCA scores, which is why the linear model includes the seven principal components.

splitprob = 0.7

trainindex = createDataPartition(scores$y, p=splitprob, list=F)

trainingset = scores[trainindex,]

testset = scores[-trainindex,]

model = glm(y ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7, data=trainingset)

summary(model)

prediction = predict(model, newdata=testset, se.fit=TRUE)

Now, what I want to do is run this script multiple times, produce multiple models, and then pick one or more models that will be used to make future predictions. Although I have set up the function to be run a certain number of times, I do not know how to set it up so that I can compare the different models against one another (probably by using the AIC), nor am I sure how I should capture parameters of the model(s) in order to export them to a text file or a .csv file.

I tried the glmulti package, but because of various problems with Java, rJava, and Mac OS X, I have had a great deal of trouble getting it to install properly. Could anyone recommend another approach to this problem?


Answer Source

I updated the answer to return sorted data frames that pair each iteration number with its corresponding 'aic' or 'rmse' (root-mean-squared error) value.

```
num_iters = 100 # 100 train-test splits
```

build data

```
set.seed(1)
y = runif(100)
x = matrix(runif(1000), nrow = 100, ncol = 10)
```

repeated train-test splits

```
list_models = list()
aic_models = matrix(0, nrow = num_iters, ncol = 2)
rmse_models = matrix(0, nrow = num_iters, ncol = 2)

for (i in seq_len(num_iters)) {
  spl = sample(1:100, 50, replace = FALSE)  # random 50/50 train-test split
  train = data.frame(y = y[spl], x[spl, ])
  test = data.frame(y = y[-spl], x[-spl, ])
  fit = glm(y ~ ., data = train)
  pred_test = predict(fit, newdata = test)
  aic_models[i, ] = c(i, summary(fit)$aic)  # saves the iteration and the corresponding aic
  rmse_models[i, ] = c(i, Metrics::rmse(y[-spl], pred_test))  # saves the iteration and the corresponding test rmse (Metrics package); the lower the better
  list_models[[i]] = fit  # saves the current model
}
```
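If installing the Metrics package is not an option, the RMSE can also be computed in base R. A minimal sketch; the helper name `rmse_base` is my own and not part of any package:

```r
# Hypothetical helper (not from the Metrics package):
# root-mean-squared error computed with base R only
rmse_base <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Small check on made-up numbers: the squared errors are 0, 0, 0, 4,
# so the mean is 1 and the RMSE is 1
actual    <- c(1, 2, 3, 4)
predicted <- c(1, 2, 3, 6)
rmse_base(actual, predicted)
```

Inside the loop, `rmse_base(y[-spl], pred_test)` could then stand in for the `Metrics::rmse` call.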

convert the resulting aic and rmse matrices to data frames

```
sort_aic = as.data.frame(aic_models)
colnames(sort_aic) = c('iteration', 'aic_value')
sort_rmse = as.data.frame(rmse_models)
colnames(sort_rmse) = c('iteration', 'rmse_value')
```

sort the models by aic and by rmse (the lower the better in both cases)

```
sort_aic = sort_aic[order(sort_aic$aic_value, decreasing = FALSE), ]
sort_rmse = sort_rmse[order(sort_rmse$rmse_value, decreasing = FALSE), ]

print(sort_aic[1:10, ])
#>    iteration  aic_value
#> 59        59 -3.2890503
#> 90        90 -1.5475516
#> 63        63  0.7166507
#> 7          7  2.8596637
#> 47        47  3.4051807
#> 95        95  3.6488699
#> 76        76  3.9099417
#> 65        65  3.9244424
#> 70        70  4.4830083
#> 75        75  4.5077221

print(sort_rmse[1:10, ])
#>    iteration rmse_value
#> 28        28  0.2428743
#> 69        69  0.2517444
#> 96        96  0.2523283
#> 44        44  0.2525145
#> 10        10  0.2538310
#> 8          8  0.2576595
#> 64        64  0.2582306
#> 36        36  0.2586123
#> 6          6  0.2604191
#> 51        51  0.2607045
```

depending on the sorted data frames ('sort_aic', 'sort_rmse') you can select the 'n' best models (here I select n = 10 models)

```
n = 10
bst_aic = sort_aic$iteration[1:n] # best models using aic
bst_rmse = sort_rmse$iteration[1:n] # best models using rmse
```
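The two rankings need not agree. As a quick check, the iteration numbers from the two printed top-10 tables can be compared directly (the vectors below are copied from that output, and the name `in_both` is my own):

```r
# Iteration numbers copied from the printed top-10 tables
bst_aic  <- c(59, 90, 63, 7, 47, 95, 76, 65, 70, 75)
bst_rmse <- c(28, 69, 96, 44, 10, 8, 64, 36, 6, 51)

# Iterations that rank in the top 10 under both criteria
in_both <- intersect(bst_aic, bst_rmse)
length(in_both)  # 0: no iteration appears in both lists
```

The AIC here is computed on the training half while the RMSE is computed on the held-out half, so it is worth deciding up front which of the two matters more for your use case.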

retrieve the best models by indexing with 'bst_aic' or 'bst_rmse'

```
aic_bst_models = list_models[bst_aic]
rmse_bst_models = list_models[bst_rmse]
```
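To capture the model parameters in a .csv file, one option is to collect the coefficients of each model into a data frame and write it out with `write.csv`. The sketch below is self-contained, so it refits five small models as a stand-in for `list_models`; the names `coef_table` and `out_file` and the file name are assumptions made for illustration:

```r
set.seed(1)
y <- runif(100)
x <- matrix(runif(1000), nrow = 100, ncol = 10)

# Stand-in for 'list_models': five models fitted on different random halves
list_models <- lapply(1:5, function(i) {
  spl <- sample(1:100, 50, replace = FALSE)
  glm(y ~ ., data = data.frame(y = y[spl], x[spl, ]))
})

# One row per model: position in the list, AIC, and every coefficient
coef_table <- do.call(rbind, lapply(seq_along(list_models), function(i) {
  fit <- list_models[[i]]
  data.frame(model = i, aic = AIC(fit), t(coef(fit)), check.names = FALSE)
}))

# Hypothetical output location; a temporary directory is used here
out_file <- file.path(tempdir(), "model_coefficients.csv")
write.csv(coef_table, out_file, row.names = FALSE)
```

In the session above you would skip the refitting step and pass `aic_bst_models` or `rmse_bst_models` instead. If you need the whole fitted objects rather than just the coefficients, `saveRDS()` keeps them intact for a later `predict()` call.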

do predictions based on one of the selected models

```
preds_new_data = predict(rmse_bst_models[[1]], newdata = ....)
```
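Whatever is passed as `newdata` must be a data frame whose column names match the predictors used in training. Because the training data frames above are built from an unnamed matrix, R names the predictors `X1` to `X10`. A minimal sketch with made-up new observations (`new_x` is a hypothetical name, and the values are random):

```r
set.seed(1)
y <- runif(100)
x <- matrix(runif(1000), nrow = 100, ncol = 10)

# data.frame() names the unnamed matrix columns X1..X10
fit <- glm(y ~ ., data = data.frame(y = y, x))

# Three made-up new observations with matching column names
new_x <- as.data.frame(matrix(runif(30), nrow = 3, ncol = 10))
names(new_x) <- paste0("X", 1:10)

preds <- predict(fit, newdata = new_x)
length(preds)  # one prediction per new row
```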

You can use whichever evaluation metric is appropriate for your data, and you can also have a look at the difference between 'aic' and 'bic' in another Stack Overflow question.
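For reference, both criteria are built from the same log-likelihood and differ only in the penalty per estimated parameter: 2 for AIC and log(n) for BIC. A small sketch on simulated data (the variable names are my own) that reproduces R's `AIC()` and `BIC()` by hand:

```r
set.seed(1)
d <- data.frame(y = runif(50), matrix(runif(500), nrow = 50, ncol = 10))
fit <- glm(y ~ ., data = d)

ll <- logLik(fit)
k  <- attr(ll, "df")  # estimated parameters: 11 coefficients plus the error variance
n  <- nobs(fit)       # number of observations used in the fit

aic_manual <- -2 * as.numeric(ll) + 2 * k
bic_manual <- -2 * as.numeric(ll) + log(n) * k

all.equal(aic_manual, AIC(fit))
all.equal(bic_manual, BIC(fit))
```

Since log(n) exceeds 2 once n is at least 8, BIC penalizes extra parameters more heavily than AIC on all but the smallest data sets.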
