
H2O R api: retrieving optimal model from grid search

I'm using the h2o package (v 3.6.0) in R, and I've built a grid search model. Now I'm trying to access the model which minimizes MSE on the validation set. In Python's sklearn, this is easily achievable when using RandomizedSearchCV:

## Pseudo-code:
grid = RandomizedSearchCV(model, params, n_iter=5)
grid.fit(X, y)
best = grid.best_estimator_


Unfortunately, this is not nearly as straightforward in h2o. Here's an example you can recreate:

library(h2o)
## assume h2o has been initialized with h2o.init()...

X <- as.h2o(iris[1:100, ])  # note: only the first two classes, for a binomial example
grid <- h2o.grid(
  algorithm = 'gbm',
  x = names(X)[1:4],
  y = 'Species',
  training_frame = X,
  hyper_params = list(
    distribution = 'bernoulli',
    ntrees = c(25, 50)
  )
)


Viewing grid prints a wealth of information, including this portion:

> grid
ntrees distribution status_ok                                                                 model_ids
    50    bernoulli        OK Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_16_model_1
    25    bernoulli        OK Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_16_model_0


With a bit of digging, you can access each individual model and view every metric imaginable:

> h2o.getModel(grid@model_ids[[1]])
H2OBinomialModel: gbm
Model ID: Grid_GBM_file1742e107fe5ba_csv_10.hex_11_model_R_1456492736353_18_model_1
Model Summary:
  number_of_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
1              50                4387         1         1    1.00000          2          2     2.00000


H2OBinomialMetrics: gbm
** Reported on training data. **

MSE: 1.056927e-05
R^2: 0.9999577
LogLoss: 0.003256338
AUC: 1
Gini: 1

Confusion Matrix for F1-optimal threshold:
            setosa versicolor    Error    Rate
setosa          50          0 0.000000   =0/50
versicolor       0         50 0.000000   =0/50
Totals          50         50 0.000000  =0/100

Maximum Metrics: Maximum metrics at their respective thresholds
metric threshold value idx
1 max f1 0.996749 1.000000 0
2 max f2 0.996749 1.000000 0
3 max f0point5 0.996749 1.000000 0
4 max accuracy 0.996749 1.000000 0
5 max precision 0.996749 1.000000 0
6 max absolute_MCC 0.996749 1.000000 0
7 max min_per_class_accuracy 0.996749 1.000000 0


And with a lot of digging, you can finally get to this:

> h2o.getModel(grid@model_ids[[1]])@model$training_metrics@metrics$MSE
[1] 1.056927e-05


This seems like a lot of kludgey work to get down to a metric that ought to be top-level for model selection (yes, I'm now interjecting my opinions...). In my situation, I've got a grid with hundreds of models, and my current, hacky solution just doesn't seem very "R-esque":

model_select_ <- function(grid) {
  model_ids <- grid@model_ids
  min_mse <- Inf      # lowest training MSE seen so far
  best_model <- NULL

  for (model_id in model_ids) {
    model <- h2o.getModel(model_id)
    mse <- model@model$training_metrics@metrics$MSE
    if (mse < min_mse) {
      min_mse <- mse
      best_model <- model
    }
  }

  best_model
}
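The same hack can be written more compactly with sapply (a sketch using the identical slot access as above), but it's still digging through S4 slots:

## same idea, vectorized: collect each model's training MSE, then fetch the minimizer
mses <- sapply(grid@model_ids, function(id) {
  h2o.getModel(id)@model$training_metrics@metrics$MSE
})
best_model <- h2o.getModel(grid@model_ids[[which.min(mses)]])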


This is awfully utilitarian for something so core to the practice of machine learning, and it strikes me as odd that h2o would not have a cleaner method for extracting the optimal model, or at least the model metrics.

Am I missing something? Is there no "out of the box" method for selecting the best model?

Answer

Yes, there is an easy way to extract the "top" model of an H2O grid search. There are also utility functions that will extract all the model metrics (e.g. h2o.mse) that you have been trying to access. Examples of how to do these things can be found in the h2o-r/demos and h2o-py/demos subfolders on the h2o-3 GitHub repo.
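For example, here is a minimal sketch of those accessors, using the grid from the question (the valid argument only returns a value if a validation_frame was supplied to h2o.grid):

model <- h2o.getModel(grid@model_ids[[1]])
h2o.mse(model)                # MSE on the training data
h2o.mse(model, valid = TRUE)  # validation MSE (requires a validation_frame)
h2o.auc(model)                # training AUC, for binomial models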

Since you are using R, here is a relevant code example that includes a grid search, with sorted results. You can also find how to access this information in the R documentation for the h2o.getGrid function.

Print the AUC for all of the models, sorted by validation AUC:

auc_table <- h2o.getGrid(grid_id = "eeg_demo_gbm_grid", sort_by = "auc", decreasing = TRUE)
print(auc_table)

Here is an example of the output:

H2O Grid Details
================

Grid ID: eeg_demo_gbm_grid 
Used hyper parameters: 
  -  ntrees 
  -  max_depth 
  -  learn_rate 
Number of models: 18 
Number of failed models: 0 

Hyper-Parameter Search Summary: ordered by decreasing auc
   ntrees max_depth learn_rate                  model_ids               auc
1     100         5        0.2 eeg_demo_gbm_grid_model_17 0.967771493797284
2      50         5        0.2 eeg_demo_gbm_grid_model_16 0.949609591795923
3     100         5        0.1  eeg_demo_gbm_grid_model_8  0.94941792664595
4      50         5        0.1  eeg_demo_gbm_grid_model_7 0.922075196552274
5     100         3        0.2 eeg_demo_gbm_grid_model_14 0.913785959685157
6      50         3        0.2 eeg_demo_gbm_grid_model_13 0.887706691652792
7     100         3        0.1  eeg_demo_gbm_grid_model_5 0.884064379717198
8       5         5        0.2 eeg_demo_gbm_grid_model_15 0.851187402678818
9      50         3        0.1  eeg_demo_gbm_grid_model_4 0.848921799270639
10      5         5        0.1  eeg_demo_gbm_grid_model_6 0.825662907513139
11    100         2        0.2 eeg_demo_gbm_grid_model_11 0.812030639460551
12     50         2        0.2 eeg_demo_gbm_grid_model_10 0.785379521713437
13    100         2        0.1  eeg_demo_gbm_grid_model_2  0.78299280750123
14      5         3        0.2 eeg_demo_gbm_grid_model_12 0.774673686150002
15     50         2        0.1  eeg_demo_gbm_grid_model_1 0.754834657912535
16      5         3        0.1  eeg_demo_gbm_grid_model_3 0.749285131682721
17      5         2        0.2  eeg_demo_gbm_grid_model_9 0.692702793188135
18      5         2        0.1  eeg_demo_gbm_grid_model_0 0.676144542037133

The top row in the table contains the model with the best AUC, so below we can grab that model and extract the validation AUC:

best_model <- h2o.getModel(auc_table@model_ids[[1]])
h2o.auc(best_model, valid = TRUE)
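Since the question was originally about minimizing MSE, the same pattern works with that metric; a minimal sketch, assuming the same grid (note decreasing = FALSE here, because lower MSE is better):

mse_table <- h2o.getGrid(grid_id = "eeg_demo_gbm_grid", sort_by = "mse", decreasing = FALSE)
best_model <- h2o.getModel(mse_table@model_ids[[1]])
h2o.mse(best_model, valid = TRUE)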

In order for the h2o.getGrid function to be able to sort by a metric on the validation set, you need to actually pass the h2o.grid function a validation_frame. In your example above, you did not pass a validation_frame, so the models in the grid cannot be evaluated on a validation set.
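For completeness, here is a sketch of how the grid from the question could be rerun with a validation split (h2o.splitFrame is the standard h2o splitting utility; the grid_id 'iris_gbm_grid' is just a hypothetical name chosen so the grid can be looked up afterwards):

splits <- h2o.splitFrame(X, ratios = 0.8, seed = 1)  # 80/20 train/validation split

grid <- h2o.grid(
  algorithm = 'gbm',
  grid_id = 'iris_gbm_grid',       # hypothetical id for h2o.getGrid lookups
  x = names(X)[1:4],
  y = 'Species',
  training_frame = splits[[1]],
  validation_frame = splits[[2]],  # makes validation metrics available
  hyper_params = list(
    distribution = 'bernoulli',
    ntrees = c(25, 50)
  )
)

With the validation_frame in place, h2o.getGrid can sort the grid by validation metrics exactly as shown above.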