Hyper-parameter tuning with agua

agua sets up the infrastructure for the tune package to enable optimization of h2o models. As with other engines, we label the hyper-parameters with the tune() placeholder and feed the workflow into tune_*() functions such as tune_grid() and tune_bayes(). Because of API changes coming before agua's next CRAN release, for now we need a specific pull request of the tune package to run the following examples:

devtools::install_github('tidymodels/tune#531')

Next, we will go through the tuning example from Introduction to tune with the Ames housing data.

library(tidymodels)
library(agua)
library(ggplot2)
theme_set(theme_bw())
doParallel::registerDoParallel()
h2o_start()
data(ames)

set.seed(4595)
data_split <- ames %>%
  mutate(Sale_Price = log10(Sale_Price)) %>%
  initial_split(strata = Sale_Price)
ames_train <- training(data_split)
ames_test  <- testing(data_split)
cv_splits <- vfold_cv(ames_train, v = 10, strata = Sale_Price)

ames_rec <- 
  recipe(Sale_Price ~ Gr_Liv_Area + Longitude + Latitude, data = ames_train) %>% 
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_ns(Longitude, deg_free = tune("long df")) %>% 
  step_ns(Latitude,  deg_free = tune("lat df"))

lm_mod <- linear_reg(penalty = tune()) %>% 
  set_engine("h2o")

lm_wflow <- workflow() %>% 
  add_model(lm_mod) %>%
  add_recipe(ames_rec)

grid <- lm_wflow %>%
  extract_parameter_set_dials() %>% 
  grid_regular(levels = 5)

ames_res <- tune_grid(
  lm_wflow, 
  resamples = cv_splits, 
  grid = grid, 
  control = control_grid(save_pred = TRUE)
)

ames_res 
#> # Tuning results
#> # 10-fold cross-validation using stratification 
#> # A tibble: 10 × 5
#>    splits             id     .metrics           .notes           .predic…¹
#>    <list>             <chr>  <list>             <list>           <list>   
#>  1 <split [1976/221]> Fold01 <tibble [250 × 7]> <tibble [1 × 3]> <tibble> 
#>  2 <split [1976/221]> Fold02 <tibble [250 × 7]> <tibble [1 × 3]> <tibble> 
#>  3 <split [1976/221]> Fold03 <tibble [250 × 7]> <tibble [1 × 3]> <tibble> 
#>  4 <split [1976/221]> Fold04 <tibble [250 × 7]> <tibble [1 × 3]> <tibble> 
#>  5 <split [1977/220]> Fold05 <tibble [250 × 7]> <tibble [1 × 3]> <tibble> 
#>  6 <split [1977/220]> Fold06 <tibble [250 × 7]> <tibble [1 × 3]> <tibble> 
#>  7 <split [1978/219]> Fold07 <tibble [250 × 7]> <tibble [1 × 3]> <tibble> 
#>  8 <split [1978/219]> Fold08 <tibble [250 × 7]> <tibble [1 × 3]> <tibble> 
#>  9 <split [1979/218]> Fold09 <tibble [250 × 7]> <tibble [1 × 3]> <tibble> 
#> 10 <split [1980/217]> Fold10 <tibble [250 × 7]> <tibble [1 × 3]> <tibble> 
#> # … with abbreviated variable name ¹​.predictions
#> 
#> There were issues with some computations:
#> 
#>   - Warning(s) x10: A correlation computation is required, but `estimate` is co...
#> 
#> Run `show_notes(.Last.tune.result)` for more information.

The syntax is the same: we provide a workflow and a grid of hyper-parameters, and tune_grid() returns cross-validation performance for every parameter combination in every resample. There are two small differences to note when tuning h2o models:

  • Remember to call h2o_start() beforehand so that the h2o server is running and can carry out its side of the computations.

  • h2o supports only regular grids of hyper-parameters, i.e., full factorial crossings of parameter values as created by expand.grid(); see ?dials::grid_regular for more details. As such, we need to set the grid argument of tune_*() functions explicitly to a data frame containing a regular grid, rather than relying on the default grid generation (a quick check follows this list).
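
As a quick check, grid above is a full factorial crossing of 5 levels for each of the three tuning parameters, so it contains 5^3 = 125 candidate combinations; this is why each fold's .metrics tibble above has 250 rows (125 candidates times 2 metrics).

# a regular grid is a complete crossing of all parameter levels:
# 5 levels for each of 3 parameters gives 5^3 = 125 rows
nrow(grid)
#> [1] 125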

Other functions in tune for working with tuning results such as collect_metrics(), collect_predictions() and autoplot() will also recognize ames_res and work as expected.

collect_metrics(ames_res, summarize = FALSE)
#> # A tibble: 2,500 × 8
#>    id          penalty `long df` `lat df` .metric .estim…¹ .esti…² .config
#>    <chr>         <dbl>     <int>    <int> <chr>   <chr>      <dbl> <chr>  
#>  1 Fold01 0.0000000001         1        1 rmse    standard  0.115  Prepro…
#>  2 Fold01 0.0000000001         1        1 rsq     standard  0.550  Prepro…
#>  3 Fold02 0.0000000001         1        1 rmse    standard  0.112  Prepro…
#>  4 Fold02 0.0000000001         1        1 rsq     standard  0.603  Prepro…
#>  5 Fold03 0.0000000001         1        1 rmse    standard  0.116  Prepro…
#>  6 Fold03 0.0000000001         1        1 rsq     standard  0.563  Prepro…
#>  7 Fold04 0.0000000001         1        1 rmse    standard  0.112  Prepro…
#>  8 Fold04 0.0000000001         1        1 rsq     standard  0.581  Prepro…
#>  9 Fold05 0.0000000001         1        1 rmse    standard  0.0998 Prepro…
#> 10 Fold05 0.0000000001         1        1 rsq     standard  0.637  Prepro…
#> # … with 2,490 more rows, and abbreviated variable names ¹​.estimator,
#> #   ²​.estimate
autoplot(ames_res, metric = "rmse")
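
From here, the usual tune workflow applies unchanged. As a short sketch of a typical next step (not part of the original example), we could select the numerically best parameter combination and finalize the workflow with it:

# pick the combination with the best cross-validated RMSE
best_params <- select_best(ames_res, metric = "rmse")

# splice those values into the workflow and fit on the full training set
final_wflow <- finalize_workflow(lm_wflow, best_params)
final_fit <- fit(final_wflow, data = ames_train)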

One current limitation with parallel processing is that parallelization on the R side is done over resamples, i.e., control = control_grid(parallel_over = 'resamples'). We can't set parallel_over = 'everything' to get an inner parallel loop over tuning parameters. However, the h2o server itself can build models in parallel with adaptive parallelism.
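
To make the supported setting explicit, this restates the call from the example above with the resample-level parallelism spelled out; the only assumption is a registered parallel backend, such as the doParallel::registerDoParallel() call at the top:

# resamples run in parallel on the R side; within each resample,
# the h2o server handles the model parameters
ctrl <- control_grid(save_pred = TRUE, parallel_over = "resamples")

ames_res <- tune_grid(
  lm_wflow,
  resamples = cv_splits,
  grid = grid,
  control = ctrl
)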

TODO: how to use parallelism in h2o::h2o.grid

Tuning internals

For users interested in understanding the limitations and performance characteristics of tuning h2o models with agua, it helps to know some of the inner workings of h2o. agua uses the h2o::h2o.grid() function for tuning model parameters; it accepts a list of hyper-parameter values, constructs a regular grid from them, and searches for the optimal combination on a given dataset.
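
For reference, here is a minimal sketch of calling h2o.grid() directly, outside of agua. Everything in it is illustrative: the "glm" algorithm, the mtcars columns, and the lambda values (h2o's regularization parameter) are stand-ins, and the h2o cluster must already be running, as h2o_start() ensures above.

# copy an R data frame to the h2o server as an H2OFrame
cars <- h2o::as.h2o(mtcars)

# h2o crosses the hyper_params list into a regular grid and
# trains one model per combination on the server
h2o::h2o.grid(
  algorithm = "glm",
  x = c("cyl", "disp", "wt"),
  y = "mpg",
  training_frame = cars,
  hyper_params = list(lambda = c(0, 0.01, 0.1))
)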

In the above example, we have three tuning parameters of two types:

extract_parameter_set_dials(lm_wflow)
#> Collection of 3 parameters for tuning
#> 
#>  identifier     type    object
#>     penalty  penalty nparam[+]
#>     long df deg_free nparam[+]
#>      lat df deg_free nparam[+]

  • tuning parameters in the model: penalty.

  • tuning parameters in the preprocessor: long df and lat df.

Since h2o.grid() does not optimize parameters in the preprocessor, the possible values of long df and lat df in grid are iterated over as usual on the R side. Once a particular combination of them is chosen, agua engineers the relevant features, converts the model portion of grid to a list of hyper-parameters, and passes the data and model definition to h2o.grid(). h2o.grid() then conducts a series of validations to prepare the data for the server, one of which is an expand.grid() on the list of hyper-parameters to generate a regular grid. After this, the computations on the R side are complete and the rest of the model tuning is delegated to the h2o server.
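
As a simplified illustration of this split (not agua's actual code, and assuming for illustration that penalty maps to h2o's lambda for this engine): with one pair of preprocessor values fixed, only the model parameter is left for the server to iterate over.

# preprocessor parameters: one combination handled per R-side iteration
grid %>% filter(`long df` == 1, `lat df` == 1)

# model parameters: collapsed into the list handed to h2o.grid()
hyper_params <- list(lambda = unique(grid$penalty))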

If you have a parallel backend registered, resamples are processed in parallel, with models optimized and evaluated within each. The h2o server still requires the complete set of model parameters for a given resample. For this reason, agua does not support parallel_over = 'everything' on the R side.

Regarding the performance of model evaluation, h2o.grid() supports passing in a validation frame but does not return predictions on that data. To compute metrics on the holdout sample, we have to retrieve each model, convert the validation data into the format h2o expects, predict on it, and convert the results back into data frames. In the future we hope to obtain validation predictions directly, eliminating these extra data conversions.
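
A minimal sketch of that round trip, where model stands for an h2o model retrieved from the grid results and holdout for an assessment set as an R data frame (both hypothetical names):

# R data frame -> H2OFrame on the server
validation_frame <- h2o::as.h2o(holdout)

# predictions are computed on the server and come back as an H2OFrame
preds <- h2o::h2o.predict(model, validation_frame)

# H2OFrame -> R data frame, ready for yardstick metrics
preds_df <- as.data.frame(preds)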