autoEnsemble {autoEnsemble} | R Documentation |
Automatically Trains H2O Models and Builds a Stacked Ensemble Model
Description
Automatically trains various algorithms to build base-learners and then automatically creates a stacked ensemble model from them.
Usage
autoEnsemble(
x,
y,
training_frame,
validation_frame = NULL,
nfolds = 10,
balance_classes = TRUE,
max_runtime_secs = NULL,
max_runtime_secs_per_model = NULL,
max_models = NULL,
sort_metric = "AUCPR",
include_algos = c("GLM", "DeepLearning", "DRF", "XGBoost", "GBM"),
save_models = FALSE,
directory = paste("autoEnsemble", format(Sys.time(), "%d-%m-%y-%H:%M")),
...,
newdata = NULL,
family = "binary",
strategy = c("search"),
model_selection_criteria = c("auc", "aucpr", "mcc", "f2"),
min_improvement = 1e-05,
max = NULL,
top_rank = seq(0.01, 0.99, 0.01),
stop_rounds = 3,
reset_stop_rounds = TRUE,
stop_metric = "auc",
seed = -1,
verbatim = FALSE,
startH2O = FALSE,
nthreads = NULL,
max_mem_size = NULL,
min_mem_size = NULL,
ignore_config = FALSE,
bind_to_localhost = FALSE,
insecure = TRUE
)
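The default value of directory in the signature above is built from the current time. As a small illustration, the base R sketch below evaluates the same expression for a fixed timestamp. The timestamp and the alternative name safe_dir are assumptions made for illustration and are not part of the package. The default contains a space and a colon, and colons are not allowed in Windows file names, so passing an explicit directory may be the safer choice there.

```r
# Fixed timestamp so the result is reproducible (illustrative value only)
ts <- as.POSIXct("2023-05-17 14:30:00", tz = "UTC")

# Same expression as the default of `directory`, with Sys.time() replaced by `ts`
default_dir <- paste("autoEnsemble", format(ts, "%d-%m-%y-%H:%M"))
print(default_dir)

# Hypothetical alternative that avoids the space and the colon
safe_dir <- paste0("autoEnsemble_", format(ts, "%Y%m%d_%H%M"))
print(safe_dir)
```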
Arguments
x |
Vector. Predictor column names or indices. |
y |
Character. The response column name or index. |
training_frame |
An H2OFrame containing the training data.
Default is |
validation_frame |
An H2OFrame for early stopping. Default is NULL. |
nfolds |
Integer. Number of folds for cross-validation. Default is 10. |
balance_classes |
Logical. Specify whether to oversample the minority classes to balance the class distribution; only applicable to classification. |
max_runtime_secs |
Integer. Maximum time, in seconds, that the AutoML process is allowed to run. |
max_runtime_secs_per_model |
Maximum runtime in seconds dedicated to each individual model training process. |
max_models |
Maximum number of models to build in the AutoML training (passed to autoML) |
sort_metric |
Metric to sort the leaderboard by (passed to autoML). For binomial classification choose between "AUC", "AUCPR", "logloss", "mean_per_class_error", "RMSE", "MSE". For regression choose between "mean_residual_deviance", "RMSE", "MSE", "MAE", and "RMSLE". For multinomial classification choose between "mean_per_class_error", "logloss", "RMSE", "MSE". The default in autoEnsemble is "AUCPR". If set to "AUTO", then "AUC" will be used for binomial classification, "mean_per_class_error" for multinomial classification, and "mean_residual_deviance" for regression. |
include_algos |
Vector of character strings naming the algorithms to restrict to during the model-building phase. This argument is passed to autoML. |
save_models |
Logical. If TRUE, the trained models will be stored locally. |
directory |
Path to a local directory in which the trained models are stored. |
... |
Parameters to be passed to the autoML algorithm in the h2o package. |
newdata |
H2O frame (data.frame). The data.frame must already be uploaded to the h2o server (cloud). When specified, this dataset is used for evaluating the models. If not specified, model performance on the training dataset is reported. |
family |
model family. currently only |
strategy |
character. the current available strategies are |
model_selection_criteria |
Character vector specifying the performance metrics that
should be taken into consideration for model selection. The default is
c("auc", "aucpr", "mcc", "f2"). |
min_improvement |
Numeric. Specifies the minimum improvement in the model evaluation metric required to continue the optimization search. |
max |
integer. specifies the maximum number of models to be extracted for each criterion. the
default value is the |
top_rank |
numeric vector. specifies the percentage of the top models that
should be selected. if the strategy is |
stop_rounds |
Integer. Number of stopping rounds, in case the model stops improving. |
reset_stop_rounds |
Logical. If TRUE, every time the model improves, the stopping rounds penalty is reset to 0. |
stop_metric |
Character. Model stopping metric. The default is "auc". |
seed |
random seed (recommended) |
verbatim |
Logical. If TRUE, reports additional information about the progress of model training, mainly useful for debugging. |
startH2O |
Logical. If TRUE, the h2o server will be initiated. |
nthreads |
arguments to be passed to h2o.init() |
max_mem_size |
arguments to be passed to h2o.init() |
min_mem_size |
arguments to be passed to h2o.init() |
ignore_config |
arguments to be passed to h2o.init() |
bind_to_localhost |
arguments to be passed to h2o.init() |
insecure |
arguments to be passed to h2o.init() |
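The interplay of min_improvement, stop_rounds, and reset_stop_rounds can be pictured with a small base R sketch. This is not the package's implementation: the helper search_stop_index and the vector of scores are hypothetical, and the exact rule used by autoEnsemble may differ. The sketch assumes that candidate ensembles are scored one after another on stop_metric and that the search ends once stop_rounds candidates have failed to beat the best score by more than min_improvement.

```r
# Hypothetical helper (not part of autoEnsemble): returns the index of the
# candidate at which the search would stop under the assumed rule.
search_stop_index <- function(scores, min_improvement = 1e-05,
                              stop_rounds = 3, reset_stop_rounds = TRUE) {
  best <- scores[1]
  penalty <- 0
  for (i in seq_along(scores)[-1]) {
    if (scores[i] - best > min_improvement) {
      # Improvement large enough: keep the new best score
      best <- scores[i]
      if (reset_stop_rounds) penalty <- 0
    } else {
      # No sufficient improvement: count one stopping round
      penalty <- penalty + 1
    }
    if (penalty >= stop_rounds) return(i)
  }
  length(scores)
}

# Made-up scores, e.g. AUC values of successive candidate ensembles
scores <- c(0.80, 0.82, 0.82, 0.81, 0.83, 0.83, 0.82, 0.83, 0.85)

stop_reset   <- search_stop_index(scores, reset_stop_rounds = TRUE)
stop_noreset <- search_stop_index(scores, reset_stop_rounds = FALSE)
print(c(stop_reset, stop_noreset))
```

With reset_stop_rounds = FALSE the failed rounds accumulate across improvements, so under these assumptions the search stops earlier than with reset_stop_rounds = TRUE.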
Value
A list including the ensemble model and the top-ranked models that were used to build it.
Author(s)
E. F. Haghish
Examples
## Not run:
# load the required libraries for building the base-learners and the ensemble models
library(h2o)
library(autoEnsemble)
# initiate the h2o server
h2o.init(ignore_config = TRUE, nthreads = 2, bind_to_localhost = FALSE, insecure = TRUE)
# upload data to h2o cloud
prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate <- h2o.importFile(path = prostate_path, header = TRUE)
### H2O provides 2 types of grid search for tuning the models, which are
### AutoML and Grid. Below, I tune 2 sets of model grids and use them both
### for building the ensemble, just to set an example ...
#######################################################
### PREPARE AutoML Grid (takes a couple of minutes)
#######################################################
# run AutoML to tune various models (GLM, GBM, XGBoost, DRF, DeepLearning) for 120 seconds
y <- "CAPSULE"
prostate[,y] <- as.factor(prostate[,y]) #convert to factor for classification
aml <- h2o.automl(y = y, training_frame = prostate, max_runtime_secs = 120,
include_algos=c("DRF","GLM", "XGBoost", "GBM", "DeepLearning"),
# this setting ensures the models are comparable for building a meta learner
seed = 2023, nfolds = 10,
keep_cross_validation_predictions = TRUE)
#######################################################
### PREPARE H2O Grid (takes a couple of minutes)
#######################################################
# make sure an equal number of "nfolds" is specified for the different grids
grid <- h2o.grid(algorithm = "gbm", y = y, training_frame = prostate,
hyper_params = list(ntrees = seq(1,50,1)),
grid_id = "ensemble_grid",
# this setting ensures the models are comparable for building a meta learner
seed = 2023, fold_assignment = "Modulo", nfolds = 10,
keep_cross_validation_predictions = TRUE)
#######################################################
### PREPARE ENSEMBLE MODEL
#######################################################
### get the models' IDs from the AutoML and grid searches.
### this is all that is needed before building the ensemble,
### i.e., to specify the model IDs that should be evaluated.
ids <- c(h2o.get_ids(aml), h2o.get_ids(grid))
top <- ensemble(models = ids, training_frame = prostate, strategy = "top")
search <- ensemble(models = ids, training_frame = prostate, strategy = "search")
#######################################################
### EVALUATE THE MODELS
#######################################################
h2o.auc(aml@leader) # best model identified by h2o.automl
h2o.auc(h2o.getModel(grid@model_ids[[1]])) # best model identified by grid search
h2o.auc(top$model) # ensemble model with 'top' search strategy
h2o.auc(search$model) # ensemble model with 'search' search strategy
## End(Not run)
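The examples above build the ensemble from existing AutoML and grid models with ensemble(). For completeness, the following is a minimal sketch of calling autoEnsemble() directly, using only arguments listed under Usage. It needs a running h2o server and can take a few minutes. The choice of predictors (all columns except the response and the ID column) is an assumption made for illustration. Accessing the result through fit$model is also an assumption, by analogy with the ensemble() examples above.

```r
library(h2o)
library(autoEnsemble)

# initiate the h2o server and upload the example data
h2o.init(ignore_config = TRUE, nthreads = 2, bind_to_localhost = FALSE, insecure = TRUE)
prostate_path <- system.file("extdata", "prostate.csv", package = "h2o")
prostate <- h2o.importFile(path = prostate_path, header = TRUE)

# response as a factor for classification
y <- "CAPSULE"
prostate[, y] <- as.factor(prostate[, y])

# assumption: use every column except the response and the ID column as predictors
x <- setdiff(colnames(prostate), c(y, "ID"))

# train the base-learners and build the stacked ensemble in one call
fit <- autoEnsemble(x = x, y = y, training_frame = prostate,
                    max_runtime_secs = 120, nfolds = 10, seed = 2023)

# assumption: the ensemble is stored in the 'model' element of the returned list
h2o.auc(fit$model)
```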