Orca AutoML#

orca.automl.auto_estimator#

A general estimator that supports automatic model tuning. It allows users to fit a model and search for the best hyperparameters.

class bigdl.orca.automl.auto_estimator.AutoEstimator(model_builder: ModelBuilder, logs_dir: str = '/tmp/auto_estimator_logs', resources_per_trial: Optional[Dict[str, int]] = None, remote_dir: Optional[str] = None, name: Optional[str] = None)[source]#

Bases: object

Example

>>> auto_est = AutoEstimator.from_torch(model_creator=model_creator,
                                        optimizer=get_optimizer,
                                        loss=nn.BCELoss(),
                                        logs_dir="/tmp/zoo_automl_logs",
                                        resources_per_trial={"cpu": 2},
                                        name="test_fit")
>>> auto_est.fit(data=data,
                 validation_data=validation_data,
                 search_space=create_linear_search_space(),
                 n_sampling=4,
                 epochs=1,
                 metric="accuracy")
>>> best_model = auto_est.get_best_model()
static from_torch(*, model_creator: Callable, optimizer: Callable, loss: Callable, logs_dir: str = '/tmp/auto_estimator_logs', resources_per_trial: Optional[Dict[str, int]] = None, name: str = 'auto_pytorch_estimator', remote_dir: Optional[str] = None) bigdl.orca.automl.auto_estimator.AutoEstimator[source]#

Create an AutoEstimator for torch.

Parameters
  • model_creator – PyTorch model creator function.

  • optimizer – PyTorch optimizer creator function or PyTorch optimizer name (string). Note that if you pass an optimizer name, you should specify a learning rate search space with the key “lr” or LR_NAME (from bigdl.orca.automl.pytorch_utils import LR_NAME). If no learning rate search space is specified, the default learning rate of 1e-3 will be used for all estimators. See the sketch below for passing an optimizer by name.

  • loss – PyTorch loss instance, PyTorch loss creator function, or PyTorch loss name (string).

  • logs_dir – Local directory to save logs and results. It defaults to “/tmp/auto_estimator_logs”.

  • resources_per_trial – Dict. Resources for each trial, e.g. {“cpu”: 2}.

  • name – Name of the auto estimator. It defaults to “auto_pytorch_estimator”.

  • remote_dir – String. Remote directory to sync training results and checkpoints. It defaults to None and does not take effect when running locally. When running in a cluster, it defaults to “hdfs:///tmp/{name}”.

Returns

an AutoEstimator object.
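
For example, a minimal sketch of creating an AutoEstimator with the optimizer passed by name and a learning-rate search space (model_creator and data are assumed to be defined as in the class example above; “Adam” stands for any torch optimizer name):

>>> from bigdl.orca.automl import hp
>>> auto_est = AutoEstimator.from_torch(model_creator=model_creator,
                                        optimizer="Adam",
                                        loss="BCELoss",
                                        resources_per_trial={"cpu": 2},
                                        name="lr_by_name")
>>> auto_est.fit(data=data,
                 search_space={"lr": hp.loguniform(1e-4, 1e-1)},
                 n_sampling=4,
                 epochs=1,
                 metric="accuracy")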

static from_keras(*, model_creator: Callable, logs_dir: str = '/tmp/auto_estimator_logs', resources_per_trial: Optional[Dict[str, int]] = None, name: str = 'auto_keras_estimator', remote_dir: Optional[str] = None) bigdl.orca.automl.auto_estimator.AutoEstimator[source]#

Create an AutoEstimator for TensorFlow Keras.

Parameters
  • model_creator – TensorFlow Keras model creator function; see the sketch below.

  • logs_dir – Local directory to save logs and results. It defaults to “/tmp/auto_estimator_logs”.

  • resources_per_trial – Dict. Resources for each trial, e.g. {“cpu”: 2}.

  • name – Name of the auto estimator. It defaults to “auto_keras_estimator”.

  • remote_dir – String. Remote directory to sync training results and checkpoints. It defaults to None and does not take effect when running locally. When running in a cluster, it defaults to “hdfs:///tmp/{name}”.

Returns

an AutoEstimator object.
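
A minimal sketch of a TensorFlow Keras model_creator, assuming the search space supplies the hypothetical “hidden_size” and “lr” keys in config:

>>> import tensorflow as tf
>>> def model_creator(config):
...     # config holds one sampled set of hyperparameters per trial
...     model = tf.keras.models.Sequential([
...         tf.keras.layers.Dense(config["hidden_size"], activation="relu"),
...         tf.keras.layers.Dense(1)])
...     model.compile(loss="mse",
...                   optimizer=tf.keras.optimizers.Adam(config["lr"]))
...     return model
>>> auto_est = AutoEstimator.from_keras(model_creator=model_creator,
                                        resources_per_trial={"cpu": 2},
                                        name="auto_keras_estimator")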

fit(data: Union[Callable, Tuple[ndarray, ndarray], DataFrame], epochs: int = 1, validation_data: Optional[Union[Callable, Tuple[ndarray, ndarray], DataFrame]] = None, metric: Optional[Union[Callable, str]] = None, metric_mode: Optional[str] = None, metric_threshold: Optional[Union[Function, float, int]] = None, n_sampling: int = 1, search_space: Optional[Dict] = None, search_alg: Optional[str] = None, search_alg_params: Optional[Dict] = None, scheduler: Optional[str] = None, scheduler_params: Optional[Dict] = None, feature_cols: Optional[List[str]] = None, label_cols: Optional[List[str]] = None) None[source]#

Automatically fit the model and search for the best hyperparameters.

Parameters
  • data – Train data. If the AutoEstimator is created with from_torch, data can be a tuple of ndarrays, a PyTorch DataLoader, or a function that takes a config dictionary as a parameter and returns a PyTorch DataLoader. If the AutoEstimator is created with from_keras, data can be a tuple of ndarrays or a function that takes a config dictionary as a parameter and returns a TensorFlow Dataset. If data is a tuple of ndarrays, it should be in the form of (x, y), where x is the training input data and y is the training target data.

  • epochs – Max number of epochs to train in each trial. Defaults to 1. If you have also set metric_threshold, a trial will stop if either it has been optimized to the metric_threshold or it has been trained for {epochs} epochs.

  • validation_data – Validation data. The validation data type should be the same as data.

  • metric – String or customized evaluation metric function. If a string, metric is the name of the evaluation metric to optimize, e.g. “mse”. If a callable function, its signature should be func(y_true, y_pred), where y_true and y_pred are numpy ndarrays; the function should return a float value as the evaluation result. See the sketch after this parameter list for a customized metric.

  • metric_mode – One of [“min”, “max”]. “max” means a greater metric value is better. You have to specify metric_mode if you use a customized metric function; you don’t have to specify it if you use a built-in metric from bigdl.orca.automl.metrics.Evaluator.

  • metric_threshold – A trial will be terminated when the metric threshold is met.

  • n_sampling – Number of times to sample from the search_space. Defaults to 1. If hp.grid_search is in search_space, the grid will be repeated n_sampling times. If this is -1, (virtually) infinite samples are generated until a stopping condition is met.

  • search_space – A dict defining the search space.

  • search_alg – str. One of the search algorithms supported by Ray Tune, i.e. “variant_generator”, “random”, “ax”, “dragonfly”, “skopt”, “hyperopt”, “bayesopt”, “bohb”, “nevergrad”, “optuna”, “zoopt” and “sigopt”.

  • search_alg_params – Extra parameters for the search algorithm besides search_space, metric and the searcher mode.

  • scheduler – str. One of the schedulers supported by Ray Tune.

  • scheduler_params – Parameters for the scheduler.

  • feature_cols – Feature column names if data is a Spark DataFrame.

  • label_cols – Target column names if data is a Spark DataFrame.
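
A minimal sketch of fit with a customized metric function (the metric itself is arbitrary); metric_mode is required here because the metric is not built in:

>>> import numpy as np
>>> def mean_abs_err(y_true, y_pred):
...     # signature func(y_true, y_pred) returning a float, as required
...     return float(np.mean(np.abs(y_true - y_pred)))
>>> auto_est.fit(data=data,
                 validation_data=validation_data,
                 search_space=create_linear_search_space(),
                 n_sampling=4,
                 epochs=1,
                 metric=mean_abs_err,
                 metric_mode="min")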

get_best_model()[source]#

Return the best model found by the AutoEstimator.

Returns

the best model instance

get_best_config()[source]#

Return the best config found by the AutoEstimator.

Returns

A dictionary of the best hyperparameters

orca.automl.hp#

Sampling specs to be used in search space configuration.

bigdl.orca.automl.hp.uniform(lower: float, upper: float) ray.tune.sample.Float[source]#

Sample a float uniformly between lower and upper.

Parameters
  • lower – Lower bound of the sampling range.

  • upper – Upper bound of the sampling range.

bigdl.orca.automl.hp.quniform(lower: float, upper: float, q: float) ray.tune.sample.Float[source]#

Sample a float uniformly between lower and upper. Round the result to the nearest multiple of q; the upper bound is included.

Parameters
  • lower – Lower bound of the sampling range.

  • upper – Upper bound of the sampling range.

  • q – Granularity for increment.

bigdl.orca.automl.hp.loguniform(lower: float, upper: float, base: int = 10) ray.tune.sample.Float[source]#

Sample a float between lower and upper, with the exponent distributed uniformly between log_{base}(lower) and log_{base}(upper).

Parameters
  • lower – Lower bound of the sampling range.

  • upper – Upper bound of the sampling range.

  • base – Log base for the distribution. Defaults to 10.

bigdl.orca.automl.hp.qloguniform(lower: float, upper: float, q: float, base: int = 10) ray.tune.sample.Float[source]#

Sample a float between lower and upper, with the exponent distributed uniformly between log_{base}(lower) and log_{base}(upper). Round the result to the nearest multiple of q; the upper bound is included.

Parameters
  • lower – Lower bound of the sampling range.

  • upper – Upper bound of the sampling range.

  • q – Granularity for increment.

  • base – Log base for the distribution. Defaults to 10.

bigdl.orca.automl.hp.randn(mean: float = 0.0, std: float = 1.0) ray.tune.sample.Float[source]#

Sample a float from a normal distribution.

Parameters
  • mean – Mean of the normal distribution. Defaults to 0.0.

  • std – Std of the normal distribution. Defaults to 1.0.

bigdl.orca.automl.hp.qrandn(mean: float, std: float, q: float) ray.tune.sample.Float[source]#

Sample a float from a normal distribution. Round the result to the nearest multiple of q.

Parameters
  • mean – Mean of the normal distribution.

  • std – Std of the normal distribution.

  • q – Granularity for increment.

bigdl.orca.automl.hp.randint(lower: int, upper: int) ray.tune.sample.Integer[source]#

Uniformly sample an integer between lower and upper (both inclusive).

Parameters
  • lower – Lower bound of the sampling range.

  • upper – Upper bound of the sampling range.

bigdl.orca.automl.hp.qrandint(lower: int, upper: int, q: int = 1) ray.tune.sample.Integer[source]#

Uniformly sample an integer between lower and upper (both inclusive). Round the result to the nearest multiple of q.

Parameters
  • lower – Lower bound of the sampling range.

  • upper – Upper bound of the sampling range.

  • q – Integer granularity for increment. Defaults to 1.

bigdl.orca.automl.hp.choice(categories: List) ray.tune.sample.Categorical[source]#

Uniformly sample from a list.

Parameters

categories – A list to be sampled.

bigdl.orca.automl.hp.choice_n(categories: List, min_items: int, max_items: int) ray.tune.sample.Function[source]#

Sample a subset from a list.

Parameters
  • categories – A list to be sampled.

  • min_items – Minimum number of items to be sampled.

  • max_items – Maximum number of items to be sampled.

bigdl.orca.automl.hp.sample_from(func: Callable) Callable[source]#

Sample from a function.

Parameters

func – The function to be sampled.

bigdl.orca.automl.hp.grid_search(values: List) Dict[source]#

Specify grid search over a list.

Parameters

values – A list to be grid searched.
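
A sketch of a search space dict combining these sampling specs; the hyperparameter names are hypothetical and model-dependent:

>>> import numpy as np
>>> from bigdl.orca.automl import hp
>>> search_space = {
...     "lr": hp.loguniform(1e-4, 1e-1),            # float, log-uniform
...     "dropout": hp.uniform(0.1, 0.5),            # float, uniform
...     "batch_size": hp.qrandint(16, 128, 16),     # int, multiples of 16
...     "hidden_size": hp.grid_search([32, 64]),    # every value is tried
...     "activation": hp.choice(["relu", "tanh"]),  # uniform pick from a list
...     "seed": hp.sample_from(lambda spec: np.random.randint(0, 1000)),
... }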

orca.automl.metrics#

Evaluate unscaled metrics between the true values (y_true) and the predicted values (y_pred).

bigdl.orca.automl.metrics.sMAPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the symmetric mean absolute percentage error (sMAPE).

\[\text{sMAPE} = \frac{100\%}{n} \sum_{t=1}^n \frac{|y_t-\hat{y_t}|}{|y_t|+|\hat{y_t}|}\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
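
For reference, a minimal numpy sketch of the formula above (not the library implementation), for 1-D inputs:

>>> import numpy as np
>>> def smape_sketch(y_true, y_pred):
...     # 100%/n * sum of |y_t - yhat_t| / (|y_t| + |yhat_t|)
...     return 100.0 * np.mean(np.abs(y_true - y_pred)
...                            / (np.abs(y_true) + np.abs(y_pred)))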

bigdl.orca.automl.metrics.MPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the mean percentage error (MPE).

\[\text{MPE} = \frac{100\%}{n}\sum_{t=1}^n \frac{y_t-\hat{y_t}}{y_t}\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.MAPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the mean absolute percentage error (MAPE).

\[\text{MAPE} = \frac{100\%}{n}\sum_{t=1}^n |\frac{y_t-\hat{y_t}}{y_t}|\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.MDAPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the median absolute percentage error (MDAPE).

\[\text{MDAPE} = 100\%\ median(|\frac{y_1-\hat{y_1}}{y_1}|, \ldots, |\frac{y_n-\hat{y_n}}{y_n}|)\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.sMDAPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the symmetric median absolute percentage error (sMDAPE).

\[\text{sMDAPE} = 100\%\ median(\frac{|y_1-\hat{y_1}|}{|y_1|+|\hat{y_1}|}, \ldots, \frac{|y_n-\hat{y_n}|}{|y_n|+|\hat{y_n}|})\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.ME(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the mean error (ME).

\[\text{ME} = \frac{1}{n}\sum_{t=1}^n y_t-\hat{y_t}\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.MSPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the mean squared percentage error (MSPE).

\[\text{MSPE} = \frac{100\%}{n}\sum_{t=1}^n (\frac{y_t-\hat{y_t}}{y_t})^2\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.MSLE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the mean squared log error (MSLE).

\[\text{MSLE} = \frac{1}{n}\sum_{t=1}^n (log_e(1+y_t)-log_e(1+\hat{y_t}))^2\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.R2(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the R2 score.

\[R^2 = 1-\frac{\sum_{t=1}^n (y_t-\hat{y_t})^2}{\sum_{t=1}^n (y_t-\bar{y})^2}\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A floating point value (the best value is 1.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.MAE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the mean absolute error (MAE).

\[\text{MAE} = \frac{1}{n}\sum_{t=1}^n |y_t-\hat{y_t}|\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.RMSE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the square root of the mean squared error (RMSE).

\[\text{RMSE} = \sqrt{(\frac{1}{n}\sum_{t=1}^n (y_t-\hat{y_t})^2)}\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.MSE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'uniform_average') Union[numpy.float64, numpy.ndarray][source]#

Calculate the mean squared error (MSE).

\[\text{MSE} = \frac{1}{n}\sum_{t=1}^n (y_t-\hat{y_t})^2\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.Accuracy(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput=None) Union[numpy.float64, numpy.ndarray][source]#

Calculate the accuracy score (Accuracy).

\[\text{Accuracy} = \frac{1}{n}\sum_{t=1}^n 1(y_t=\hat{y_t})\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 1.0), or an array of floating point values, one for each individual target.

class bigdl.orca.automl.metrics.Evaluator[source]#

Bases: object

Evaluate metrics for y_true and y_pred.

static evaluate(metric: str, y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') numpy.float64[source]#

Evaluate a specific metric for y_true and y_pred.

Parameters
  • metric – String in [‘me’, ‘mae’, ‘mse’, ‘rmse’, ‘msle’, ‘r2’ , ‘mpe’, ‘mape’, ‘mspe’, ‘smape’, ‘mdape’, ‘smdape’, ‘accuracy’]

  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A floating point value, or an array of floating point values, one for each individual target.

orca.automl.auto_xgb#

Automatic hyperparameter optimization for XGBoost models.

AutoXGBoost inherits from AutoEstimator. Refer to the AutoEstimator API Guide for more APIs.

class bigdl.orca.automl.xgboost.auto_xgb.AutoXGBClassifier(logs_dir: str = '/tmp/auto_xgb_classifier_logs', cpus_per_trial: int = 1, name: Optional[str] = None, remote_dir: Optional[str] = None, **xgb_configs)[source]#

Bases: bigdl.orca.automl.auto_estimator.AutoEstimator

Automated XGBoost classifier.

Example

>>> search_space = {"n_estimators": hp.grid_search([50, 1000]),
                    "max_depth": hp.grid_search([2, 15]),
                    "lr": hp.loguniform(1e-4, 1e-1)}
>>> auto_xgb_clf = AutoXGBClassifier(cpus_per_trial=4,
                                     name="auto_xgb_classifier",
                                     **config)
>>> auto_xgb_clf.fit(data=(X_train, y_train),
                     validation_data=(X_val, y_val),
                     metric="error",
                     metric_mode="min",
                     n_sampling=1,
                     search_space=search_space)
>>> best_model = auto_xgb_clf.get_best_model()
Parameters
  • logs_dir – Local directory to save logs and results. It defaults to “/tmp/auto_xgb_classifier_logs”.

  • cpus_per_trial – Int. Number of cpus for each trial. It defaults to 1. The value will also be assigned to n_jobs in xgboost, which is the number of parallel threads used to run xgboost.

  • name – Name of the auto XGBoost classifier.

  • remote_dir – String. Remote directory to sync training results and checkpoints. It defaults to None and does not take effect when running locally. When running in a cluster, it defaults to “hdfs:///tmp/{name}”.

  • xgb_configs – Other scikit-learn XGBoost parameters. You may refer to https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn for the parameter names to specify. Note that the cpus_per_trial value will be used directly for n_jobs in xgboost, so you shouldn’t specify n_jobs again.

fit(data: Union[partial, Tuple[ndarray, ndarray], DataFrame], epochs: int = 1, validation_data: Optional[Union[partial, Tuple[ndarray, ndarray], DataFrame]] = None, metric: Optional[Union[Callable, str]] = None, metric_mode: Optional[str] = None, metric_threshold: Optional[Union[int, float]] = None, n_sampling: int = 1, search_space: Optional[Dict] = None, search_alg: Optional[str] = None, search_alg_params: Optional[Dict] = None, scheduler: Optional[str] = None, scheduler_params: Optional[Dict] = None, feature_cols: Optional[List[str]] = None, label_cols: Optional[List[str]] = None) None[source]#

Automatically fit the model and search for the best hyperparameters.

Parameters
  • data – A Spark DataFrame, a tuple of ndarrays, or a function. If data is a tuple of ndarrays, it should be in the form of (x, y), where x is the training input data and y is the training target data. If data is a function, it should take config as an argument and return a tuple of ndarrays in the form of (x, y); see the sketch after this parameter list.

  • epochs – Max number of epochs to train in each trial. Defaults to 1. If you have also set metric_threshold, a trial will stop if either it has been optimized to the metric_threshold or it has been trained for {epochs} epochs.

  • validation_data – Validation data. The validation data type should be the same as data.

  • metric – String or customized evaluation metric function. If a string, metric is the name of the evaluation metric to optimize, e.g. “mse”. If a callable function, its signature should be func(y_true, y_pred), where y_true and y_pred are numpy ndarrays; the function should return a float value as the evaluation result.

  • metric_mode – One of [“min”, “max”]. “max” means a greater metric value is better. You have to specify metric_mode if you use a customized metric function; you don’t have to specify it if you use a built-in metric from bigdl.orca.automl.metrics.Evaluator.

  • metric_threshold – A trial will be terminated when the metric threshold is met.

  • n_sampling – Number of times to sample from the search_space. Defaults to 1. If hp.grid_search is in search_space, the grid will be repeated n_sampling times. If this is -1, (virtually) infinite samples are generated until a stopping condition is met.

  • search_space – A dict defining the search space.

  • search_alg – str. One of the search algorithms supported by Ray Tune, i.e. “variant_generator”, “random”, “ax”, “dragonfly”, “skopt”, “hyperopt”, “bayesopt”, “bohb”, “nevergrad”, “optuna”, “zoopt” and “sigopt”.

  • search_alg_params – Extra parameters for the search algorithm besides search_space, metric and the searcher mode.

  • scheduler – str. One of the schedulers supported by Ray Tune.

  • scheduler_params – Parameters for the scheduler.

  • feature_cols – Feature column names if data is a Spark DataFrame.

  • label_cols – Target column names if data is a Spark DataFrame.
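
A minimal sketch of passing data as a function that takes config and returns a tuple of ndarrays in the form of (x, y), using synthetic data for illustration (auto_xgb_clf and search_space are assumed to be defined as in the class example above):

>>> import numpy as np
>>> def train_data_creator(config):
...     # any data-loading logic can go here; config holds trial hyperparameters
...     x = np.random.randn(1000, 10)
...     y = np.random.randint(0, 2, size=1000)
...     return x, y
>>> auto_xgb_clf.fit(data=train_data_creator,
                     metric="error",
                     metric_mode="min",
                     n_sampling=2,
                     search_space=search_space)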

class bigdl.orca.automl.xgboost.auto_xgb.AutoXGBRegressor(logs_dir: str = '/tmp/auto_xgb_regressor_logs', cpus_per_trial: int = 1, name: Optional[str] = None, remote_dir: Optional[str] = None, **xgb_configs)[source]#

Bases: bigdl.orca.automl.auto_estimator.AutoEstimator

Automated XGBoost regressor.

Example

>>> search_space = {"n_estimators": hp.grid_search([800, 1000]),
                    "max_depth": hp.grid_search([10, 15]),
                    "lr": hp.loguniform(1e-4, 1e-1),
                    "min_child_weight": hp.choice([1, 2, 3]),
                    }
>>> auto_xgb_reg = AutoXGBRegressor(cpus_per_trial=2,
                                    name="auto_xgb_regressor",
                                    **config)
>>> auto_xgb_reg.fit(data=(X_train, y_train),
                     validation_data=(X_val, y_val),
                     metric="rmse",
                     n_sampling=1,
                     search_space=search_space)
>>> best_model = auto_xgb_reg.get_best_model()
Parameters
  • logs_dir – Local directory to save logs and results. It defaults to “/tmp/auto_xgb_regressor_logs”.

  • cpus_per_trial – Int. Number of cpus for each trial. It defaults to 1. The value will also be assigned to n_jobs in xgboost, which is the number of parallel threads used to run xgboost.

  • name – Name of the auto XGBoost regressor.

  • remote_dir – String. Remote directory to sync training results and checkpoints. It defaults to None and does not take effect when running locally. When running in a cluster, it defaults to “hdfs:///tmp/{name}”.

  • xgb_configs – Other scikit-learn XGBoost parameters. You may refer to https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn for the parameter names to specify. Note that the cpus_per_trial value will be used directly for n_jobs in xgboost, so you shouldn’t specify n_jobs again.

fit(data: Union[partial, Tuple[ndarray, ndarray], DataFrame], epochs: int = 1, validation_data: Optional[Union[partial, Tuple[ndarray, ndarray], DataFrame]] = None, metric: Optional[Union[Callable, str]] = None, metric_mode: Optional[str] = None, metric_threshold: Optional[Union[float, int]] = None, n_sampling: int = 1, search_space: Optional[Dict] = None, search_alg: Optional[str] = None, search_alg_params: Optional[Dict] = None, scheduler: Optional[str] = None, scheduler_params: Optional[Dict] = None, feature_cols: Optional[List[str]] = None, label_cols: Optional[List[str]] = None) None[source]#

Automatically fit the model and search for the best hyperparameters.

Parameters
  • data – A Spark DataFrame, a tuple of ndarrays, or a function. If data is a tuple of ndarrays, it should be in the form of (x, y), where x is the training input data and y is the training target data. If data is a function, it should take config as an argument and return a tuple of ndarrays in the form of (x, y).

  • epochs – Max number of epochs to train in each trial. Defaults to 1. If you have also set metric_threshold, a trial will stop if either it has been optimized to the metric_threshold or it has been trained for {epochs} epochs.

  • validation_data – Validation data. The validation data type should be the same as data.

  • metric – String or customized evaluation metric function. If a string, metric is the name of the evaluation metric to optimize, e.g. “mse”. If a callable function, its signature should be func(y_true, y_pred), where y_true and y_pred are numpy ndarrays; the function should return a float value as the evaluation result.

  • metric_mode – One of [“min”, “max”]. “max” means a greater metric value is better. You have to specify metric_mode if you use a customized metric function; you don’t have to specify it if you use a built-in metric from bigdl.orca.automl.metrics.Evaluator.

  • metric_threshold – A trial will be terminated when the metric threshold is met.

  • n_sampling – Number of times to sample from the search_space. Defaults to 1. If hp.grid_search is in search_space, the grid will be repeated n_sampling times. If this is -1, (virtually) infinite samples are generated until a stopping condition is met.

  • search_space – A dict defining the search space.

  • search_alg – str. One of the search algorithms supported by Ray Tune, i.e. “variant_generator”, “random”, “ax”, “dragonfly”, “skopt”, “hyperopt”, “bayesopt”, “bohb”, “nevergrad”, “optuna”, “zoopt” and “sigopt”.

  • search_alg_params – Extra parameters for the search algorithm besides search_space, metric and the searcher mode.

  • scheduler – str. One of the schedulers supported by Ray Tune.

  • scheduler_params – Parameters for the scheduler.

  • feature_cols – Feature column names if data is a Spark DataFrame; see the sketch below.

  • label_cols – Target column names if data is a Spark DataFrame.
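
When data is a Spark DataFrame, feature_cols and label_cols select the input and target columns. A minimal sketch, with a hypothetical DataFrame df/val_df and column names (auto_xgb_reg and search_space as in the class example above):

>>> auto_xgb_reg.fit(data=df,
                     validation_data=val_df,
                     feature_cols=["f1", "f2", "f3"],
                     label_cols=["target"],
                     metric="rmse",
                     n_sampling=1,
                     search_space=search_space)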