Orca AutoML#

orca.automl.auto_estimator#

A general estimator that supports automatic model tuning. It allows users to fit a model and search for the best hyperparameters.

class bigdl.orca.automl.auto_estimator.AutoEstimator(model_builder: ModelBuilder, logs_dir: str = '/tmp/auto_estimator_logs', resources_per_trial: Optional[Dict[str, int]] = None, remote_dir: Optional[str] = None, name: Optional[str] = None)[source]#

Bases: object

Example

>>> auto_est = AutoEstimator.from_torch(model_creator=model_creator,
                                        optimizer=get_optimizer,
                                        loss=nn.BCELoss(),
                                        logs_dir="/tmp/zoo_automl_logs",
                                        resources_per_trial={"cpu": 2},
                                        name="test_fit")
>>> auto_est.fit(data=data,
                 validation_data=validation_data,
                 search_space=create_linear_search_space(),
                 n_sampling=4,
                 epochs=1,
                 metric="accuracy")
>>> best_model = auto_est.get_best_model()
static from_torch(*, model_creator: Callable, optimizer: Callable, loss: Callable, logs_dir: str = '/tmp/auto_estimator_logs', resources_per_trial: Optional[Dict[str, int]] = None, name: str = 'auto_pytorch_estimator', remote_dir: Optional[str] = None) bigdl.orca.automl.auto_estimator.AutoEstimator[source]#

Create an AutoEstimator for torch.

Parameters
  • model_creator – PyTorch model creator function.

  • optimizer – PyTorch optimizer creator function or PyTorch optimizer name (string). Note that if you pass an optimizer name, you should specify a learning rate search space with the key “lr” or LR_NAME (from bigdl.orca.automl.pytorch_utils import LR_NAME). If no learning rate search space is specified, the default learning rate of 1e-3 will be used for all estimators. See the sketch below for passing an optimizer by name.

  • loss – PyTorch loss instance, PyTorch loss creator function, or PyTorch loss name (string).

  • logs_dir – Local directory to save logs and results. It defaults to “/tmp/auto_estimator_logs”.

  • resources_per_trial – Dict. Resources for each trial, e.g. {“cpu”: 2}.

  • name – Name of the auto estimator. It defaults to “auto_pytorch_estimator”.

  • remote_dir – String. Remote directory to sync training results and checkpoints. It defaults to None and does not take effect when running locally. When running in a cluster, it defaults to “hdfs:///tmp/{name}”.

Returns

an AutoEstimator object.
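
For example, a minimal sketch of creating an AutoEstimator with the optimizer passed by name and a learning-rate search space (model_creator and data are assumed to be defined as in the class example above; “Adam” stands for any torch optimizer name):

>>> from bigdl.orca.automl import hp
>>> auto_est = AutoEstimator.from_torch(model_creator=model_creator,
                                        optimizer="Adam",
                                        loss="BCELoss",
                                        resources_per_trial={"cpu": 2},
                                        name="lr_by_name")
>>> auto_est.fit(data=data,
                 search_space={"lr": hp.loguniform(1e-4, 1e-1)},
                 n_sampling=4,
                 epochs=1,
                 metric="accuracy")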

static from_keras(*, model_creator: Callable, logs_dir: str = '/tmp/auto_estimator_logs', resources_per_trial: Optional[Dict[str, int]] = None, name: str = 'auto_keras_estimator', remote_dir: Optional[str] = None) bigdl.orca.automl.auto_estimator.AutoEstimator[source]#

Create an AutoEstimator for TensorFlow Keras.

Parameters
  • model_creator – TensorFlow Keras model creator function; see the sketch below.

  • logs_dir – Local directory to save logs and results. It defaults to “/tmp/auto_estimator_logs”.

  • resources_per_trial – Dict. Resources for each trial, e.g. {“cpu”: 2}.

  • name – Name of the auto estimator. It defaults to “auto_keras_estimator”.

  • remote_dir – String. Remote directory to sync training results and checkpoints. It defaults to None and does not take effect when running locally. When running in a cluster, it defaults to “hdfs:///tmp/{name}”.

Returns

an AutoEstimator object.
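
A minimal sketch of a TensorFlow Keras model_creator, assuming the search space supplies the hypothetical “hidden_size” and “lr” keys in config:

>>> import tensorflow as tf
>>> def model_creator(config):
...     # config holds one sampled set of hyperparameters per trial
...     model = tf.keras.models.Sequential([
...         tf.keras.layers.Dense(config["hidden_size"], activation="relu"),
...         tf.keras.layers.Dense(1)])
...     model.compile(loss="mse",
...                   optimizer=tf.keras.optimizers.Adam(config["lr"]))
...     return model
>>> auto_est = AutoEstimator.from_keras(model_creator=model_creator,
                                        resources_per_trial={"cpu": 2},
                                        name="auto_keras_estimator")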

fit(data: Union[Callable, Tuple[ndarray, ndarray], DataFrame], epochs: int = 1, validation_data: Optional[Union[Callable, Tuple[ndarray, ndarray], DataFrame]] = None, metric: Optional[Union[Callable, str]] = None, metric_mode: Optional[str] = None, metric_threshold: Optional[Union[Function, float, int]] = None, n_sampling: int = 1, search_space: Optional[Dict] = None, search_alg: Optional[str] = None, search_alg_params: Optional[Dict] = None, scheduler: Optional[str] = None, scheduler_params: Optional[Dict] = None, feature_cols: Optional[List[str]] = None, label_cols: Optional[List[str]] = None) None[source]#

Automatically fit the model and search for the best hyperparameters.

Parameters
  • data – Train data. If the AutoEstimator is created with from_torch, data can be a tuple of ndarrays, a PyTorch DataLoader, or a function that takes a config dictionary as a parameter and returns a PyTorch DataLoader. If the AutoEstimator is created with from_keras, data can be a tuple of ndarrays or a function that takes a config dictionary as a parameter and returns a TensorFlow Dataset. If data is a tuple of ndarrays, it should be in the form of (x, y), where x is the training input data and y is the training target data.

  • epochs – Max number of epochs to train in each trial. Defaults to 1. If you have also set metric_threshold, a trial will stop if either it has been optimized to the metric_threshold or it has been trained for {epochs} epochs.

  • validation_data – Validation data. The validation data type should be the same as data.

  • metric – String or customized evaluation metric function. If a string, metric is the name of the evaluation metric to optimize, e.g. “mse”. If a callable function, its signature should be func(y_true, y_pred), where y_true and y_pred are numpy ndarrays; the function should return a float value as the evaluation result. See the sketch after this parameter list for a customized metric.

  • metric_mode – One of [“min”, “max”]. “max” means a greater metric value is better. You have to specify metric_mode if you use a customized metric function; you don’t have to specify it if you use a built-in metric from bigdl.orca.automl.metrics.Evaluator.

  • metric_threshold – A trial will be terminated when the metric threshold is met.

  • n_sampling – Number of times to sample from the search_space. Defaults to 1. If hp.grid_search is in search_space, the grid will be repeated n_sampling times. If this is -1, (virtually) infinite samples are generated until a stopping condition is met.

  • search_space – A dict defining the search space.

  • search_alg – str. One of the search algorithms supported by Ray Tune, i.e. “variant_generator”, “random”, “ax”, “dragonfly”, “skopt”, “hyperopt”, “bayesopt”, “bohb”, “nevergrad”, “optuna”, “zoopt” and “sigopt”.

  • search_alg_params – Extra parameters for the search algorithm besides search_space, metric and the searcher mode.

  • scheduler – str. One of the schedulers supported by Ray Tune.

  • scheduler_params – Parameters for the scheduler.

  • feature_cols – Feature column names if data is a Spark DataFrame.

  • label_cols – Target column names if data is a Spark DataFrame.
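
A minimal sketch of fit with a customized metric function (the metric itself is arbitrary); metric_mode is required here because the metric is not built in:

>>> import numpy as np
>>> def mean_abs_err(y_true, y_pred):
...     # signature func(y_true, y_pred) returning a float, as required
...     return float(np.mean(np.abs(y_true - y_pred)))
>>> auto_est.fit(data=data,
                 validation_data=validation_data,
                 search_space=create_linear_search_space(),
                 n_sampling=4,
                 epochs=1,
                 metric=mean_abs_err,
                 metric_mode="min")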

get_best_model()[source]#

Return the best model found by the AutoEstimator.

Returns

the best model instance

get_best_config()[source]#

Return the best config found by the AutoEstimator.

Returns

A dictionary of the best hyperparameters

orca.automl.hp#

Sampling specs to be used in search space configuration.

bigdl.orca.automl.hp.uniform(lower: float, upper: float) ray.tune.sample.Float[source]#

Sample a float uniformly between lower and upper.

Parameters
  • lower – Lower bound of the sampling range.

  • upper – Upper bound of the sampling range.

bigdl.orca.automl.hp.quniform(lower: float, upper: float, q: float) ray.tune.sample.Float[source]#

Sample a float uniformly between lower and upper. Round the result to the nearest multiple of q; the upper bound is included.

Parameters
  • lower – Lower bound of the sampling range.

  • upper – Upper bound of the sampling range.

  • q – Granularity for increment.

bigdl.orca.automl.hp.loguniform(lower: float, upper: float, base: int = 10) ray.tune.sample.Float[source]#

Sample a float between lower and upper, with the exponent distributed uniformly between log_{base}(lower) and log_{base}(upper).

Parameters
  • lower – Lower bound of the sampling range.

  • upper – Upper bound of the sampling range.

  • base – Log base for the distribution. Defaults to 10.

bigdl.orca.automl.hp.qloguniform(lower: float, upper: float, q: float, base: int = 10) ray.tune.sample.Float[source]#

Sample a float between lower and upper, with the exponent distributed uniformly between log_{base}(lower) and log_{base}(upper). Round the result to the nearest multiple of q; the upper bound is included.

Parameters
  • lower – Lower bound of the sampling range.

  • upper – Upper bound of the sampling range.

  • q – Granularity for increment.

  • base – Log base for the distribution. Defaults to 10.

bigdl.orca.automl.hp.randn(mean: float = 0.0, std: float = 1.0) ray.tune.sample.Float[source]#

Sample a float from a normal distribution.

Parameters
  • mean – Mean of the normal distribution. Defaults to 0.0.

  • std – Std of the normal distribution. Defaults to 1.0.

bigdl.orca.automl.hp.qrandn(mean: float, std: float, q: float) ray.tune.sample.Float[source]#

Sample a float from a normal distribution. Round the result to the nearest multiple of q.

Parameters
  • mean – Mean of the normal distribution.

  • std – Std of the normal distribution.

  • q – Granularity for increment.

bigdl.orca.automl.hp.randint(lower: int, upper: int) ray.tune.sample.Integer[source]#

Uniformly sample an integer between lower and upper (both inclusive).

Parameters
  • lower – Lower bound of the sampling range.

  • upper – Upper bound of the sampling range.

bigdl.orca.automl.hp.qrandint(lower: int, upper: int, q: int = 1) ray.tune.sample.Integer[source]#

Uniformly sample an integer between lower and upper (both inclusive). Round the result to the nearest multiple of q.

Parameters
  • lower – Lower bound of the sampling range.

  • upper – Upper bound of the sampling range.

  • q – Integer granularity for increment. Defaults to 1.

bigdl.orca.automl.hp.choice(categories: List) ray.tune.sample.Categorical[source]#

Uniformly sample from a list.

Parameters

categories – A list to be sampled.

bigdl.orca.automl.hp.choice_n(categories: List, min_items: int, max_items: int) ray.tune.sample.Function[source]#

Sample a subset from a list.

Parameters
  • categories – A list to be sampled.

  • min_items – Minimum number of items to be sampled.

  • max_items – Maximum number of items to be sampled.

bigdl.orca.automl.hp.sample_from(func: Callable) Callable[source]#

Sample from a function.

Parameters

func – The function to be sampled.

bigdl.orca.automl.hp.grid_search(values: List) Dict[source]#

Specify grid search over a list.

Parameters

values – A list to be grid searched.
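
A sketch of a search space dict combining these sampling specs; the hyperparameter names are hypothetical and model-dependent:

>>> import numpy as np
>>> from bigdl.orca.automl import hp
>>> search_space = {
...     "lr": hp.loguniform(1e-4, 1e-1),            # float, log-uniform
...     "dropout": hp.uniform(0.1, 0.5),            # float, uniform
...     "batch_size": hp.qrandint(16, 128, 16),     # int, multiples of 16
...     "hidden_size": hp.grid_search([32, 64]),    # every value is tried
...     "activation": hp.choice(["relu", "tanh"]),  # uniform pick from a list
...     "seed": hp.sample_from(lambda spec: np.random.randint(0, 1000)),
... }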

orca.automl.metrics#

Evaluate unscaled metrics between the true values (y_true) and the predicted values (y_pred).

bigdl.orca.automl.metrics.sMAPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the symmetric mean absolute percentage error (sMAPE).

\[\text{sMAPE} = \frac{100\%}{n} \sum_{t=1}^n \frac{|y_t-\hat{y_t}|}{|y_t|+|\hat{y_t}|}\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
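
For reference, a minimal numpy sketch of the formula above (not the library implementation), for 1-D inputs:

>>> import numpy as np
>>> def smape_sketch(y_true, y_pred):
...     # 100%/n * sum of |y_t - yhat_t| / (|y_t| + |yhat_t|)
...     return 100.0 * np.mean(np.abs(y_true - y_pred)
...                            / (np.abs(y_true) + np.abs(y_pred)))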

bigdl.orca.automl.metrics.MPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the mean percentage error (MPE).

\[\text{MPE} = \frac{100\%}{n}\sum_{t=1}^n \frac{y_t-\hat{y_t}}{y_t}\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.MAPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the mean absolute percentage error (MAPE).

\[\text{MAPE} = \frac{100\%}{n}\sum_{t=1}^n |\frac{y_t-\hat{y_t}}{y_t}|\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.MDAPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the median absolute percentage error (MDAPE).

\[\text{MDAPE} = 100\%\ median(|\frac{y_1-\hat{y_1}}{y_1}|, \ldots, |\frac{y_n-\hat{y_n}}{y_n}|)\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.sMDAPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the symmetric median absolute percentage error (sMDAPE).

\[\text{sMDAPE} = 100\%\ median(\frac{|y_1-\hat{y_1}|}{|y_1|+|\hat{y_1}|}, \ldots, \frac{|y_n-\hat{y_n}|}{|y_n|+|\hat{y_n}|})\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.ME(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the mean error (ME).

\[\text{ME} = \frac{1}{n}\sum_{t=1}^n y_t-\hat{y_t}\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.MSPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the mean squared percentage error (MSPE).

\[\text{MSPE} = \frac{100\%}{n}\sum_{t=1}^n (\frac{y_t-\hat{y_t}}{y_t})^2\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.MSLE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the mean squared log error (MSLE).

\[\text{MSLE} = \frac{1}{n}\sum_{t=1}^n (log_e(1+y_t)-log_e(1+\hat{y_t}))^2\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.R2(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the R2 score.

\[R^2 = 1-\frac{\sum_{t=1}^n (y_t-\hat{y_t})^2}{\sum_{t=1}^n (y_t-\bar{y})^2}\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A floating point value (the best value is 1.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.MAE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the mean absolute error (MAE).

\[\text{MAE} = \frac{1}{n}\sum_{t=1}^n |y_t-\hat{y_t}|\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.RMSE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray][source]#

Calculate the square root of the mean squared error (RMSE).

\[\text{RMSE} = \sqrt{(\frac{1}{n}\sum_{t=1}^n (y_t-\hat{y_t})^2)}\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.MSE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'uniform_average') Union[numpy.float64, numpy.ndarray][source]#

Calculate the mean squared error (MSE).

\[\text{MSE} = \frac{1}{n}\sum_{t=1}^n (y_t-\hat{y_t})^2\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.

bigdl.orca.automl.metrics.Accuracy(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput=None) Union[numpy.float64, numpy.ndarray][source]#

Calculate the accuracy score (Accuracy).

\[\text{Accuracy} = \frac{1}{n}\sum_{t=1}^n 1(y_t=\hat{y_t})\]
Parameters
  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

Returns

Float or ndarray of floats. A non-negative floating point value (the best value is 1.0), or an array of floating point values, one for each individual target.

class bigdl.orca.automl.metrics.Evaluator[source]#

Bases: object

Evaluate metrics for y_true and y_pred.

static evaluate(metric: str, y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') numpy.float64[source]#

Evaluate a specific metric for y_true and y_pred.

Parameters
  • metric – String in [‘me’, ‘mae’, ‘mse’, ‘rmse’, ‘msle’, ‘r2’ , ‘mpe’, ‘mape’, ‘mspe’, ‘smape’, ‘mdape’, ‘smdape’, ‘accuracy’]

  • y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.

  • y_pred – Array-like of shape = (n_samples, *). Estimated target values.

  • multioutput – String in [‘raw_values’, ‘uniform_average’]

Returns

Float or ndarray of floats. A floating point value, or an array of floating point values, one for each individual target.

orca.automl.auto_xgb#

Automatic hyperparameter optimization for XGBoost models.

AutoXGBoost inherits from AutoEstimator. Refer to the AutoEstimator API Guide for more APIs.

class bigdl.orca.automl.xgboost.auto_xgb.AutoXGBClassifier(logs_dir: str = '/tmp/auto_xgb_classifier_logs', cpus_per_trial: int = 1, name: Optional[str] = None, remote_dir: Optional[str] = None, **xgb_configs)[source]#

Bases: bigdl.orca.automl.auto_estimator.AutoEstimator

Automated XGBoost classifier.

Example

>>> search_space = {"n_estimators": hp.grid_search([50, 1000]),
                    "max_depth": hp.grid_search([2, 15]),
                    "lr": hp.loguniform(1e-4, 1e-1)}
>>> auto_xgb_clf = AutoXGBClassifier(cpus_per_trial=4,
                                     name="auto_xgb_classifier",
                                     **config)
>>> auto_xgb_clf.fit(data=(X_train, y_train),
                     validation_data=(X_val, y_val),
                     metric="error",
                     metric_mode="min",
                     n_sampling=1,
                     search_space=search_space)
>>> best_model = auto_xgb_clf.get_best_model()
Parameters
  • logs_dir – Local directory to save logs and results. It defaults to “/tmp/auto_xgb_classifier_logs”.

  • cpus_per_trial – Int. Number of cpus for each trial. It defaults to 1. The value will also be assigned to n_jobs in xgboost, which is the number of parallel threads used to run xgboost.

  • name – Name of the auto XGBoost classifier.

  • remote_dir – String. Remote directory to sync training results and checkpoints. It defaults to None and does not take effect when running locally. When running in a cluster, it defaults to “hdfs:///tmp/{name}”.

  • xgb_configs – Other scikit-learn XGBoost parameters. You may refer to https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn for the parameter names to specify. Note that the cpus_per_trial value will be used directly for n_jobs in xgboost, so you shouldn’t specify n_jobs again.

fit(data: Union[partial, Tuple[ndarray, ndarray], DataFrame], epochs: int = 1, validation_data: Optional[Union[partial, Tuple[ndarray, ndarray], DataFrame]] = None, metric: Optional[Union[Callable, str]] = None, metric_mode: Optional[str] = None, metric_threshold: Optional[Union[int, float]] = None, n_sampling: int = 1, search_space: Optional[Dict] = None, search_alg: Optional[str] = None, search_alg_params: Optional[Dict] = None, scheduler: Optional[str] = None, scheduler_params: Optional[Dict] = None, feature_cols: Optional[List[str]] = None, label_cols: Optional[List[str]] = None) None[source]#

Automatically fit the model and search for the best hyperparameters.

Parameters
  • data – A Spark DataFrame, a tuple of ndarrays, or a function. If data is a tuple of ndarrays, it should be in the form of (x, y), where x is the training input data and y is the training target data. If data is a function, it should take config as an argument and return a tuple of ndarrays in the form of (x, y); see the sketch after this parameter list.

  • epochs – Max number of epochs to train in each trial. Defaults to 1. If you have also set metric_threshold, a trial will stop if either it has been optimized to the metric_threshold or it has been trained for {epochs} epochs.

  • validation_data – Validation data. The validation data type should be the same as data.

  • metric – String or customized evaluation metric function. If a string, metric is the name of the evaluation metric to optimize, e.g. “mse”. If a callable function, its signature should be func(y_true, y_pred), where y_true and y_pred are numpy ndarrays; the function should return a float value as the evaluation result.

  • metric_mode – One of [“min”, “max”]. “max” means a greater metric value is better. You have to specify metric_mode if you use a customized metric function; you don’t have to specify it if you use a built-in metric from bigdl.orca.automl.metrics.Evaluator.

  • metric_threshold – A trial will be terminated when the metric threshold is met.

  • n_sampling – Number of times to sample from the search_space. Defaults to 1. If hp.grid_search is in search_space, the grid will be repeated n_sampling times. If this is -1, (virtually) infinite samples are generated until a stopping condition is met.

  • search_space – A dict defining the search space.

  • search_alg – str. One of the search algorithms supported by Ray Tune, i.e. “variant_generator”, “random”, “ax”, “dragonfly”, “skopt”, “hyperopt”, “bayesopt”, “bohb”, “nevergrad”, “optuna”, “zoopt” and “sigopt”.

  • search_alg_params – Extra parameters for the search algorithm besides search_space, metric and the searcher mode.

  • scheduler – str. One of the schedulers supported by Ray Tune.

  • scheduler_params – Parameters for the scheduler.

  • feature_cols – Feature column names if data is a Spark DataFrame.

  • label_cols – Target column names if data is a Spark DataFrame.
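
A minimal sketch of passing data as a function that takes config and returns a tuple of ndarrays in the form of (x, y), using synthetic data for illustration (auto_xgb_clf and search_space are assumed to be defined as in the class example above):

>>> import numpy as np
>>> def train_data_creator(config):
...     # any data-loading logic can go here; config holds trial hyperparameters
...     x = np.random.randn(1000, 10)
...     y = np.random.randint(0, 2, size=1000)
...     return x, y
>>> auto_xgb_clf.fit(data=train_data_creator,
                     metric="error",
                     metric_mode="min",
                     n_sampling=2,
                     search_space=search_space)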

class bigdl.orca.automl.xgboost.auto_xgb.AutoXGBRegressor(logs_dir: str = '/tmp/auto_xgb_regressor_logs', cpus_per_trial: int = 1, name: Optional[str] = None, remote_dir: Optional[str] = None, **xgb_configs)[source]#

Bases: bigdl.orca.automl.auto_estimator.AutoEstimator

Automated XGBoost regressor.

Example

>>> search_space = {"n_estimators": hp.grid_search([800, 1000]),
                    "max_depth": hp.grid_search([10, 15]),
                    "lr": hp.loguniform(1e-4, 1e-1),
                    "min_child_weight": hp.choice([1, 2, 3]),
                    }
>>> auto_xgb_reg = AutoXGBRegressor(cpus_per_trial=2,
                                    name="auto_xgb_regressor",
                                    **config)
>>> auto_xgb_reg.fit(data=(X_train, y_train),
                     validation_data=(X_val, y_val),
                     metric="rmse",
                     n_sampling=1,
                     search_space=search_space)
>>> best_model = auto_xgb_reg.get_best_model()
Parameters
  • logs_dir – Local directory to save logs and results. It defaults to “/tmp/auto_xgb_regressor_logs”.

  • cpus_per_trial – Int. Number of cpus for each trial. It defaults to 1. The value will also be assigned to n_jobs in xgboost, which is the number of parallel threads used to run xgboost.

  • name – Name of the auto XGBoost regressor.

  • remote_dir – String. Remote directory to sync training results and checkpoints. It defaults to None and does not take effect when running locally. When running in a cluster, it defaults to “hdfs:///tmp/{name}”.

  • xgb_configs – Other scikit-learn XGBoost parameters. You may refer to https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn for the parameter names to specify. Note that the cpus_per_trial value will be used directly for n_jobs in xgboost, so you shouldn’t specify n_jobs again.

fit(data: Union[partial, Tuple[ndarray, ndarray], DataFrame], epochs: int = 1, validation_data: Optional[Union[partial, Tuple[ndarray, ndarray], DataFrame]] = None, metric: Optional[Union[Callable, str]] = None, metric_mode: Optional[str] = None, metric_threshold: Optional[Union[float, int]] = None, n_sampling: int = 1, search_space: Optional[Dict] = None, search_alg: Optional[str] = None, search_alg_params: Optional[Dict] = None, scheduler: Optional[str] = None, scheduler_params: Optional[Dict] = None, feature_cols: Optional[List[str]] = None, label_cols: Optional[List[str]] = None) None[source]#

Automatically fit the model and search for the best hyperparameters.

Parameters
  • data – A Spark DataFrame, a tuple of ndarrays, or a function. If data is a tuple of ndarrays, it should be in the form of (x, y), where x is the training input data and y is the training target data. If data is a function, it should take config as an argument and return a tuple of ndarrays in the form of (x, y).

  • epochs – Max number of epochs to train in each trial. Defaults to 1. If you have also set metric_threshold, a trial will stop if either it has been optimized to the metric_threshold or it has been trained for {epochs} epochs.

  • validation_data – Validation data. The validation data type should be the same as data.

  • metric – String or customized evaluation metric function. If a string, metric is the name of the evaluation metric to optimize, e.g. “mse”. If a callable function, its signature should be func(y_true, y_pred), where y_true and y_pred are numpy ndarrays; the function should return a float value as the evaluation result.

  • metric_mode – One of [“min”, “max”]. “max” means a greater metric value is better. You have to specify metric_mode if you use a customized metric function; you don’t have to specify it if you use a built-in metric from bigdl.orca.automl.metrics.Evaluator.

  • metric_threshold – A trial will be terminated when the metric threshold is met.

  • n_sampling – Number of times to sample from the search_space. Defaults to 1. If hp.grid_search is in search_space, the grid will be repeated n_sampling times. If this is -1, (virtually) infinite samples are generated until a stopping condition is met.

  • search_space – A dict defining the search space.

  • search_alg – str. One of the search algorithms supported by Ray Tune, i.e. “variant_generator”, “random”, “ax”, “dragonfly”, “skopt”, “hyperopt”, “bayesopt”, “bohb”, “nevergrad”, “optuna”, “zoopt” and “sigopt”.

  • search_alg_params – Extra parameters for the search algorithm besides search_space, metric and the searcher mode.

  • scheduler – str. One of the schedulers supported by Ray Tune.

  • scheduler_params – Parameters for the scheduler.

  • feature_cols – Feature column names if data is a Spark DataFrame; see the sketch below.

  • label_cols – Target column names if data is a Spark DataFrame.
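
When data is a Spark DataFrame, feature_cols and label_cols select the input and target columns. A minimal sketch, with a hypothetical DataFrame df/val_df and column names (auto_xgb_reg and search_space as in the class example above):

>>> auto_xgb_reg.fit(data=df,
                     validation_data=val_df,
                     feature_cols=["f1", "f2", "f3"],
                     label_cols=["target"],
                     metric="rmse",
                     n_sampling=1,
                     search_space=search_space)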