Orca AutoML
orca.automl.auto_estimator
A general estimator that supports automatic model tuning. It allows users to fit a model and search for its best hyperparameters.
- class bigdl.orca.automl.auto_estimator.AutoEstimator(model_builder: ModelBuilder, logs_dir: str = '/tmp/auto_estimator_logs', resources_per_trial: Optional[Dict[str, int]] = None, remote_dir: Optional[str] = None, name: Optional[str] = None)[source]#
Bases:
object
Example
>>> auto_est = AutoEstimator.from_torch(model_creator=model_creator,
...                                     optimizer=get_optimizer,
...                                     loss=nn.BCELoss(),
...                                     logs_dir="/tmp/zoo_automl_logs",
...                                     resources_per_trial={"cpu": 2},
...                                     name="test_fit")
>>> auto_est.fit(data=data,
...              validation_data=validation_data,
...              search_space=create_linear_search_space(),
...              n_sampling=4,
...              epochs=1,
...              metric="accuracy")
>>> best_model = auto_est.get_best_model()
- static from_torch(*, model_creator: Callable, optimizer: Callable, loss: Callable, logs_dir: str = '/tmp/auto_estimator_logs', resources_per_trial: Optional[Dict[str, int]] = None, name: str = 'auto_pytorch_estimator', remote_dir: Optional[str] = None) bigdl.orca.automl.auto_estimator.AutoEstimator [source]#
Create an AutoEstimator for torch.
- Parameters
model_creator – PyTorch model creator function.
optimizer – PyTorch optimizer creator function or PyTorch optimizer name (string). If you pass an optimizer name, specify the learning rate search space under the key “lr” or LR_NAME (from bigdl.orca.automl.pytorch_utils import LR_NAME). If no learning rate search space is specified, the default learning rate of 1e-3 is used for all estimators.
loss – PyTorch loss instance, PyTorch loss creator function, or PyTorch loss name (string).
logs_dir – Local directory to save logs and results. It defaults to “/tmp/auto_estimator_logs”.
resources_per_trial – Dict. Resources for each trial, e.g. {“cpu”: 2}.
name – Name of the auto estimator. It defaults to “auto_pytorch_estimator”.
remote_dir – String. Remote directory to sync training results and checkpoints. It defaults to None and takes no effect when running locally. When running on a cluster, it defaults to “hdfs:///tmp/{name}”.
- Returns
an AutoEstimator object.
- static from_keras(*, model_creator: Callable, logs_dir: str = '/tmp/auto_estimator_logs', resources_per_trial: Optional[Dict[str, int]] = None, name: str = 'auto_keras_estimator', remote_dir: Optional[str] = None) bigdl.orca.automl.auto_estimator.AutoEstimator [source]#
Create an AutoEstimator for TensorFlow Keras.
- Parameters
model_creator – TensorFlow Keras model creator function.
logs_dir – Local directory to save logs and results. It defaults to “/tmp/auto_estimator_logs”.
resources_per_trial – Dict. Resources for each trial, e.g. {“cpu”: 2}.
name – Name of the auto estimator. It defaults to “auto_keras_estimator”.
remote_dir – String. Remote directory to sync training results and checkpoints. It defaults to None and takes no effect when running locally. When running on a cluster, it defaults to “hdfs:///tmp/{name}”.
- Returns
an AutoEstimator object.
- fit(data: Union[Callable, Tuple[ndarray, ndarray], DataFrame], epochs: int = 1, validation_data: Optional[Union[Callable, Tuple[ndarray, ndarray], DataFrame]] = None, metric: Optional[Union[Callable, str]] = None, metric_mode: Optional[str] = None, metric_threshold: Optional[Union[Function, float, int]] = None, n_sampling: int = 1, search_space: Optional[Dict] = None, search_alg: Optional[str] = None, search_alg_params: Optional[Dict] = None, scheduler: Optional[str] = None, scheduler_params: Optional[Dict] = None, feature_cols: Optional[List[str]] = None, label_cols: Optional[List[str]] = None) None [source]#
Automatically fit the model and search for the best hyperparameters.
- Parameters
data – train data. If the AutoEstimator is created with from_torch, data can be a tuple of ndarrays or a PyTorch DataLoader or a function that takes a config dictionary as parameter and returns a PyTorch DataLoader. If the AutoEstimator is created with from_keras, data can be a tuple of ndarrays or a function that takes a config dictionary as parameter and returns a Tensorflow Dataset. If data is a tuple of ndarrays, it should be in the form of (x, y), where x is training input data and y is training target data.
epochs – Max number of epochs to train in each trial. Defaults to 1. If you have also set metric_threshold, a trial will stop if either it has been optimized to the metric_threshold or it has been trained for {epochs} epochs.
validation_data – Validation data. Validation data type should be the same as data.
metric – String or customized evaluation metric function. If a string, metric is the name of the evaluation metric to optimize, e.g. “mse”. If a callable, its signature should be func(y_true, y_pred), where y_true and y_pred are numpy ndarrays. The function should return a float value as the evaluation result.
metric_mode – One of [“min”, “max”]. “max” means a greater metric value is better. You must specify metric_mode if you use a customized metric function. You don’t have to specify it if you use a built-in metric from bigdl.orca.automl.metrics.Evaluator.
metric_threshold – A trial is terminated when the metric threshold is met.
n_sampling – Number of times to sample from the search_space. Defaults to 1. If hp.grid_search is in search_space, the grid is repeated n_sampling times. If this is -1, (virtually) infinite samples are generated until a stopping condition is met.
search_space – A dict defining the search space.
search_alg – str. Any searcher supported by Ray Tune (i.e. “variant_generator”, “random”, “ax”, “dragonfly”, “skopt”, “hyperopt”, “bayesopt”, “bohb”, “nevergrad”, “optuna”, “zoopt” and “sigopt”).
search_alg_params – Extra parameters for the search algorithm besides search_space, metric and searcher mode.
scheduler – str. Any scheduler supported by Ray Tune.
scheduler_params – Parameters for the scheduler.
feature_cols – Feature column names if data is a Spark DataFrame.
label_cols – Target column names if data is a Spark DataFrame.
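As a minimal sketch of the customized metric described above, the function below follows the documented signature func(y_true, y_pred) and returns a float. The name `mean_abs_pct_error` and the sample arrays are illustrative assumptions, not part of the library.

```python
import numpy as np


def mean_abs_pct_error(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Hypothetical custom metric: mean absolute percentage error, in percent."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Lower is better, so this metric would be paired with metric_mode="min".
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)


# Quick check on a small hand-made example.
score = mean_abs_pct_error(np.array([1.0, 2.0, 4.0]), np.array([1.0, 1.0, 5.0]))
```

A function like this would be passed as metric=mean_abs_pct_error together with metric_mode="min", since metric_mode is required for customized metrics.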
orca.automl.hp
Sampling specs to be used in search space configuration.
- bigdl.orca.automl.hp.uniform(lower: float, upper: float) ray.tune.sample.Float [source]#
Sample a float uniformly between lower and upper.
- Parameters
lower – Lower bound of the sampling range.
upper – Upper bound of the sampling range.
- bigdl.orca.automl.hp.quniform(lower: float, upper: float, q: float) ray.tune.sample.Float [source]#
Sample a float uniformly between lower and upper. Round the result to the nearest value with granularity q; upper is included.
- Parameters
lower – Lower bound of the sampling range.
upper – Upper bound of the sampling range.
q – Granularity for increment.
- bigdl.orca.automl.hp.loguniform(lower: float, upper: float, base: int = 10) ray.tune.sample.Float [source]#
Sample a float between lower and upper. The exponent is distributed uniformly between log_{base}(lower) and log_{base}(upper).
- Parameters
lower – Lower bound of the sampling range.
upper – Upper bound of the sampling range.
base – Log base for distribution. Defaults to 10.
- bigdl.orca.automl.hp.qloguniform(lower: float, upper: float, q: float, base: int = 10) ray.tune.sample.Float [source]#
Sample a float between lower and upper. The exponent is distributed uniformly between log_{base}(lower) and log_{base}(upper). Round the result to the nearest value with granularity q; upper is included.
- Parameters
lower – Lower bound of the sampling range.
upper – Upper bound of the sampling range.
q – Granularity for increment.
base – Log base for distribution. Defaults to 10.
- bigdl.orca.automl.hp.randn(mean: float = 0.0, std: float = 1.0) ray.tune.sample.Float [source]#
Sample a float from normal distribution.
- Parameters
mean – Mean of the normal distribution. Defaults to 0.0.
std – Standard deviation of the normal distribution. Defaults to 1.0.
- bigdl.orca.automl.hp.qrandn(mean: float, std: float, q: float) ray.tune.sample.Float [source]#
Sample a float from normal distribution. Round the result to nearest value with granularity q.
- Parameters
mean – Mean of the normal distribution.
std – Standard deviation of the normal distribution.
q – Granularity for increment.
- bigdl.orca.automl.hp.randint(lower: int, upper: int) ray.tune.sample.Integer [source]#
Uniformly sample an integer between lower and upper (both inclusive).
- Parameters
lower – Lower bound of the sampling range.
upper – Upper bound of the sampling range.
- bigdl.orca.automl.hp.qrandint(lower: int, upper: int, q: int = 1) ray.tune.sample.Integer [source]#
Uniformly sample an integer between lower and upper (both inclusive). Round the result to the nearest value with granularity q.
- Parameters
lower – Lower bound of the sampling range.
upper – Upper bound of the sampling range.
q – Integer granularity for increment.
- bigdl.orca.automl.hp.choice(categories: List) ray.tune.sample.Categorical [source]#
Uniformly sample from a list
- Parameters
categories – A list to be sampled.
- bigdl.orca.automl.hp.choice_n(categories: List, min_items: int, max_items: int) ray.tune.sample.Function [source]#
Sample a subset from a list
- Parameters
categories – A list to be sampled.
min_items – Minimum number of items to be sampled.
max_items – Maximum number of items to be sampled.
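To illustrate the distributions described above, here is a NumPy sketch of log-uniform sampling and of rounding to a granularity q. It mirrors the documented semantics only and is not the library's implementation; the helper names `sample_loguniform` and `quantize` are hypothetical.

```python
import numpy as np


def sample_loguniform(rng, lower, upper, base=10, size=1):
    """Draw values whose log_{base} is uniform between log_{base}(lower) and log_{base}(upper)."""
    log_low = np.log(lower) / np.log(base)
    log_high = np.log(upper) / np.log(base)
    exponents = rng.uniform(log_low, log_high, size=size)
    return np.power(float(base), exponents)


def quantize(values, q):
    """Round each value to the nearest multiple of q."""
    return np.round(np.asarray(values, dtype=float) / q) * q


rng = np.random.default_rng(0)
# Learning-rate style range: each decade between 1e-4 and 1e-1 is equally likely.
lrs = sample_loguniform(rng, 1e-4, 1e-1, size=1000)
# Granularity 0.25 snaps values onto the grid 0.0, 0.25, 0.5, ...
steps = quantize([0.12, 0.26, 0.49], q=0.25)
```

Sampling in log space is the usual choice for parameters such as learning rates, where the order of magnitude matters more than the absolute value.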
orca.automl.metrics
Evaluate unscaled metrics between true values (y_true) and predicted values (y_pred).
- bigdl.orca.automl.metrics.sMAPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray] [source]#
Calculate Symmetric mean absolute percentage error (sMAPE).
\[\text{sMAPE} = \frac{100\%}{n} \sum_{t=1}^n \frac{|y_t-\hat{y_t}|}{|y_t|+|\hat{y_t}|}\]
- Parameters
y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.
y_pred – Array-like of shape = (n_samples, *). Estimated target values.
multioutput – String in [‘raw_values’, ‘uniform_average’]
- Returns
Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
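The sMAPE formula above can be sketched in NumPy as follows. This is an illustration of the formula, not the library's code: the name `smape_sketch` is hypothetical, averaging over axis 0 (one value per target) is an assumption, and the case where both y_t and its prediction are zero is not handled.

```python
import numpy as np


def smape_sketch(y_true, y_pred, multioutput="raw_values"):
    """sMAPE in percent, following the formula in this section."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ratio = np.abs(y_true - y_pred) / (np.abs(y_true) + np.abs(y_pred))
    per_target = 100.0 * np.mean(ratio, axis=0)  # average over samples
    if multioutput == "uniform_average":
        return np.mean(per_target)  # single value across all targets
    return per_target


# Two samples, two targets; only the first target has an error.
y_true = np.array([[1.0, 10.0], [2.0, 20.0]])
y_pred = np.array([[1.0, 10.0], [4.0, 20.0]])
raw = smape_sketch(y_true, y_pred)
avg = smape_sketch(y_true, y_pred, multioutput="uniform_average")
```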
- bigdl.orca.automl.metrics.MPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray] [source]#
Calculate mean percentage error (MPE).
\[\text{MPE} = \frac{100\%}{n}\sum_{t=1}^n \frac{y_t-\hat{y_t}}{y_t}\]
- Parameters
y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.
y_pred – Array-like of shape = (n_samples, *). Estimated target values.
multioutput – String in [‘raw_values’, ‘uniform_average’]
- Returns
Float or ndarray of floats. A floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
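Following the MPE formula above, each error keeps its sign, so the result is negative when predictions are too high and errors of opposite sign can cancel. The sketch below illustrates this with NumPy; `mpe_sketch` is a hypothetical name and not the library's implementation.

```python
import numpy as np


def mpe_sketch(y_true, y_pred):
    """Mean percentage error per target, in percent (signed)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.mean((y_true - y_pred) / y_true, axis=0)


y_true = np.array([10.0, 20.0])
over = mpe_sketch(y_true, np.array([12.0, 24.0]))   # predictions too high
under = mpe_sketch(y_true, np.array([8.0, 16.0]))   # predictions too low
mixed = mpe_sketch(y_true, np.array([12.0, 16.0]))  # errors of opposite sign
```

Because of this cancellation, a value near 0.0 indicates low bias rather than low error; MAPE is the unsigned counterpart.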
- bigdl.orca.automl.metrics.MAPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray] [source]#
Calculate mean absolute percentage error (MAPE).
\[\text{MAPE} = \frac{100\%}{n}\sum_{t=1}^n \left|\frac{y_t-\hat{y_t}}{y_t}\right|\]
- Parameters
y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.
y_pred – Array-like of shape = (n_samples, *). Estimated target values.
multioutput – String in [‘raw_values’, ‘uniform_average’]
- Returns
Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
- bigdl.orca.automl.metrics.MDAPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray] [source]#
Calculate Median Absolute Percentage Error (MDAPE).
\[\text{MDAPE} = 100\%\ \operatorname{median}\left(\left|\frac{y_1-\hat{y_1}}{y_1}\right|, \ldots, \left|\frac{y_n-\hat{y_n}}{y_n}\right|\right)\]
- Parameters
y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.
y_pred – Array-like of shape = (n_samples, *). Estimated target values.
multioutput – String in [‘raw_values’, ‘uniform_average’]
- Returns
Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
- bigdl.orca.automl.metrics.sMDAPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray] [source]#
Calculate Symmetric Median Absolute Percentage Error (sMDAPE).
\[\text{sMDAPE} = 100\%\ \operatorname{median}\left(\frac{|y_1-\hat{y_1}|}{|y_1|+|\hat{y_1}|}, \ldots, \frac{|y_n-\hat{y_n}|}{|y_n|+|\hat{y_n}|}\right)\]
- Parameters
y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.
y_pred – Array-like of shape = (n_samples, *). Estimated target values.
multioutput – String in [‘raw_values’, ‘uniform_average’]
- Returns
Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
- bigdl.orca.automl.metrics.ME(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray] [source]#
Calculate Mean Error (ME).
\[\text{ME} = \frac{1}{n}\sum_{t=1}^n (y_t-\hat{y_t})\]
- Parameters
y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.
y_pred – Array-like of shape = (n_samples, *). Estimated target values.
multioutput – String in [‘raw_values’, ‘uniform_average’]
- Returns
Float or ndarray of floats. A floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
- bigdl.orca.automl.metrics.MSPE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray] [source]#
Calculate mean squared percentage error (MSPE).
\[\text{MSPE} = \frac{100\%}{n}\sum_{t=1}^n \left(\frac{y_t-\hat{y_t}}{y_t}\right)^2\]
- Parameters
y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.
y_pred – Array-like of shape = (n_samples, *). Estimated target values.
multioutput – String in [‘raw_values’, ‘uniform_average’]
- Returns
Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
- bigdl.orca.automl.metrics.MSLE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray] [source]#
Calculate the mean squared log error (MSLE).
\[\text{MSLE} = \frac{1}{n}\sum_{t=1}^n (\log_e(1+y_t)-\log_e(1+\hat{y_t}))^2\]
- Parameters
y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.
y_pred – Array-like of shape = (n_samples, *). Estimated target values.
multioutput – String in [‘raw_values’, ‘uniform_average’]
- Returns
Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
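A NumPy sketch of the MSLE formula above, using `np.log1p` for log_e(1 + y). The name `msle_sketch` is hypothetical, and the sketch assumes all values are greater than -1 so that the logarithm is defined.

```python
import numpy as np


def msle_sketch(y_true, y_pred):
    """Mean squared log error per target, following the formula in this section."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # log1p(x) computes log_e(1 + x) accurately for small x.
    log_diff = np.log1p(y_true) - np.log1p(y_pred)
    return np.mean(log_diff ** 2, axis=0)


# Second sample: log_e(1 + y) is 1 for the truth and 2 for the prediction.
value = msle_sketch([0.0, np.e - 1.0], [0.0, np.e ** 2 - 1.0])
```

Because the error is taken in log space, MSLE penalizes relative differences rather than absolute ones.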
- bigdl.orca.automl.metrics.R2(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray] [source]#
Calculate the r2 score.
\[R^2 = 1-\frac{\sum_{t=1}^n (y_t-\hat{y_t})^2}{\sum_{t=1}^n (y_t-\bar{y})^2}\]
- Parameters
y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.
y_pred – Array-like of shape = (n_samples, *). Estimated target values.
multioutput – String in [‘raw_values’, ‘uniform_average’]
- Returns
Float or ndarray of floats. A floating point value (the best value is 1.0; the score can be negative for a model that performs worse than predicting the mean), or an array of floating point values, one for each individual target.
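The R² formula above can be checked with a short NumPy sketch. `r2_sketch` is a hypothetical helper, not the library's code, and it assumes y_true is not constant (otherwise the denominator is zero).

```python
import numpy as np


def r2_sketch(y_true, y_pred):
    """Coefficient of determination per target, following the formula above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2, axis=0)                   # residual sum of squares
    ss_tot = np.sum((y_true - np.mean(y_true, axis=0)) ** 2, axis=0)  # total sum of squares
    return 1.0 - ss_res / ss_tot


y_true = np.array([1.0, 2.0, 3.0])
perfect = r2_sketch(y_true, y_true)                       # exact predictions
mean_only = r2_sketch(y_true, np.array([2.0, 2.0, 2.0]))  # always predict the mean
reversed_ = r2_sketch(y_true, np.array([3.0, 2.0, 1.0]))  # worse than the mean
```

The three cases give 1, 0 and a negative score respectively, which shows that the formula is bounded above by 1 but not below by 0.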
- bigdl.orca.automl.metrics.MAE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray] [source]#
Calculate the mean absolute error (MAE).
\[\text{MAE} = \frac{1}{n}\sum_{t=1}^n |y_t-\hat{y_t}|\]
- Parameters
y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.
y_pred – Array-like of shape = (n_samples, *). Estimated target values.
multioutput – String in [‘raw_values’, ‘uniform_average’]
- Returns
Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
- bigdl.orca.automl.metrics.RMSE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') Union[numpy.float64, numpy.ndarray] [source]#
Calculate square root of the mean squared error (RMSE).
\[\text{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^n (y_t-\hat{y_t})^2}\]
- Parameters
y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.
y_pred – Array-like of shape = (n_samples, *). Estimated target values.
multioutput – String in [‘raw_values’, ‘uniform_average’]
- Returns
Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
- bigdl.orca.automl.metrics.MSE(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'uniform_average') Union[numpy.float64, numpy.ndarray] [source]#
Calculate the mean squared error (MSE).
\[\text{MSE} = \frac{1}{n}\sum_{t=1}^n (y_t-\hat{y_t})^2\]
- Parameters
y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.
y_pred – Array-like of shape = (n_samples, *). Estimated target values.
multioutput – String in [‘raw_values’, ‘uniform_average’]
- Returns
Float or ndarray of floats. A non-negative floating point value (the best value is 0.0), or an array of floating point values, one for each individual target.
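The difference between the two multioutput options can be sketched with the MSE formula above. This assumes that ‘raw_values’ returns one value per target and ‘uniform_average’ returns their unweighted mean; `mse_sketch` is a hypothetical name, not the library's implementation.

```python
import numpy as np


def mse_sketch(y_true, y_pred, multioutput="uniform_average"):
    """Mean squared error, per target or averaged over targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_target = np.mean((y_true - y_pred) ** 2, axis=0)
    if multioutput == "raw_values":
        return per_target
    return float(np.mean(per_target))


# Two samples, two targets: errors of 1 on the first target and 2 on the second.
y_true = np.zeros((2, 2))
y_pred = np.array([[1.0, 2.0], [1.0, 2.0]])
raw = mse_sketch(y_true, y_pred, multioutput="raw_values")
avg = mse_sketch(y_true, y_pred)
rmse_raw = np.sqrt(raw)  # RMSE per target is the square root of the per-target MSE
```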
- bigdl.orca.automl.metrics.Accuracy(y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput=None) Union[numpy.float64, numpy.ndarray] [source]#
Calculate the accuracy score (Accuracy).
\[\text{Accuracy} = \frac{1}{n}\sum_{t=1}^n 1(y_t=\hat{y_t})\]
- Parameters
y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.
y_pred – Array-like of shape = (n_samples, *). Estimated target values.
- Returns
Float or ndarray of floats. A non-negative floating point value (the best value is 1.0), or an array of floating point values, one for each individual target.
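A minimal sketch of the accuracy formula above, assuming y_pred already contains class labels rather than probabilities. `accuracy_sketch` is a hypothetical name.

```python
import numpy as np


def accuracy_sketch(y_true, y_pred):
    """Fraction of predictions that exactly match the ground truth."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    # The comparison yields booleans; their mean is the share of matches.
    return float(np.mean(y_true == y_pred))


acc = accuracy_sketch([0, 1, 1, 0], [0, 1, 0, 0])
```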
- class bigdl.orca.automl.metrics.Evaluator[source]#
Bases:
object
Evaluate metrics for y_true and y_pred.
- static evaluate(metric: str, y_true: numpy.ndarray, y_pred: numpy.ndarray, multioutput: str = 'raw_values') numpy.float64 [source]#
Evaluate a specific metric for y_true and y_pred.
- Parameters
metric – String in [‘me’, ‘mae’, ‘mse’, ‘rmse’, ‘msle’, ‘r2’ , ‘mpe’, ‘mape’, ‘mspe’, ‘smape’, ‘mdape’, ‘smdape’, ‘accuracy’]
y_true – Array-like of shape = (n_samples, *). Ground truth (correct) target values.
y_pred – Array-like of shape = (n_samples, *). Estimated target values.
multioutput – String in [‘raw_values’, ‘uniform_average’]
- Returns
Float or ndarray of floats. A floating point value, or an array of floating point values, one for each individual target.
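One way to picture a name-based evaluator like the one above is a lookup table from metric names to functions. The sketch below covers only three of the documented names and is a hypothetical illustration, not how Evaluator is implemented; `_METRIC_FUNCS` and `evaluate_sketch` are made-up names.

```python
import numpy as np

# Hypothetical lookup table covering three of the documented metric names.
_METRIC_FUNCS = {
    "mae": lambda t, p: np.mean(np.abs(t - p), axis=0),
    "mse": lambda t, p: np.mean((t - p) ** 2, axis=0),
    "rmse": lambda t, p: np.sqrt(np.mean((t - p) ** 2, axis=0)),
}


def evaluate_sketch(metric, y_true, y_pred, multioutput="raw_values"):
    """Look up a metric by name and apply it to y_true and y_pred."""
    if metric not in _METRIC_FUNCS:
        raise ValueError(f"Unsupported metric: {metric!r}")
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    per_target = _METRIC_FUNCS[metric](y_true, y_pred)
    if multioutput == "uniform_average":
        return np.mean(per_target)
    return per_target


mae = evaluate_sketch("mae", [[1.0], [3.0]], [[2.0], [5.0]])
rmse = evaluate_sketch("rmse", [[0.0], [0.0]], [[3.0], [4.0]], "uniform_average")
```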
orca.automl.auto_xgb
Automatic hyperparameter optimization for XGBoost models.
AutoXGBClassifier and AutoXGBRegressor inherit from AutoEstimator. Refer to the AutoEstimator API guide for more APIs.
- class bigdl.orca.automl.xgboost.auto_xgb.AutoXGBClassifier(logs_dir: str = '/tmp/auto_xgb_classifier_logs', cpus_per_trial: int = 1, name: Optional[str] = None, remote_dir: Optional[str] = None, **xgb_configs)[source]#
Bases:
bigdl.orca.automl.auto_estimator.AutoEstimator
Automated xgboost classifier
Example
>>> search_space = {"n_estimators": hp.grid_search([50, 1000]),
...                 "max_depth": hp.grid_search([2, 15]),
...                 "lr": hp.loguniform(1e-4, 1e-1)}
>>> auto_xgb_clf = AutoXGBClassifier(cpus_per_trial=4,
...                                  name="auto_xgb_classifier",
...                                  **config)
>>> auto_xgb_clf.fit(data=(X_train, y_train),
...                  validation_data=(X_val, y_val),
...                  metric="error",
...                  metric_mode="min",
...                  n_sampling=1,
...                  search_space=search_space)
>>> best_model = auto_xgb_clf.get_best_model()
- Parameters
logs_dir – Local directory to save logs and results. It defaults to “/tmp/auto_xgb_classifier_logs”
cpus_per_trial – Int. Number of cpus for each trial. It defaults to 1. The value will also be assigned to n_jobs in xgboost, which is the number of parallel threads used to run xgboost.
name – Name of the auto xgboost classifier.
remote_dir – String. Remote directory to sync training results and checkpoints. It defaults to None and takes no effect when running locally. When running on a cluster, it defaults to “hdfs:///tmp/{name}”.
xgb_configs – Other scikit learn xgboost parameters. You may refer to https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn for the parameter names to specify. Note that we will directly use cpus_per_trial value for n_jobs in xgboost and you shouldn’t specify n_jobs again.
- fit(data: Union[partial, Tuple[ndarray, ndarray], DataFrame], epochs: int = 1, validation_data: Optional[Union[partial, Tuple[ndarray, ndarray], DataFrame]] = None, metric: Optional[Union[Callable, str]] = None, metric_mode: Optional[str] = None, metric_threshold: Optional[Union[int, float]] = None, n_sampling: int = 1, search_space: Optional[Dict] = None, search_alg: Optional[str] = None, search_alg_params: Optional[Dict] = None, scheduler: Optional[str] = None, scheduler_params: Optional[Dict] = None, feature_cols: Optional[List[str]] = None, label_cols: Optional[List[str]] = None) None [source]#
Automatically fit the model and search for the best hyperparameters.
- Parameters
data – A Spark DataFrame, a tuple of ndarrays or a function. If data is a tuple of ndarrays, it should be in the form of (x, y), where x is training input data and y is training target data. If data is a function, it should take config as its argument and return a tuple of ndarrays in the form of (x, y).
epochs – Max number of epochs to train in each trial. Defaults to 1. If you have also set metric_threshold, a trial will stop if either it has been optimized to the metric_threshold or it has been trained for {epochs} epochs.
validation_data – Validation data. Validation data type should be the same as data.
metric – String or customized evaluation metric function. If a string, metric is the name of the evaluation metric to optimize, e.g. “mse”. If a callable, its signature should be func(y_true, y_pred), where y_true and y_pred are numpy ndarrays. The function should return a float value as the evaluation result.
metric_mode – One of [“min”, “max”]. “max” means a greater metric value is better. You must specify metric_mode if you use a customized metric function. You don’t have to specify it if you use a built-in metric from bigdl.orca.automl.metrics.Evaluator.
metric_threshold – A trial is terminated when the metric threshold is met.
n_sampling – Number of times to sample from the search_space. Defaults to 1. If hp.grid_search is in search_space, the grid is repeated n_sampling times. If this is -1, (virtually) infinite samples are generated until a stopping condition is met.
search_space – A dict defining the search space.
search_alg – str. Any searcher supported by Ray Tune (i.e. “variant_generator”, “random”, “ax”, “dragonfly”, “skopt”, “hyperopt”, “bayesopt”, “bohb”, “nevergrad”, “optuna”, “zoopt” and “sigopt”).
search_alg_params – Extra parameters for the search algorithm besides search_space, metric and searcher mode.
scheduler – str. Any scheduler supported by Ray Tune.
scheduler_params – Parameters for the scheduler.
feature_cols – Feature column names if data is a Spark DataFrame.
label_cols – Target column names if data is a Spark DataFrame.
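When data is passed as a function, the description above says it takes config and returns a tuple (x, y). The sketch below shows one possible shape of such a function, bound with functools.partial to match the type hint in the signature. The function name `make_data`, the config key "n_features" and the synthetic data are all assumptions made for illustration.

```python
from functools import partial

import numpy as np


def make_data(config, n_samples, seed):
    """Hypothetical data creator: returns (x, y) ndarrays for a given config."""
    rng = np.random.default_rng(seed)
    # "n_features" is an illustrative config key, not one defined by the library.
    n_features = config.get("n_features", 4)
    x = rng.normal(size=(n_samples, n_features))
    y = (x.sum(axis=1) > 0).astype(int)  # synthetic binary labels
    return x, y


# Bind everything except config, so the callable only needs the trial's config.
train_creator = partial(make_data, n_samples=100, seed=0)
val_creator = partial(make_data, n_samples=20, seed=1)

x, y = train_creator({"n_features": 3})
```

Since validation data should have the same type as data, a second partial such as val_creator would be passed as validation_data.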
- class bigdl.orca.automl.xgboost.auto_xgb.AutoXGBRegressor(logs_dir: str = '/tmp/auto_xgb_regressor_logs', cpus_per_trial: int = 1, name: Optional[str] = None, remote_dir: Optional[str] = None, **xgb_configs)[source]#
Bases:
bigdl.orca.automl.auto_estimator.AutoEstimator
Automated xgboost regressor
Example
>>> search_space = {"n_estimators": hp.grid_search([800, 1000]),
...                 "max_depth": hp.grid_search([10, 15]),
...                 "lr": hp.loguniform(1e-4, 1e-1),
...                 "min_child_weight": hp.choice([1, 2, 3])}
>>> auto_xgb_reg = AutoXGBRegressor(cpus_per_trial=2,
...                                 name="auto_xgb_regressor",
...                                 **config)
>>> auto_xgb_reg.fit(data=(X_train, y_train),
...                  validation_data=(X_val, y_val),
...                  metric="rmse",
...                  n_sampling=1,
...                  search_space=search_space)
>>> best_model = auto_xgb_reg.get_best_model()
- Parameters
logs_dir – Local directory to save logs and results. It defaults to “/tmp/auto_xgb_regressor_logs”.
cpus_per_trial – Int. Number of cpus for each trial. It defaults to 1. The value will also be assigned to n_jobs in xgboost, which is the number of parallel threads used to run xgboost.
name – Name of the auto xgboost regressor.
remote_dir – String. Remote directory to sync training results and checkpoints. It defaults to None and takes no effect when running locally. When running on a cluster, it defaults to “hdfs:///tmp/{name}”.
xgb_configs – Other scikit learn xgboost parameters. You may refer to https://xgboost.readthedocs.io/en/latest/python/python_api.html#module-xgboost.sklearn for the parameter names to specify. Note that we will directly use cpus_per_trial value for n_jobs in xgboost and you shouldn’t specify n_jobs again.
- fit(data: Union[partial, Tuple[ndarray, ndarray], DataFrame], epochs: int = 1, validation_data: Optional[Union[partial, Tuple[ndarray, ndarray], DataFrame]] = None, metric: Optional[Union[Callable, str]] = None, metric_mode: Optional[str] = None, metric_threshold: Optional[Union[float, int]] = None, n_sampling: int = 1, search_space: Optional[Dict] = None, search_alg: Optional[str] = None, search_alg_params: Optional[Dict] = None, scheduler: Optional[str] = None, scheduler_params: Optional[Dict] = None, feature_cols: Optional[List[str]] = None, label_cols: Optional[List[str]] = None) None [source]#
Automatically fit the model and search for the best hyperparameters.
- Parameters
data – A Spark DataFrame, a tuple of ndarrays or a function. If data is a tuple of ndarrays, it should be in the form of (x, y), where x is training input data and y is training target data. If data is a function, it should take config as its argument and return a tuple of ndarrays in the form of (x, y).
epochs – Max number of epochs to train in each trial. Defaults to 1. If you have also set metric_threshold, a trial will stop if either it has been optimized to the metric_threshold or it has been trained for {epochs} epochs.
validation_data – Validation data. Validation data type should be the same as data.
metric – String or customized evaluation metric function. If a string, metric is the name of the evaluation metric to optimize, e.g. “mse”. If a callable, its signature should be func(y_true, y_pred), where y_true and y_pred are numpy ndarrays. The function should return a float value as the evaluation result.
metric_mode – One of [“min”, “max”]. “max” means a greater metric value is better. You must specify metric_mode if you use a customized metric function. You don’t have to specify it if you use a built-in metric from bigdl.orca.automl.metrics.Evaluator.
metric_threshold – A trial is terminated when the metric threshold is met.
n_sampling – Number of times to sample from the search_space. Defaults to 1. If hp.grid_search is in search_space, the grid is repeated n_sampling times. If this is -1, (virtually) infinite samples are generated until a stopping condition is met.
search_space – A dict defining the search space.
search_alg – str. Any searcher supported by Ray Tune (i.e. “variant_generator”, “random”, “ax”, “dragonfly”, “skopt”, “hyperopt”, “bayesopt”, “bohb”, “nevergrad”, “optuna”, “zoopt” and “sigopt”).
search_alg_params – Extra parameters for the search algorithm besides search_space, metric and searcher mode.
scheduler – str. Any scheduler supported by Ray Tune.
scheduler_params – Parameters for the scheduler.
feature_cols – Feature column names if data is a Spark DataFrame.
label_cols – Target column names if data is a Spark DataFrame.
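The interaction between hp.grid_search and n_sampling described above can be made concrete with a small counting sketch in plain Python. It only models the documented rule that the grid is repeated n_sampling times; `expand_grid` and `count_trials` are hypothetical helpers, and the n_sampling = -1 case is not covered.

```python
from itertools import product


def expand_grid(grid_axes):
    """List every combination of the grid-searched values."""
    keys = list(grid_axes)
    return [dict(zip(keys, combo)) for combo in product(*(grid_axes[k] for k in keys))]


def count_trials(grid_axes, n_sampling):
    """Number of trials when the full grid is repeated n_sampling times."""
    return len(expand_grid(grid_axes)) * n_sampling


# Grid-searched parameters taken from the regressor example in this section.
grid = {"n_estimators": [800, 1000], "max_depth": [10, 15]}
configs = expand_grid(grid)
total = count_trials(grid, n_sampling=3)
```

With two grid axes of two values each, the grid has four points, so n_sampling=3 would give twelve trials. Non-grid entries such as hp.loguniform would be re-sampled for each trial.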