TSDataset#

Time series data is a special data formulation with specific operations. TSDataset is an abstraction of a time series dataset that provides various data processing operations (e.g. impute, deduplicate, resample, scale/unscale, roll) and feature engineering methods (e.g. datetime features, aggregation features). Most of the methods support cascade calls (method chaining). A TSDataset can be initialized from a pandas dataframe and converted to a pandas dataframe or a numpy ndarray.

class bigdl.chronos.data.tsdataset.TSDataset(data, repair=False, **schema)[source]#

Bases: object

TSDataset is an abstraction of a time series dataset. Most of the transform methods support cascade calls.

static from_pandas(df, dt_col, target_col, id_col=None, extra_feature_col=None, with_split=False, val_ratio=0, test_ratio=0.1, repair=False, deploy_mode=False)[source]#

Initialize tsdataset(s) from pandas dataframe.

Parameters
  • df – a pandas dataframe for your raw time series data.

  • dt_col – a str indicating the name of the datetime column in the input dataframe. For each id, dt_col must be sorted from earliest to latest.

  • target_col – a str or list indicates the col name of target column in the input data frame.

  • id_col – (optional) a str indicates the col name of dataframe id. If it is not explicitly stated, then the data is interpreted as only containing a single id.

  • extra_feature_col – (optional) a str or list indicates the col name of extra feature columns that are needed to predict the target column.

  • with_split – (optional) bool, states if we need to split the dataframe to train, validation and test set. The value defaults to False.

  • val_ratio – (optional) float, validation ratio. Only effective when with_split is set to True. The value defaults to 0.

  • test_ratio – (optional) float, test ratio. Only effective when with_split is set to True. The value defaults to 0.1.

  • repair – a bool indicates whether to automatically repair low quality data, which may call .impute()/.resample() or modify the datetime column of the dataframe. The value defaults to False.

  • deploy_mode – a bool indicates whether to use deploy mode, which will be used in production environment to reduce the latency of data processing. The value defaults to False.

Returns

a TSDataset instance when with_split is set to False, three TSDataset instances when with_split is set to True.

Create a tsdataset instance by:

>>> # Here is a df example:
>>> # id        datetime      value   "extra feature 1"   "extra feature 2"
>>> # 00        2019-01-01    1.9     1                   2
>>> # 01        2019-01-01    2.3     0                   9
>>> # 00        2019-01-02    2.4     3                   4
>>> # 01        2019-01-02    2.6     0                   2
>>> tsdataset = TSDataset.from_pandas(df, dt_col="datetime",
>>>                                   target_col="value", id_col="id",
>>>                                   extra_feature_col=["extra feature 1",
>>>                                                      "extra feature 2"])
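
The effect of with_split, val_ratio and test_ratio can be pictured with a small sketch. It assumes the split is chronological (training rows first, then validation, then test), which the parameter descriptions above do not state explicitly. split_by_ratio is a hypothetical helper written for illustration, not part of bigdl.chronos.

```python
def split_by_ratio(rows, val_ratio=0.0, test_ratio=0.1):
    """Split time-ordered rows into (train, val, test) by ratio.

    Assumption: the most recent rows go to the test set and the rows just
    before them go to the validation set. Illustration only.
    """
    n = len(rows)
    test_len = int(n * test_ratio)
    val_len = int(n * val_ratio)
    train_end = n - val_len - test_len
    val_end = train_end + val_len
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


days = list(range(10))  # stand-in for 10 time-ordered records of one id
train, val, test = split_by_ratio(days, val_ratio=0.1, test_ratio=0.2)
print(len(train), len(val), len(test))
```

With the default val_ratio=0 the validation part is empty, which matches the documented defaults.
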
static from_parquet(path, dt_col, target_col, id_col=None, extra_feature_col=None, with_split=False, val_ratio=0, test_ratio=0.1, repair=False, deploy_mode=False, **kwargs)[source]#

Initialize tsdataset(s) from path of parquet file.

Parameters
  • path – A string path to parquet file. The string could be a URL. Valid URL schemes include hdfs, http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.parquet. A file URL can also be a path to a directory that contains multiple partitioned parquet files.

  • dt_col – a str indicates the col name of datetime column in the input data frame.

  • target_col – a str or list indicates the col name of target column in the input data frame.

  • id_col – (optional) a str indicates the col name of dataframe id. If it is not explicitly stated, then the data is interpreted as only containing a single id.

  • extra_feature_col – (optional) a str or list indicates the col name of extra feature columns that are needed to predict the target column.

  • with_split – (optional) bool, states if we need to split the dataframe to train, validation and test set. The value defaults to False.

  • val_ratio – (optional) float, validation ratio. Only effective when with_split is set to True. The value defaults to 0.

  • test_ratio – (optional) float, test ratio. Only effective when with_split is set to True. The value defaults to 0.1.

  • repair – a bool indicates whether to automatically repair low quality data, which may call .impute()/.resample() or modify the datetime column of the dataframe. The value defaults to False.

  • deploy_mode – a bool indicates whether to use deploy mode, which will be used in production environment to reduce the latency of data processing. The value defaults to False.

  • kwargs – Any additional kwargs are passed to pd.read_parquet and pyarrow.parquet.read_table.

Returns

a TSDataset instance when with_split is set to False, three TSDataset instances when with_split is set to True.

Create a tsdataset instance by:

>>> # Here is a df example:
>>> # id        datetime      value   "extra feature 1"   "extra feature 2"
>>> # 00        2019-01-01    1.9     1                   2
>>> # 01        2019-01-01    2.3     0                   9
>>> # 00        2019-01-02    2.4     3                   4
>>> # 01        2019-01-02    2.6     0                   2
>>> tsdataset = TSDataset.from_parquet("hdfs://path/to/table.parquet", dt_col="datetime",
>>>                                   target_col="value", id_col="id",
>>>                                   extra_feature_col=["extra feature 1",
>>>                                                      "extra feature 2"])
static from_prometheus(prometheus_url, query, starttime, endtime, step, target_col=None, id_col=None, extra_feature_col=None, with_split=False, val_ratio=0, test_ratio=0.1, repair=False, deploy_mode=False, **kwargs)[source]#

Initialize tsdataset(s) from Prometheus data for specified time period via url.

Parameters
  • prometheus_url – a str indicates url of a Prometheus server.

  • query – a Prometheus expression query str or list.

  • starttime – start timestamp of the specified time period, RFC-3339 string or as a Unix timestamp in seconds.

  • endtime – end timestamp of the specified time period, RFC-3339 string or as a Unix timestamp in seconds.

  • step – a str indicates query resolution step width in Prometheus duration format or float number of seconds. More information about Prometheus time durations is available here: https://prometheus.io/docs/prometheus/latest/querying/basics/#time-durations

  • target_col – (optional) a Prometheus expression query str or list indicates the col name of target column in the input data frame. If it is not explicitly stated, then target column is automatically specified according to the Prometheus data.

  • id_col – (optional) a Prometheus expression query str indicates the col name of dataframe id. If it is not explicitly stated, then the data is interpreted as only containing a single id.

  • extra_feature_col – (optional) a Prometheus expression query str or list indicates the col name of extra feature columns that are needed to predict the target column. If it is not explicitly stated, then extra feature column is None.

  • with_split – (optional) bool, states if we need to split the dataframe to train, validation and test set. The value defaults to False.

  • val_ratio – (optional) float, validation ratio. Only effective when with_split is set to True. The value defaults to 0.

  • test_ratio – (optional) float, test ratio. Only effective when with_split is set to True. The value defaults to 0.1.

  • repair – a bool indicates whether to automatically repair low quality data, which may call .impute()/.resample() or modify the datetime column of the dataframe. The value defaults to False.

  • deploy_mode – a bool indicates whether to use deploy mode, which will be used in production environment to reduce the latency of data processing. The value defaults to False.

  • kwargs – Any additional kwargs are passed to the Prometheus query, such as timeout.

Returns

a TSDataset instance when with_split is set to False, three TSDataset instances when with_split is set to True.

Create a tsdataset instance by:

>>> # Here is an example:
>>> tsdataset = TSDataset.from_prometheus(prometheus_url="http://localhost:9090",
>>>                                       query='collectd_cpufreq{cpufreq="0"}',
>>>                                       starttime="2022-09-01T00:00:00Z",
>>>                                       endtime="2022-10-01T00:00:00Z",
>>>                                       step="1h")
impute(mode='last', const_num=0)[source]#

Impute the tsdataset by imputing each univariate time series distinguished by id_col and feature_col.

Parameters
  • mode

    imputation mode, select from “last”, “const” or “linear”.

    “last”: impute by propagating the last non-N/A number to the following N/A values. If there is no earlier non-N/A number, 0 is filled instead.

    “const”: impute with a constant value supplied by the user.

    “linear”: impute by linear interpolation.

  • const_num – indicates the const number to fill, which is only effective when mode is set to “const”.

Returns

the tsdataset instance.
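
The three imputation modes can be illustrated on a single univariate series. The following is a minimal pure-Python sketch of the semantics described above, not the library's implementation; impute_series is a hypothetical helper and missing values are represented by None.

```python
def impute_series(values, mode="last", const_num=0):
    """Fill None entries of one univariate series (illustration only)."""
    if mode == "const":
        return [const_num if v is None else v for v in values]
    if mode == "last":
        filled, last = [], 0  # 0 is used when no earlier value exists
        for v in values:
            if v is not None:
                last = v
            filled.append(last)
        return filled
    if mode == "linear":
        known = [i for i, v in enumerate(values) if v is not None]
        filled = list(values)
        # Interpolate only between two known points; gaps at the very
        # start or end are left untouched in this sketch.
        for left, right in zip(known, known[1:]):
            step = (values[right] - values[left]) / (right - left)
            for i in range(left + 1, right):
                filled[i] = values[left] + step * (i - left)
        return filled
    raise ValueError(f"unsupported mode: {mode}")


series = [None, 1.0, None, None, 4.0]
print(impute_series(series, mode="last"))
print(impute_series(series, mode="const", const_num=9))
print(impute_series(series, mode="linear"))
```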

deduplicate()[source]#

Remove duplicated records that have exactly the same values in every feature_col, for each multivariate time series distinguished by id_col.

Returns

the tsdataset instance.
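
As a rough illustration of what counts as a duplicate, the sketch below keeps the first occurrence of each fully identical record. drop_duplicate_rows is a hypothetical helper; the (id, datetime, value) tuple layout is an assumption made for the example.

```python
def drop_duplicate_rows(rows):
    """Keep the first occurrence of each fully identical row, in order."""
    seen, unique = set(), []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


rows = [
    ("00", "2019-01-01", 1.9),
    ("00", "2019-01-01", 1.9),  # exact duplicate, removed
    ("01", "2019-01-01", 1.9),  # same values but a different id, kept
    ("00", "2019-01-02", 2.4),
]
print(drop_duplicate_rows(rows))
```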

resample(interval, start_time=None, end_time=None, merge_mode='mean')[source]#

Resample on a new interval for each univariate time series distinguished by id_col and feature_col.

Parameters
  • interval – pandas offset aliases, indicating time interval of the output dataframe.

  • start_time – start time of the output dataframe.

  • end_time – end time of the output dataframe.

  • merge_mode – if the current interval is smaller than the output interval, the values that fall into the same output interval are merged using this mode. “max”, “min”, “mean” and “sum” are supported for now.

Returns

the tsdataset instance.
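
The role of merge_mode when downsampling can be sketched as follows. This is an illustration using only the standard library; resample_points is a hypothetical helper, the interval is passed as a timedelta rather than a pandas offset alias, and output intervals with no data are simply omitted, which the real method may handle differently.

```python
from datetime import datetime, timedelta
from statistics import mean

MERGERS = {"max": max, "min": min, "mean": mean, "sum": sum}


def resample_points(points, interval, merge_mode="mean"):
    """Merge time-ordered (datetime, value) pairs into buckets of `interval`."""
    merge = MERGERS[merge_mode]
    start = points[0][0]
    buckets = {}
    for ts, value in points:
        index = (ts - start) // interval  # timedelta // timedelta gives an int
        buckets.setdefault(index, []).append(value)
    return [(start + i * interval, merge(vals)) for i, vals in sorted(buckets.items())]


base = datetime(2019, 1, 1)
half_hourly = [(base + timedelta(minutes=30 * i), v)
               for i, v in enumerate([1.0, 3.0, 5.0, 7.0])]
hourly = resample_points(half_hourly, timedelta(hours=1), merge_mode="mean")
print(hourly)
```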

repair_abnormal_data(mode='relative', threshold=3.0)[source]#

Repair the tsdataset by replacing abnormal data detected based on threshold with the last non N/A number.

Parameters
  • mode

    detect abnormal data mode, select from “absolute” or “relative”.

    “absolute”: detect abnormal data by comparing with the given min and max values.

    “relative”: detect abnormal data by comparing with the mean value plus/minus several times the standard deviation.

  • threshold – indicates the range of comparison. It is a 2-element tuple of floats (min_value, max_value) when mode is set to “absolute”, and a single float when mode is set to “relative”.

Returns

the tsdataset instance.
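
The two detection modes can be sketched on a plain list of numbers. This is an illustration under stated assumptions, not the library's code: repair_abnormal is a hypothetical helper, the population standard deviation is used, the bounds are treated as inclusive, and an abnormal value with no earlier normal value becomes None.

```python
from statistics import mean, pstdev


def repair_abnormal(values, mode="relative", threshold=3.0):
    """Replace out-of-range values with the last in-range value."""
    if mode == "absolute":
        low, high = threshold  # (min_value, max_value)
    else:
        center, spread = mean(values), pstdev(values)
        low, high = center - threshold * spread, center + threshold * spread
    repaired, last = [], None
    for v in values:
        if low <= v <= high:
            last = v
            repaired.append(v)
        else:
            repaired.append(last)
    return repaired


print(repair_abnormal([1.0, 2.0, 100.0, 3.0], mode="absolute", threshold=(0.0, 10.0)))
print(repair_abnormal([10.0] * 9 + [100.0], mode="relative", threshold=2.0))
```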

gen_dt_feature(features='auto', one_hot_features=None)[source]#

Generate datetime feature(s) for each record.

Parameters
  • features – str or list, states which feature(s) will be generated. If the value is a str, it should be one of “auto” or “all”. For “auto”, a subset of datetime features will be generated based on the sampling frequency of your data. For “all”, the whole set of datetime features will be generated. If the value is a list, it should contain the features you want to generate. All datetime features and their descriptions are listed below. The value defaults to “auto”.

  • one_hot_features – list, states which feature(s) will be generated as one-hot-encoded features. The value defaults to None, which means no features will be one-hot encoded.

“MINUTE”: The minute of the time stamp.
“DAY”: The day of the time stamp.
“DAYOFYEAR”: The ordinal day of the year of the time stamp.
“HOUR”: The hour of the time stamp.
“WEEKDAY”: The day of the week of the time stamp, Monday=0, Sunday=6.
“WEEKOFYEAR”: The ordinal week of the year of the time stamp.
“MONTH”: The month of the time stamp.
“YEAR”: The year of the time stamp.
“IS_AWAKE”: Bool value indicating whether it belongs to awake hours for the time stamp, True for hours between 6 A.M. and 1 A.M.
“IS_BUSY_HOURS”: Bool value indicating whether it belongs to busy hours for the time stamp, True for hours between 7 A.M. and 10 A.M. and hours between 4 P.M. and 8 P.M.
“IS_WEEKEND”: Bool value indicating whether it belongs to weekends for the time stamp, True for Saturdays and Sundays.

Returns

the tsdataset instance.
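
Most of the listed features map directly onto fields of a datetime object. The sketch below computes the unambiguous ones for a single time stamp with the standard library. datetime_features is a hypothetical helper, the use of the ISO week number for “WEEKOFYEAR” is an assumption, and “IS_AWAKE” and “IS_BUSY_HOURS” are left out because their exact hour boundaries are not specified here.

```python
from datetime import datetime


def datetime_features(ts):
    """Return a dict of datetime features for one time stamp."""
    return {
        "MINUTE": ts.minute,
        "HOUR": ts.hour,
        "DAY": ts.day,
        "DAYOFYEAR": ts.timetuple().tm_yday,
        "WEEKDAY": ts.weekday(),            # Monday=0, Sunday=6
        "WEEKOFYEAR": ts.isocalendar()[1],  # assumption: ISO week number
        "MONTH": ts.month,
        "YEAR": ts.year,
        "IS_WEEKEND": ts.weekday() >= 5,    # Saturday or Sunday
    }


features = datetime_features(datetime(2019, 1, 5, 8, 30))
print(features)
```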

gen_global_feature(settings='comprehensive', full_settings=None, n_jobs=1)[source]#

Generate per-time-series features for each time series. This method is implemented with tsfresh. Make sure that the specified column names do not contain ‘__’.

Parameters
  • settings – str or dict. If a string is set, it must be one of “comprehensive”, “minimal” and “efficient”. If a dict is set, it should follow the instructions for default_fc_parameters in tsfresh. The value defaults to “comprehensive”.

  • full_settings – dict. It should follow the instructions for kind_to_fc_parameters in tsfresh. The value defaults to None.

  • n_jobs – int. The number of processes to use for parallelization.

Returns

the tsdataset instance.

gen_rolling_feature(window_size, settings='comprehensive', full_settings=None, n_jobs=1)[source]#

Generate aggregation features for each sample. This method is implemented with tsfresh. Make sure that the specified column names do not contain ‘__’.

Parameters
  • window_size – int, the size of the rolling window from which the features are generated.

  • settings – str or dict. If a string is set, it must be one of “comprehensive”, “minimal” and “efficient”. If a dict is set, it should follow the instructions for default_fc_parameters in tsfresh. The value defaults to “comprehensive”.

  • full_settings – dict. It should follow the instructions for kind_to_fc_parameters in tsfresh. The value defaults to None.

  • n_jobs – int. The number of processes to use for parallelization.

Returns

the tsdataset instance.

roll(horizon, lookback='auto', feature_col=None, target_col=None, id_sensitive=False, time_enc=False, label_len=0, is_predict=False)[source]#

Sampling by rolling for machine learning/deep learning models.

Parameters
  • lookback – int, lookback value. Defaults to ‘auto’, in which case the mode of the time series’ cycle length will be taken as the lookback.

  • horizon – int or list. If horizon is an int, we will sample horizon steps continuously after the forecasting point. If horizon is a list, we will sample discretely according to the input list, where 1 means the timestamp just after the observed data. As a special case, when horizon is set to 0, ground truth will be generated as None.

  • feature_col – str or list, indicates the feature col name. Default to None, where we will take all available feature in rolling.

  • target_col – str or list, indicates the target col name. Default to None, where we will take all target in rolling. It should be a subset of target_col you used to initialize the tsdataset.

  • id_sensitive

    bool. If id_sensitive is False, we will roll on each id’s sub dataframe and fuse the samples. The shape of the rolling result will be x: (num_sample, lookback, num_feature_col + num_target_col) and y: (num_sample, horizon + label_len, num_target_col), where num_sample is the sum of the sample numbers of each dataframe.

    If id_sensitive is True, we will roll on the wide dataframe whose columns are the cartesian product of id_col and feature_col. The shape of the rolling result will be x: (num_sample, lookback, new_num_feature_col + new_num_target_col) and y: (num_sample, horizon + label_len, new_num_target_col), where num_sample is the sample number of the wide dataframe, new_num_feature_col is the product of the number of ids and the number of feature_col, and new_num_target_col is the product of the number of ids and the number of target_col.

  • time_enc – bool. This parameter should be set to True only when you are using the Autoformer model. With time_enc set to True, 2 additional numpy ndarrays will be returned when you call .to_numpy(). Be sure to have a time type for dt_col if you set time_enc to True.

  • label_len – int. This parameter should be set only when you are using the Autoformer model. It indicates the length of the overlap area of output (y) and input (x) on the time axis.

  • is_predict – bool. This parameter indicates if the dataset will be sampled as a prediction dataset (without ground truth).

Returns

the tsdataset instance.

roll() can be called by:

>>> # Here is a df example:
>>> # id        datetime      value   "extra feature 1"   "extra feature 2"
>>> # 00        2019-01-01    1.9     1                   2
>>> # 01        2019-01-01    2.3     0                   9
>>> # 00        2019-01-02    2.4     3                   4
>>> # 01        2019-01-02    2.6     0                   2
>>> tsdataset = TSDataset.from_pandas(df, dt_col="datetime",
>>>                                   target_col="value", id_col="id",
>>>                                   extra_feature_col=["extra feature 1",
>>>                                                      "extra feature 2"])
>>> horizon, lookback = 1, 1
>>> tsdataset.roll(lookback=lookback, horizon=horizon, id_sensitive=False)
>>> x, y = tsdataset.to_numpy()
>>> print(x, y) # x = [[[1.9, 1, 2 ]], [[2.3, 0, 9 ]]] y = [[[ 2.4 ]], [[ 2.6 ]]]
>>> print(x.shape, y.shape) # x.shape = (2, 1, 3) y.shape = (2, 1, 1)
>>> tsdataset.roll(lookback=lookback, horizon=horizon, id_sensitive=True)
>>> x, y = tsdataset.to_numpy()
>>> print(x, y) # x = [[[ 1.9, 2.3, 1, 2, 0, 9 ]]] y = [[[ 2.4, 2.6]]]
>>> print(x.shape, y.shape) # x.shape = (1, 1, 6) y.shape = (1, 1, 2)
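
The id_sensitive=False case can be reproduced with a few lines of plain Python, which may help when reasoning about the output shapes. This is a sketch of the sampling scheme only, not the library's implementation: roll_by_id is a hypothetical helper, rows are assumed to be laid out as [target, feature_1, feature_2], and horizon must be at least 1 here.

```python
def roll_by_id(series_by_id, lookback, horizon, target_index=0):
    """Roll each id's rows separately and concatenate the samples.

    series_by_id maps an id to its time-ordered rows. horizon is an int >= 1
    or a list of step offsets, where 1 is the step right after the window.
    """
    offsets = list(range(1, horizon + 1)) if isinstance(horizon, int) else list(horizon)
    span = max(offsets)
    x, y = [], []
    for rows in series_by_id.values():
        for start in range(len(rows) - lookback - span + 1):
            end = start + lookback
            x.append(rows[start:end])
            y.append([[rows[end + offset - 1][target_index]] for offset in offsets])
    return x, y


data = {
    "00": [[1.9, 1, 2], [2.4, 3, 4]],
    "01": [[2.3, 0, 9], [2.6, 0, 2]],
}
x, y = roll_by_id(data, lookback=1, horizon=1)
print(x, y)
```

Passing a list such as horizon=[1, 3] picks only the first and third steps after each window, which is the discrete sampling described for list-valued horizons.
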
to_torch_data_loader(batch_size=32, roll=True, lookback='auto', horizon=None, feature_col=None, target_col=None, shuffle=True, time_enc=False, label_len=0, is_predict=False)[source]#

Convert TSDataset to a PyTorch DataLoader with or without rolling. We recommend using to_torch_data_loader (with the default roll=True) if you don’t need to output the rolled numpy array. It is much more efficient than rolling separately, especially when the dataframe or lookback is large.

Parameters
  • batch_size – int, the batch_size for a PyTorch DataLoader. It defaults to 32.

  • roll – bool. Whether to roll the dataframe before converting to DataLoader. If True, you must also specify lookback and horizon for rolling. If False, you must have called tsdataset.roll() before calling to_torch_data_loader(). Defaults to True.

  • lookback – int, lookback value. Defaults to ‘auto’, in which case the mode of the time series’ cycle length will be taken as the lookback.

  • horizon – int or list. If horizon is an int, we will sample horizon steps continuously after the forecasting point. If horizon is a list, we will sample discretely according to the input list. As a special case, when horizon is set to 0, ground truth will be generated as None.

  • feature_col – str or list, indicates the feature col name. Defaults to None, where we will take all available features in rolling.

  • target_col – str or list, indicates the target col name. Defaults to None, where we will take all targets in rolling. It should be a subset of the target_col you used to initialize the tsdataset.

  • shuffle – bool, whether the dataloader is shuffled. Defaults to True.

  • time_enc – bool. This parameter should be set to True only when you are using the Autoformer model. With time_enc set to True, 2 additional numpy ndarrays will be returned when you call .to_numpy(). Be sure to have a time type for dt_col if you set time_enc to True.

  • label_len – int. This parameter should be set only when you are using the Autoformer model. It indicates the length of the overlap area of output (y) and input (x) on the time axis.

  • is_predict – bool. This parameter should be set to True only when you are processing test data without accuracy evaluation. It indicates if the dataset will be sampled as a prediction dataset (without ground truth).

Returns

A PyTorch DataLoader instance. The data returned from the dataloader is in one of the following forms:

  1. a 3d numpy ndarray when is_predict=True or horizon=0 and time_enc=False

  2. a 2-dim tuple of 3d numpy ndarray (x, y) when is_predict=False and horizon != 0 and time_enc=False

  3. a 4-dim tuple of 3d numpy ndarray (x, y, x_enc, y_enc) when time_enc=True

to_torch_data_loader() can be called by:

>>> # Here is a df example:
>>> # id        datetime      value   "extra feature 1"   "extra feature 2"
>>> # 00        2019-01-01    1.9     1                   2
>>> # 01        2019-01-01    2.3     0                   9
>>> # 00        2019-01-02    2.4     3                   4
>>> # 01        2019-01-02    2.6     0                   2
>>> tsdataset = TSDataset.from_pandas(df, dt_col="datetime",
>>>                                   target_col="value", id_col="id",
>>>                                   extra_feature_col=["extra feature 1",
>>>                                                      "extra feature 2"])
>>> horizon, lookback = 1, 1
>>> data_loader = tsdataset.to_torch_data_loader(batch_size=32,
>>>                                              lookback=lookback,
>>>                                              horizon=horizon)
>>> # or roll outside. That might be less efficient than the way above.
>>> tsdataset.roll(lookback=lookback, horizon=horizon, id_sensitive=False)
>>> x, y = tsdataset.to_numpy()
>>> print(x, y) # x = [[[1.9, 1, 2 ]], [[2.3, 0, 9 ]]] y = [[[ 2.4 ]], [[ 2.6 ]]]
>>> data_loader = tsdataset.to_torch_data_loader(batch_size=32, roll=False)
to_tf_dataset(batch_size=32, shuffle=False)[source]#

Export a Dataset whose elements are slices of the given tensors.

Parameters
  • batch_size – Number of samples per batch of computation. If unspecified, batch_size will default to 32.

  • shuffle – whether to shuffle the dataset. The value defaults to False.

Returns

a tf.data dataset, including x and y.

to_numpy()[source]#
Export the rolling result in one of the following forms:
  1. a 3d numpy ndarray when is_predict=True or horizon=0 and time_enc=False

  2. a 2-dim tuple of 3d numpy ndarray (x, y) when is_predict=False and horizon != 0 and time_enc=False

  3. a 4-dim tuple of 3d numpy ndarray (x, y, x_enc, y_enc) when time_enc=True

Returns

a 3d numpy ndarray when is_predict=True or horizon=0 and time_enc=False, a 2-dim tuple of 3d numpy ndarray (x, y) when is_predict=False and horizon != 0 and time_enc=False, or a 4-dim tuple of 3d numpy ndarray (x, y, x_enc, y_enc) when time_enc=True. The ndarrays are cast to float32.

to_pandas()[source]#

Export the pandas dataframe.

Returns

the internal dataframe.

scale(scaler, fit=True)[source]#

Scale the time series dataset’s feature column and target column.

Parameters
  • scaler – sklearn scaler instance, StandardScaler, MaxAbsScaler, MinMaxScaler and RobustScaler are supported.

  • fit – whether the scaler needs to be fitted. Typically, the value should be set to True for the training set and False for the validation and test sets. The value defaults to True.

Returns

the tsdataset instance.

Assume there is a training set tsdata and a test set tsdata_test. scale() should be called first on the training set with the default value fit=True, then on the test set with the same scaler and fit=False.

>>> from sklearn.preprocessing import StandardScaler
>>> scaler = StandardScaler()
>>> tsdata.scale(scaler, fit=True)
>>> tsdata_test.scale(scaler, fit=False)
unscale()[source]#

Unscale the time series dataset’s feature column and target column.

Returns

the tsdataset instance.

unscale_numpy(data)[source]#

Unscale the time series forecaster’s numpy prediction result/ground truth.

Parameters

data – a 3-dim numpy ndarray whose shape should be exactly the same as that of self.numpy_y.

Returns

the unscaled numpy ndarray.
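
The fit-on-train, reuse-on-test pattern behind scale() and the inverse step behind unscale()/unscale_numpy() can be seen in isolation with a tiny scaler. TinyStandardScaler below is a hypothetical stand-in for sklearn's StandardScaler, limited to one column and written only to make the data flow visible.

```python
from statistics import mean, pstdev


class TinyStandardScaler:
    """Minimal single-column stand-in for a standard scaler."""

    def fit(self, values):
        self.mean_ = mean(values)
        self.scale_ = pstdev(values) or 1.0  # avoid dividing by zero
        return self

    def transform(self, values):
        return [(v - self.mean_) / self.scale_ for v in values]

    def inverse_transform(self, values):
        return [v * self.scale_ + self.mean_ for v in values]


train, test = [1.0, 3.0, 1.0, 3.0], [5.0]
scaler = TinyStandardScaler().fit(train)          # like scale(scaler, fit=True)
scaled_test = scaler.transform(test)              # like scale(scaler, fit=False)
restored = scaler.inverse_transform(scaled_test)  # like unscale_numpy(data)
print(scaled_test, restored)
```

The test set is transformed with statistics learned from the training set only, which is why fit=False is used for validation and test data.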

get_cycle_length(aggregate='mode', top_k=3)[source]#

Calculate the cycle length of the time series in this TSDataset.

Parameters
  • top_k (int) – The frequencies with the top_k highest power after the FFT will be used to check the autocorrelation. A higher top_k might be more time-consuming. The value defaults to 3.

  • aggregate (str) – The mode used to calculate the time period. Only ‘min’, ‘max’, ‘mode’, ‘median’ and ‘mean’ are supported.

Returns

A value describing the distribution of the time period.
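
The idea of picking candidate periods from the strongest frequencies and then checking them by autocorrelation can be sketched for a single series as follows. This is an illustration of the general heuristic, not the library's algorithm: estimate_cycle_length is a hypothetical helper, a naive DFT stands in for the FFT, and the aggregation across several time series is left out.

```python
import cmath
import math


def estimate_cycle_length(values, top_k=3):
    """Estimate the dominant period of one series (illustration only)."""
    n = len(values)
    avg = sum(values) / n
    centered = [v - avg for v in values]

    # Power spectrum from a naive DFT; the constant term k=0 is skipped.
    power = {}
    for k in range(1, n // 2 + 1):
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centered))
        power[k] = abs(coeff) ** 2

    # Candidate periods derived from the top_k strongest frequencies.
    strongest = sorted(power, key=power.get, reverse=True)[:top_k]
    candidates = {round(n / k) for k in strongest}

    # Autocorrelation at a given lag, normalised by the lag-0 value.
    energy = sum(c * c for c in centered)

    def autocorr(lag):
        return sum(centered[t] * centered[t + lag] for t in range(n - lag)) / energy

    valid = [lag for lag in candidates if 0 < lag < n]
    return max(valid, key=autocorr)


wave = [math.sin(2 * math.pi * t / 12) for t in range(48)]
print(estimate_cycle_length(wave))
```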

export_jit(path_dir=None, drop_dt_col=True)[source]#

Export the data processing pipeline to TorchScript so that it can be used without a Python environment. For example, when you are deploying a trained model in C++ and need to process input data, you can call this method once you finish developing the model to get TorchScript modules containing the data processing pipeline and save them in .pt files. When deploying, you can load a TorchScript module from its .pt file and run the data processing pipeline in C++ using the libtorch APIs. The output tensor can then be fed into the trained model for inference.

Currently we support exporting preprocessing (scale and roll) and postprocessing (unscale) to TorchScript. The exported modules do the same thing as the following code:

>>> # preprocess
>>> tsdata.scale(scaler, fit=False) \
>>>       .roll(lookback, horizon, is_predict=True)
>>> preprocess_output = tsdata.to_numpy()
>>> # postprocess
>>> # "data" can be the output of model inference
>>> postprocess_output = tsdata.unscale_numpy(data)

Preprocessing and postprocessing will be converted to separate torchscript modules, so two modules will be returned and saved.

When deploying, the compiled torchscript module can be used by:

>>> // deployment in C++
>>> #include <torch/torch.h>
>>> #include <torch/script.h>
>>> // create input tensor from your data
>>> // the data to create input tensor should have the same format as the
>>> // data used in developing
>>> torch::Tensor input_tensor = create_input_tensor(data);
>>> // load the module
>>> torch::jit::script::Module preprocessing;
>>> preprocessing = torch::jit::load(preprocessing_path);
>>> // run data preprocessing
>>> torch::Tensor preprocessing_output = preprocessing.forward(input_tensor).toTensor();
>>> // inference using your trained model
>>> torch::Tensor inference_output = trained_model(preprocessing_output);
>>> // load the postprocessing module
>>> torch::jit::script::Module postprocessing;
>>> postprocessing = torch::jit::load(postprocessing_path);
>>> // run postprocessing
>>> torch::Tensor output = postprocessing.forward(inference_output).toTensor();
Currently there are some limitations:
  1. Please make sure the values of each column can be converted to a PyTorch tensor. For example, id “00” is not allowed because a str cannot be converted to a tensor; use integers (0, 1, ...) as ids instead of strings.

  2. Some features in tsdataset.scale and tsdataset.roll are unavailable in this pipeline:

    1. If self.roll_additional_feature is not None, it can’t be processed in scale and roll

    2. The id_sensitive, time_enc and label_len parameters are not supported in roll

  3. Users are expected to call .scale(scaler, fit=True) before calling export_jit. Converting a single roll operation is not supported for now.

Parameters
  • path_dir – The path to save the compiled torchscript modules, default to None. If set to None, you should call torch.jit.save() in your code to save the returned modules; if not None, the path should be a directory, and the modules will be saved at “path_dir/tsdata_preprocessing.pt” and “path_dir/tsdata_postprocessing.pt”.

  • drop_dt_col – Whether to delete the datetime column, defaults to True. Since a datetime value (like “2022-12-12”) can’t be converted to a PyTorch tensor, you can choose between two workarounds. If set to True, the datetime column will be deleted, and you also need to skip the datetime column when reading data from the data source (like csv files) in the deployment environment, to keep the same structure as the data used in development. If set to False, the datetime column will not be deleted, and you need to make sure the datetime column can be successfully converted to a PyTorch tensor when reading data in the deployment environment. For example, you can set each entry in the datetime column to an int (or another valid type); since the datetime column is not needed in preprocessing and postprocessing, the value can be arbitrary.

Returns

A tuple (preprocessing_module, postprocessing_module) containing the compiled torchscript modules.

XShardsTSDataset#

Time series data is a special data formulation with specific operations. XShardsTSDataset is an abstraction of a time series dataset that provides various data processing operations (e.g. impute, deduplicate, resample, scale/unscale, roll) and feature engineering methods (e.g. datetime features, aggregation features). Most of the methods support cascade calls. An XShardsTSDataset can be initialized from xshards of pandas dataframes and converted to xshards of numpy in a distributed and parallelized fashion.

class bigdl.chronos.data.experimental.xshards_tsdataset.XShardsTSDataset(shards, **schema)[source]#

Bases: object

XShardsTSDataset is an abstraction of a time series dataset in a distributed fashion. Most of the transform methods support cascade calls. XShardsTSDataset will partition the dataset by id_col, which is experimental.

static from_xshards(shards, dt_col, target_col, id_col=None, extra_feature_col=None, with_split=False, val_ratio=0, test_ratio=0.1)[source]#

Initialize xshardtsdataset(s) from xshard pandas dataframe.

Parameters
  • shards – an xshards pandas dataframe for your raw time series data.

  • dt_col – a str indicates the col name of datetime column in the input data frame.

  • target_col – a str or list indicates the col name of target column in the input data frame.

  • id_col – (optional) a str indicates the col name of dataframe id. If it is not explicitly stated, then the data is interpreted as only containing a single id.

  • extra_feature_col – (optional) a str or list indicates the col name of extra feature columns that are needed to predict the target column.

  • with_split – (optional) bool, states if we need to split the dataframe to train, validation and test set. The value defaults to False.

  • val_ratio – (optional) float, validation ratio. Only effective when with_split is set to True. The value defaults to 0.

  • test_ratio – (optional) float, test ratio. Only effective when with_split is set to True. The value defaults to 0.1.

Returns

an XShardsTSDataset instance when with_split is set to False, three XShardsTSDataset instances when with_split is set to True.

Create a xshardtsdataset instance by:

>>> # Here is a df example:
>>> # id        datetime      value   "extra feature 1"   "extra feature 2"
>>> # 00        2019-01-01    1.9     1                   2
>>> # 01        2019-01-01    2.3     0                   9
>>> # 00        2019-01-02    2.4     3                   4
>>> # 01        2019-01-02    2.6     0                   2
>>> from bigdl.orca.data.pandas import read_csv
>>> shards = read_csv(csv_path)
>>> tsdataset = XShardsTSDataset.from_xshards(shards, dt_col="datetime",
>>>                                           target_col="value", id_col="id",
>>>                                           extra_feature_col=["extra feature 1",
>>>                                                              "extra feature 2"])
static from_sparkdf(df, dt_col, target_col, id_col=None, extra_feature_col=None, with_split=False, val_ratio=0, test_ratio=0.1)[source]#

Initialize xshardtsdataset(s) from Spark Dataframe.

Parameters
  • df – a Spark DataFrame for your raw time series data.

  • dt_col – a str indicates the col name of datetime column in the input data frame.

  • target_col – a str or list indicates the col name of target column in the input data frame.

  • id_col – (optional) a str indicates the col name of dataframe id. If it is not explicitly stated, then the data is interpreted as only containing a single id.

  • extra_feature_col – (optional) a str or list indicates the col name of extra feature columns that are needed to predict the target column.

  • with_split – (optional) bool, states if we need to split the dataframe to train, validation and test set. The value defaults to False.

  • val_ratio – (optional) float, validation ratio. Only effective when with_split is set to True. The value defaults to 0.

  • test_ratio – (optional) float, test ratio. Only effective when with_split is set to True. The value defaults to 0.1.

Returns

an XShardsTSDataset instance when with_split is set to False, or three XShardsTSDataset instances (train, validation and test) when with_split is set to True.

Create an XShardsTSDataset instance by:

>>> # Here is a df example:
>>> # id        datetime      value   "extra feature 1"   "extra feature 2"
>>> # 00        2019-01-01    1.9     1                   2
>>> # 01        2019-01-01    2.3     0                   9
>>> # 00        2019-01-02    2.4     3                   4
>>> # 01        2019-01-02    2.6     0                   2
>>> # df is a pyspark.sql.dataframe.DataFrame holding the data above
>>> tsdataset = XShardsTSDataset.from_sparkdf(df, dt_col="datetime",
...                                           target_col="value", id_col="id",
...                                           extra_feature_col=["extra feature 1",
...                                                              "extra feature 2"])
roll(lookback, horizon, feature_col=None, target_col=None, id_sensitive=False)[source]#

Sampling by rolling for machine learning/deep learning models.

Parameters
  • lookback – int, lookback value.

  • horizon – int or list. If horizon is an int, horizon consecutive steps after the forecasting point are sampled. If horizon is a list, steps are sampled discretely according to the list. As a special case, when horizon is set to 0, the ground truth is generated as None.

  • feature_col – str or list, the feature column name(s). Defaults to None, in which case all available features are used in rolling.

  • target_col – str or list, the target column name(s). Defaults to None, in which case all targets are used in rolling. It must be a subset of the target_col used to initialize the xshardtsdataset.

  • id_sensitive – bool. If id_sensitive is False, rolling is applied to each id's sub-dataframe and the samples are fused. The rolling result has the shape

    x: (num_sample, lookback, num_feature_col + num_target_col)

    y: (num_sample, horizon, num_target_col)

    where num_sample is the sum of the sample counts of all sub-dataframes. id_sensitive=True is not implemented yet.

Returns

the xshardtsdataset instance.
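The shapes described for roll can be checked with a minimal pure-Python sketch of sliding-window sampling. It is an illustration under stated assumptions, not the BigDL implementation: roll_one_id is a hypothetical helper, and the column order (targets first, then features) is assumed here for simplicity.

```python
def roll_one_id(series, lookback, horizon, num_target_col=1):
    """Build (x, y) samples from one id's rows with a sliding window.

    `series` is a list of rows. Each row is assumed to list the target
    column(s) first, followed by the feature column(s).
    """
    # An int horizon means steps 1..horizon; a list means exactly those steps.
    steps = list(range(1, horizon + 1)) if isinstance(horizon, int) else list(horizon)
    max_step = max(steps) if steps else 0
    x, y = [], []
    for start in range(len(series) - lookback - max_step + 1):
        end = start + lookback  # index of the first step after the window
        x.append([row[:] for row in series[start:end]])
        y.append([series[end + s - 1][:num_target_col] for s in steps])
    # With horizon=0 there is no ground truth, so y is None.
    return x, (y if steps else None)


# Two ids, five rows each: one target column followed by two feature columns.
data = {
    "00": [[float(t), t % 2, t * 2] for t in range(5)],
    "01": [[float(t) + 10, 0, t] for t in range(5)],
}

x_all, y_all = [], []
for rows in data.values():  # id_sensitive=False: roll per id, then fuse
    x, y = roll_one_id(rows, lookback=3, horizon=1)
    x_all += x
    y_all += y

print(len(x_all), len(x_all[0]), len(x_all[0][0]))  # x shape
print(len(y_all), len(y_all[0]), len(y_all[0][0]))  # y shape
```

Each id contributes 5 - 3 - 1 + 1 = 2 samples, so num_sample is the sum over ids, matching the description of the fused result.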

scale(scaler, fit=True)[source]#

Scale the time series dataset’s feature column and target column.

Parameters
  • scaler – a dictionary of scaler instances, where keys are id names and values are the corresponding scaler instances. For example, if you have 2 ids called “id1” and “id2”, a legal scaler input is {“id1”: StandardScaler(), “id2”: StandardScaler()}.

  • fit – whether to fit the scaler. Typically, the value should be set to True for the training set and False for the validation and test sets. The value defaults to True.

Returns

the xshardtsdataset instance.

Assume there is a training set tsdata and a test set tsdata_test. scale() should be called first on the training set with the default fit=True, then on the test set with the same scaler and fit=False.

>>> from sklearn.preprocessing import StandardScaler
>>> scaler = {"id1": StandardScaler(), "id2": StandardScaler()}
>>> tsdata.scale(scaler, fit=True)
>>> tsdata_test.scale(scaler, fit=False)
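The reason for fit=True on the training set and fit=False elsewhere can be seen in a small stdlib-only sketch. MeanStdScaler and scale_per_id are hypothetical stand-ins written for this illustration. They mimic the fit/transform/inverse_transform pattern of sklearn's StandardScaler on a single column and are not BigDL code.

```python
import statistics


class MeanStdScaler:
    """Tiny stand-in for a standard scaler on a single column."""

    def fit(self, values):
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values) or 1.0  # avoid division by zero
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]

    def inverse_transform(self, values):
        return [v * self.std + self.mean for v in values]


def scale_per_id(data, scalers, fit=True):
    """Scale each id's values with its own scaler, fitting only when asked."""
    scaled = {}
    for id_name, values in data.items():
        scaler = scalers[id_name]
        if fit:
            scaler.fit(values)
        scaled[id_name] = scaler.transform(values)
    return scaled


train = {"id1": [1.0, 2.0, 3.0], "id2": [10.0, 20.0, 30.0]}
test = {"id1": [4.0], "id2": [40.0]}
scalers = {"id1": MeanStdScaler(), "id2": MeanStdScaler()}

train_scaled = scale_per_id(train, scalers, fit=True)   # learns mean/std per id
test_scaled = scale_per_id(test, scalers, fit=False)    # reuses training statistics

# Undoing the scaling recovers the original test values.
restored = scalers["id1"].inverse_transform(test_scaled["id1"])
print(test_scaled["id1"], restored)
```

Fitting on the test set as well would shift its values by its own mean, so test and training data would no longer share one scale.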
unscale()[source]#

Unscale the time series dataset’s feature column and target column.

Returns

the xshardtsdataset instance.

unscale_xshards(data, key=None)[source]#

Unscale the time series forecaster’s numpy prediction result/ground truth.

Parameters
  • data – an xshards object in the same format as self.numpy_xshards.

  • key – str, ‘y’ or ‘prediction’. If data contains neither “prediction” nor “y”, an error is raised and the user must pass a key explicitly. If key is None, it is set to ‘prediction’.

Returns

the unscaled xshardtsdataset instance.

impute(mode='last', const_num=0)[source]#

Impute the tsdataset by imputing each univariate time series distinguished by id_col and feature_col.

Parameters
  • mode

    imputation mode, select from “last”, “const” or “linear”.

    “last”: impute by propagating the last non N/A number to its following N/A. If there is no non N/A number ahead, 0 is filled instead.

    “const”: impute by a const value input by user.

    “linear”: impute by linear interpolation.

  • const_num – indicates the const number to fill, which is only effective when mode is set to “const”.

Returns

the tsdataset instance.
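The three imputation modes can be compared on a single series with the sketch below. It is a minimal illustration, not the BigDL implementation: impute_series is a hypothetical helper, None stands in for N/A, and leaving leading or trailing gaps unfilled in “linear” mode is an assumption made here because this page does not specify that case.

```python
def impute_series(values, mode="last", const_num=0):
    """Fill None entries in one univariate series."""
    if mode == "const":
        return [const_num if v is None else v for v in values]
    if mode == "last":
        filled, last = [], 0  # 0 is used when no value precedes the gap
        for v in values:
            last = last if v is None else v
            filled.append(last)
        return filled
    if mode == "linear":
        filled = list(values)
        known = [i for i, v in enumerate(values) if v is not None]
        # Interpolate between each pair of neighbouring known points.
        for left, right in zip(known, known[1:]):
            step = (values[right] - values[left]) / (right - left)
            for i in range(left + 1, right):
                filled[i] = values[left] + step * (i - left)
        return filled  # assumption: leading/trailing gaps stay None
    raise ValueError(f"unknown mode: {mode}")


series = [None, 1.0, None, None, 4.0]
print(impute_series(series, mode="last"))
print(impute_series(series, mode="const", const_num=-1))
print(impute_series(series, mode="linear"))
```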

to_xshards(partition_num=None)[source]#

Export rolling result in form of a dict of numpy ndarray {‘x’: …, ‘y’: …, ‘id’: …}, where value for ‘x’ and ‘y’ are 3-dim numpy ndarray and value for ‘id’ is 2-dim ndarray with shape (batch_size, 1)

Parameters

partition_num – how many partitions to split the data into. Defaults to None, which partitions the data according to id.

Returns

a 3-element dict xshard with keys ‘x’, ‘y’ and ‘id’, where ‘x’ and ‘y’ are 3-dim numpy ndarrays and ‘id’ is a 2-dim ndarray. The ndarrays are cast to float32.

Built-in Dataset#

Built-in datasets can be downloaded and preprocessed with get_public_dataset, and synthetic data can be generated with gen_synthetic_data. Train, validation and test splits are also supported.

bigdl.chronos.data.repo_dataset.get_public_dataset(name, path='~/.chronos/dataset', redownload=False, **kwargs)[source]#

Get public dataset.

>>> from bigdl.chronos.data import get_public_dataset
>>> tsdata_network_traffic = get_public_dataset(name="network_traffic")
Parameters
  • name – str, public dataset name, e.g. “network_traffic”. Supported names are network_traffic, AIOps, fsi, nyc_taxi, uci_electricity, uci_electricity_wide and tsinghua_electricity.

  • path – str, download path. The value defaults to “~/.chronos/dataset/”.

  • redownload – bool, whether to re-download the raw dataset file(s).

  • kwargs – extra arguments passed to initialize the tsdataset, including with_split, val_ratio and test_ratio.

bigdl.chronos.data.repo_dataset.gen_synthetic_data(len=10000, sine_amplitude=10.0, angular_freq=0.01, noise_amplitude=0.01, noise_scale=1.0, seed=1, time_freq='D', **kwargs)[source]#

Generate a dataset from a sine function with Gaussian noise. Datetimes are generated according to time_freq, with the current time as the end time.

>>> from bigdl.chronos.data import gen_synthetic_data
>>> tsdata_gen = gen_synthetic_data()
Parameters
  • len – int, the number indicates the dataset size. Default to 10000.

  • sine_amplitude – float, the number indicates amplitude of the sine function. Default to 10.0.

  • angular_freq – float, the number indicates angular frequency of the sine function. Default to 0.01.

  • noise_amplitude – float, the number indicates amplitude of the Gaussian noise. Default to 0.01.

  • noise_scale – float, the number indicates the standard deviation of the Gaussian noise while the mean is set to 0. Default to 1.0.

  • seed – int, random seed for generating Gaussian noise. Default to 1.

  • time_freq – str, the frequency of the generated dataframe, default to ‘D’(calendar day frequency). The frequency can be anything from the pandas list of frequency strings here: https://pandas.pydata.org/docs/user_guide/timeseries.html#timeseries-offset-aliases

  • kwargs – extra arguments passed to initialize the tsdataset, including with_split, val_ratio and test_ratio.

Returns

a TSDataset instance when with_split is set to False, three TSDataset instances when with_split is set to True.
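How the gen_synthetic_data parameters might combine is sketched below with the standard library only. The formula is an assumption inferred from the parameter descriptions, not taken from the BigDL source: value[t] = sine_amplitude * sin(angular_freq * t) + noise_amplitude * N(0, noise_scale). synthetic_sine is a hypothetical helper, and it only covers the daily frequency (time_freq='D').

```python
import math
import random
from datetime import datetime, timedelta


def synthetic_sine(length=10000, sine_amplitude=10.0, angular_freq=0.01,
                   noise_amplitude=0.01, noise_scale=1.0, seed=1):
    """Return (timestamps, values) for a noisy sine wave.

    Assumed formula, for illustration only:
    value[t] = sine_amplitude * sin(angular_freq * t)
               + noise_amplitude * gauss(0, noise_scale)
    """
    rng = random.Random(seed)  # seeded so the noise is reproducible
    values = [
        sine_amplitude * math.sin(angular_freq * t)
        + noise_amplitude * rng.gauss(0.0, noise_scale)
        for t in range(length)
    ]
    end = datetime.now()  # the current time is the last timestamp
    stamps = [end - timedelta(days=length - 1 - t) for t in range(length)]
    return stamps, values


stamps, values = synthetic_sine(length=5)
print(len(values), stamps[-1] - stamps[0])
```

Calling the helper twice with the same seed yields the same values, while the timestamps depend on when the call is made.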