Data Processing and Feature Engineering#
Time series data is a special data form with its own specific operations. Chronos provides TSDataset as a time series dataset abstraction for data processing (e.g. impute, deduplicate, resample, scale/unscale, roll sampling) and automatic feature engineering (e.g. datetime features, aggregation features). Chronos also provides XShardsTSDataset with the same (or similar) API for distributed and parallelized data preprocessing on large data.
Users can create a TSDataset quickly from many raw data types, including pandas dataframes, parquet files, Spark dataframes or XShards objects. A TSDataset can be directly used in AutoTSEstimator and forecasters. It can also be converted to a pandas dataframe, numpy ndarray, pytorch dataloader or tensorflow dataset for various usage.
1. Basic concepts#
A time series can be interpreted as a sequence of real values ordered by timestamp, while a time series dataset can be a combination of one or a large number of time series. A dataset may contain multiple time series because users may collect different time series in the same or different periods of time (e.g. an AIOps dataset may have CPU usage ratio and memory usage ratio data for two servers over a period of time; this dataset contains four time series).
In TSDataset and XShardsTSDataset, we provide two possible dimensions to construct a high-dimensional time series dataset (i.e. the feature dimension and the id dimension).
feature dimension: Time series along this dimension might be independent or related. Though they may be related, they are assumed to have different patterns and distributions and are collected over the same period of time. For example, the CPU usage ratio and memory usage ratio for the same server over a period of time.
id dimension: Time series along this dimension are assumed to have the same patterns and distributions and might be collected over the same or different periods of time. For example, the CPU usage ratio for two servers over a period of time.
All the preprocessing operations will be done on each independent time series (i.e. on both the feature dimension and the id dimension), while feature scaling will only be carried out on the feature dimension.
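For instance, the AIOps example above can be laid out as a single dataframe in which the "Server id" column spans the id dimension and the two usage columns span the feature dimension (a minimal illustrative sketch; the column names are chosen to match the example in the next section):
import pandas as pd

# Two servers (id dimension), each with CPU and memory usage (feature dimension):
# 2 ids x 2 features = 4 independent time series in one dataset.
df = pd.DataFrame({
    "Server id": [0, 0, 1, 1],
    "Datetime": pd.to_datetime(["2021-07-09 08:39", "2021-07-09 08:40",
                                "2021-07-09 08:39", "2021-07-09 08:40"]),
    "CPU usage": [93, 91, 73, 72],
    "Mem usage": [24, 24, 79, 80],
})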
Note
XShardsTSDataset will perform the data processing in parallel (based on Spark) to support large datasets, but the parallelization is only performed along the "id dimension". This means that, in the previous example, XShardsTSDataset will only utilize multiple workers to process data for different servers at the same time. If a dataset only has 1 id, XShardsTSDataset will be even slower than TSDataset because of the overhead.
2. Create a TSDataset#
TSDataset supports initializing from a pandas dataframe through TSDataset.from_pandas, from a parquet file through TSDataset.from_parquet, or from Prometheus data through TSDataset.from_prometheus.
XShardsTSDataset supports initializing from an XShards object through XShardsTSDataset.from_xshards or from a Spark dataframe through XShardsTSDataset.from_sparkdf.
A typical valid time series dataframe df is shown below. You can initialize a TSDataset or XShardsTSDataset by simply:
# Server id Datetime CPU usage Mem usage
# 0 08:39 2021/7/9 93 24
# 0 08:40 2021/7/9 91 24
# 0 08:41 2021/7/9 93 25
# 0 ... ... ...
# 1 08:39 2021/7/9 73 79
# 1 08:40 2021/7/9 72 80
# 1 08:41 2021/7/9 79 80
# 1 ... ... ...
from bigdl.chronos.data import TSDataset
tsdata = TSDataset.from_pandas(df,
                               dt_col="Datetime",
                               id_col="Server id",
                               target_col=["CPU usage", "Mem usage"])
# Here is a df example:
# id datetime value "extra feature 1" "extra feature 2"
# 00 2019-01-01 1.9 1 2
# 01 2019-01-01 2.3 0 9
# 00 2019-01-02 2.4 3 4
# 01 2019-01-02 2.6 0 2
from bigdl.orca.data.pandas import read_csv
from bigdl.chronos.data.experimental import XShardsTSDataset
shards = read_csv(csv_path)
tsdataset = XShardsTSDataset.from_xshards(shards,
                                          dt_col="datetime",
                                          target_col="value",
                                          id_col="id",
                                          extra_feature_col=["extra feature 1",
                                                             "extra feature 2"])
target_col is a list of all elements along the feature dimension, while id_col is the identifier that distinguishes the id dimension. dt_col is the datetime column. For extra_feature_col (not shown in the first case), you should list those features that you will use as input features but not as target features (e.g. you will not perform a forecasting or anomaly detection task on this column).
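As a sketch of how extra_feature_col fits into the pandas path (the "Fan speed" column is hypothetical, added here only for illustration):
from bigdl.chronos.data import TSDataset

# "Fan speed" is used as an input feature only; forecasting targets remain CPU/Mem usage.
tsdata = TSDataset.from_pandas(df,
                               dt_col="Datetime",
                               id_col="Server id",
                               target_col=["CPU usage", "Mem usage"],
                               extra_feature_col=["Fan speed"])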
If you are building a prototype for your forecasting/anomaly detection task and you need to split your TSDataset into train/valid/test sets, you can use the with_split parameter. TSDataset and XShardsTSDataset support splitting by ratio through val_ratio and test_ratio.
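A minimal sketch of splitting at creation time (assuming the same df and column names as above):
# returns three TSDataset objects: train, validation and test
tsdata_train, tsdata_valid, tsdata_test =\
    TSDataset.from_pandas(df,
                          dt_col="Datetime",
                          id_col="Server id",
                          target_col=["CPU usage", "Mem usage"],
                          with_split=True,
                          val_ratio=0.1,
                          test_ratio=0.1)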
If you are deploying your model in a production environment, you can use the deploy_mode parameter and set it to True when calling TSDataset.from_pandas, TSDataset.from_parquet or TSDataset.from_prometheus, which will reduce data processing latency and set the necessary parameters for data processing and feature engineering.
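For example, a deployment pipeline could simply carry the extra flag on the same creation call (a sketch, assuming the columns used above):
# deploy_mode=True trades flexibility for lower data processing latency
tsdata = TSDataset.from_pandas(df,
                               dt_col="Datetime",
                               id_col="Server id",
                               target_col=["CPU usage", "Mem usage"],
                               deploy_mode=True)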
3. Time series dataset preprocessing#
TSDataset supports impute, deduplicate and resample. You may fill missing points with impute in different modes, remove records that are exactly the same with deduplicate, and change the sampling frequency with resample. XShardsTSDataset only supports impute for now.
A typical cascade call for preprocessing is:
# TSDataset
tsdata.deduplicate().resample(interval="2s").impute()
# XShardsTSDataset (only impute is supported for now)
tsdata.impute()
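impute supports several filling strategies; a brief sketch, assuming the common "last", "const" and "linear" modes are available in your Chronos version:
# fill missing values with the last seen value
tsdata.impute(mode="last")
# fill missing values with a constant
tsdata.impute(mode="const", const_num=0)
# fill missing values by linear interpolation
tsdata.impute(mode="linear")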
4. Feature scaling#
Scaling all features to one distribution is important, especially when we want to train a machine learning/deep learning system. Scaling will make the training process much more stable. Still, always remember to unscale the prediction result at the end.
TSDataset and XShardsTSDataset support all the scalers in sklearn through the scale and unscale methods.
Since a scaler should only be fit on the training set (and never refit on the validation or test set), a typical call for scaling operations is:
# TSDataset: a single sklearn scaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# scale
for tsdata in [tsdata_train, tsdata_valid, tsdata_test]:
    tsdata.scale(scaler, fit=tsdata is tsdata_train)
# unscale
for tsdata in [tsdata_train, tsdata_valid, tsdata_test]:
    tsdata.unscale()
# XShardsTSDataset: one sklearn scaler per id
from sklearn.preprocessing import StandardScaler
scaler = {"id1": StandardScaler(), "id2": StandardScaler()}
# scale
for tsdata in [tsdata_train, tsdata_valid, tsdata_test]:
    tsdata.scale(scaler, fit=tsdata is tsdata_train)
# unscale
for tsdata in [tsdata_train, tsdata_valid, tsdata_test]:
    tsdata.unscale()
unscale_numpy in TSDataset or unscale_xshards in XShardsTSDataset is specially designed for forecasters. Users may unscale the output of a forecaster with this operation.
A typical call is:
# TSDataset: unscale numpy output
x, y = tsdata_test.scale(scaler)\
                  .roll(lookback=..., horizon=...)\
                  .to_numpy()
yhat = forecaster.predict(x)
unscaled_yhat = tsdata_test.unscale_numpy(yhat)
unscaled_y = tsdata_test.unscale_numpy(y)
# calculate metric by unscaled_yhat and unscaled_y
# XShardsTSDataset: unscale xshards output
data = tsdata_test.scale(scaler)\
                  .roll(lookback=..., horizon=...)\
                  .to_xshards()
yhat = forecaster.predict(data)
unscaled_yhat = tsdata_test.unscale_xshards(yhat)
unscaled_y = tsdata_test.unscale_xshards(data, key="y")
# calculate metric by unscaled_yhat and unscaled_y
5. Feature generation#
Other than historical target data and the extra features provided by users, some additional features can be generated automatically by TSDataset. gen_dt_feature helps users generate 10 datetime-related features (e.g. MONTH, WEEKDAY, ...). gen_global_feature and gen_rolling_feature are powered by tsfresh and generate aggregated features (e.g. min, max, ...) for each time series or rolling window respectively.
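A minimal sketch of chaining the generation calls (assuming gen_rolling_feature accepts a window_size and that both tsfresh-based methods accept a settings preset such as "minimal" in your Chronos version):
# generate datetime features such as MONTH and WEEKDAY
tsdata.gen_dt_feature()
# generate tsfresh aggregated features over rolling windows
tsdata.gen_rolling_feature(window_size=24, settings="minimal")
# generate tsfresh aggregated features over each whole time series
tsdata.gen_global_feature(settings="minimal")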
6. Sampling and exporting#
A time series dataset needs to be sampled and exported as a numpy ndarray/dataloader to be used in machine learning and deep learning models (e.g. forecasters, anomaly detectors, auto models, etc.).
Warning
You don't need to call any sampling or exporting methods introduced in this section when using AutoTSEstimator.
6.1 Roll sampling#
Roll sampling (or sliding window sampling) is useful when you want to train an RR-type supervised deep learning forecasting model. It works as the diagram shows.
Please refer to the roll API doc for detailed behavior. Users can simply export the sampling result as a numpy ndarray by to_numpy, a pytorch dataloader by to_torch_data_loader, a tensorflow dataset by to_tf_dataset or an xshards object by to_xshards.
Note
Difference between roll and to_torch_data_loader:
.roll(...) performs the rolling before RR forecasters/auto models training, while .to_torch_data_loader(...) performs the rolling during the training.
It is fine to use either of them when you have a relatively small dataset (less than 1G). .to_torch_data_loader(...) is recommended when you have a large dataset (larger than 1G) to save memory usage.
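A sketch of the dataloader path (assuming to_torch_data_loader accepts lookback, horizon and batch_size arguments, and a forecaster that can fit directly on a dataloader):
# rolling happens lazily inside the dataloader, which saves memory on large datasets
train_loader = tsdata_train.to_torch_data_loader(lookback=..., horizon=...,
                                                 batch_size=32)
forecaster.fit(train_loader)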
Note
Roll sampling format:
As described in the RR style forecasting concept, the sampling result will have the following shape requirement. Please follow the same shape if you use a customized data creator.
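For reference, the rolled arrays are assumed to follow the usual RR convention below (please verify against the RR style forecasting concept page for your Chronos version):
# expected shapes after rolling (assumed convention):
# x.shape == (num_samples, lookback, input_feature_num)
# y.shape == (num_samples, horizon, output_target_num)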
A typical call of roll is as follows:
# forecaster fit with numpy arrays (TSDataset)
x, y = tsdata.roll(lookback=..., horizon=...).to_numpy()
forecaster.fit((x, y))
# forecaster fit with an xshards object (XShardsTSDataset)
data = tsdata.roll(lookback=..., horizon=...).to_xshards()
forecaster.fit(data)
6.2 Pandas Exporting#
Now we support pandas dataframe exporting through to_pandas() for users to carry out their own transformations. Here is an example of using only one time series for anomaly detection.
# anomaly detector on "target" col
x = tsdata.to_pandas()["target"].to_numpy()
anomaly_detector.fit(x)
View TSDataset API Doc for more details.
7. Built-in Dataset#
The built-in dataset utility supports downloading and preprocessing several public datasets and returning them as TSDataset objects.
Dataset name | Task | Time Series Length | Number of Instances | Feature Number | Information Page | Download Link |
---|---|---|---|---|---|---|
network_traffic | forecasting | 8760 | 1 | 2 | network_traffic | network_traffic |
nyc_taxi | forecasting | 10320 | 1 | 1 | nyc_taxi | nyc_taxi |
fsi | forecasting | 1259 | 1 | 1 | fsi | fsi |
AIOps | anomaly_detect | 61570 | 1 | 1 | AIOps | AIOps |
uci_electricity | forecasting | 140256 | 370 | 1 | uci_electricity | uci_electricity |
tsinghua_electricity | forecasting | 26304 | 321 | 1 | tsinghua_electricity | tsinghua_electricity |
Specify the name, and the raw data file will be saved in the specified path (defaults to ~/.chronos/dataset). redownload can help you re-download the files you need.
When with_split is set to True, the dataset will be divided according to the specified val_ratio and test_ratio, and three TSDataset objects will be returned. with_split defaults to True, and val_ratio and test_ratio both default to 0.1. If you only need one TSDataset, just set with_split to False.
For more details about TSDataset, please refer to here.
# load built-in dataset
from bigdl.chronos.data import get_public_dataset
from sklearn.preprocessing import StandardScaler
tsdata_train, tsdata_val, tsdata_test = get_public_dataset(name='nyc_taxi',
                                                           with_split=True,
                                                           val_ratio=0.1,
                                                           test_ratio=0.1)
# carry out additional customized preprocessing on the dataset.
stand = StandardScaler()
for tsdata in [tsdata_train, tsdata_val, tsdata_test]:
    tsdata.gen_dt_feature(one_hot_features=['HOUR'])\
          .impute()\
          .scale(stand, fit=tsdata is tsdata_train)