
Preprocess my own data#

Introduction#

As we all know, data is the core part of machine learning and deep learning, and how we preprocess our own data greatly influences the results of training.

Usually, there may be some low-quality data in our own dataset, such as a wrong datetime type, missing values, or inconsistent time intervals. We need to fix these problems for better training. We may also need to do some feature engineering work for better accuracy.

TSDataset provides a bunch of functions to help you deal with the above cases. In this guide, we demonstrate how to preprocess your own data in detail.

We will take a random dataframe as an example in this guide.

Setup#

Before we begin, we need to install Chronos if it isn’t already available. We choose PyTorch as the deep learning backend.

[ ]:
!pip install --pre --upgrade bigdl-chronos[pytorch]
!pip uninstall -y torchtext

Initialize TSDataset#

Here we take a random pandas dataframe df as an example; you can see the first five samples of this dataframe:

[9]:
df.head()
[9]:
a b c d e datetime id
0 NaN NaN NaN 0.439440 0.741009 2019-01-01 00:00:00 00
1 0.595070 0.030522 0.739803 0.452446 0.593899 2019-01-02 00:00:00 00
2 0.032905 NaN NaN NaN 0.685142 2019-01-03 00:00:00 00
3 0.517125 0.131058 0.491102 NaN NaN 1/2/2019 00
4 0.116797 0.270899 NaN 0.876024 0.980061 2019-01-05 00:00:00 00
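
This guide does not show the code that builds df. Below is a minimal sketch that constructs a dataframe with the same structure; the random values, NaN positions, and the single malformed datetime string are purely illustrative.

[ ]:
import numpy as np
import pandas as pd

np.random.seed(0)

# 100 days of random values for five target columns
df = pd.DataFrame(np.random.rand(100, 5), columns=['a', 'b', 'c', 'd', 'e'])

# keep datetimes as strings so we can inject one wrongly formatted entry
df["datetime"] = pd.date_range("2019-01-01", periods=100).astype(str)
df.loc[3, "datetime"] = "1/2/2019"      # inconsistent datetime format
df.loc[0, ['a', 'b', 'c']] = np.nan     # some missing values
df["id"] = "00"                         # a single time series id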

First you should initialize a TSDataset by TSDataset.from_pandas, and we now provide an automatic quality check during the initialization process. Therefore, if:

  • your datetime column has a wrong type

  • there exist missing values in your data

  • there are inconsistent time intervals in your data

You will see warnings after initialization.

[ ]:
from bigdl.chronos.data import TSDataset

tsdata_train, _, tsdata_test = TSDataset.from_pandas(df,
                                                     id_col="id",
                                                     dt_col="datetime",
                                                     target_col=['a', 'b', 'c', 'd', 'e'],
                                                     extra_feature_col=None,
                                                     with_split=True)

📝Note

  • If your data is a parquet file, you should call TSDataset.from_parquet (see the sketch after this note).

  • If your data is stored in Prometheus, you should call TSDataset.from_prometheus.
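
For instance, a minimal sketch of loading from a parquet file, assuming a hypothetical path my_data.parquet and that the column arguments mirror those of from_pandas:

[ ]:
tsdata_train, _, tsdata_test = TSDataset.from_parquet("my_data.parquet",  # hypothetical path
                                                      id_col="id",
                                                      dt_col="datetime",
                                                      target_col=['a', 'b', 'c', 'd', 'e'],
                                                      extra_feature_col=None,
                                                      with_split=True)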

Preprocess data#

There are two ways for you to preprocess your data: automatic repair or manual preprocessing. You can combine the two, or just choose manual preprocessing.

Automatic repair#

If you see warnings during initialization, you can choose automatic data repair by setting repair=True at initialization, which will:

  • change your datetime column to datetime64 type

  • resample your data based on the mode of the time intervals

  • fill in your missing values

[ ]:
from bigdl.chronos.data import TSDataset

tsdata_train, _, tsdata_test = TSDataset.from_pandas(df,
                                                     id_col="id",
                                                     dt_col="datetime",
                                                     target_col=['a', 'b', 'c', 'd', 'e'],
                                                     extra_feature_col=None,
                                                     with_split=True,
                                                     repair=True)

Below is the repaired dataframe. As you can see, the time interval is consistent now and missing values have been filled.

[12]:
tsdata_train.df.head()
[12]:
datetime a b c d e id
0 2019-01-01 0.556098 0.080790 0.615452 0.439440 0.741009 00
1 2019-01-02 0.556098 0.080790 0.615452 0.452446 0.593899 00
2 2019-01-03 0.032905 0.144160 0.674160 0.593639 0.685142 00
3 2019-01-04 0.074851 0.207529 0.732868 0.734831 0.832602 00
4 2019-01-05 0.116797 0.270899 0.791577 0.876024 0.980061 00

Preprocess data manually#

TSDataset provides a bunch of functions to help you process your data:

  • impute: fill in missing values

  • resample: resample on a new interval for each univariate time series distinguished by id_col and feature_col

  • deduplicate: remove duplicated records

  • scale: scale the time series dataset’s feature columns and target columns according to a sklearn scaler instance

And these methods can be cascaded. A common processing flow is as follows:

[13]:
from sklearn.preprocessing import StandardScaler
stand = StandardScaler()

for tsdata in [tsdata_train, tsdata_test]:
    # fit the scaler on the training set only, then transform both sets
    tsdata.deduplicate().impute()\
          .scale(stand, fit=tsdata is tsdata_train)
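
The cascade above does not change the sampling interval. If you also need resample, a minimal sketch could look like the one below; the "2D" offset string and the merge_mode value are illustrative, so check the TSDataset.resample API reference for the exact supported arguments.

[ ]:
# resample each univariate series to a 2-day interval, merging
# overlapping records by their mean (illustrative arguments)
tsdata_train.resample(interval="2D", merge_mode="mean")
tsdata_test.resample(interval="2D", merge_mode="mean")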

You can also do some feature engineering work. For example, TSDataset provides gen_dt_feature to generate datetime feature(s) for each record.

[ ]:
tsdata_train.gen_dt_feature()
tsdata_test.gen_dt_feature()

Now take a look at the new dataframe again. Obviously, the value range has been changed by scale and new datetime features (DAY, DAYOFYEAR, WEEKDAY, WEEKOFYEAR, MONTH, YEAR, IS_WEEKEND) have been added by gen_dt_feature.

[15]:
tsdata_train.df.head()
[15]:
datetime a b c d e id DAY DAYOFYEAR WEEKDAY WEEKOFYEAR MONTH YEAR IS_WEEKEND
0 2019-01-01 0.280584 -1.448139 0.272633 -0.284876 0.829947 00 1 1 1 1 1 2019 0
1 2019-01-02 0.280584 -1.448139 0.272633 -0.235544 0.282492 00 2 2 2 1 1 2019 0
2 2019-01-03 -2.089437 -1.201914 0.501847 0.300001 0.622045 00 3 3 3 1 1 2019 0
3 2019-01-04 -1.899425 -0.955689 0.731060 0.835545 1.170800 00 4 4 4 1 1 2019 0
4 2019-01-05 -1.709413 -0.709464 0.960274 1.371090 1.719554 00 5 5 5 1 1 2019 1

Then you have finished basic data preprocessing. As the next step for deep learning, you should roll your data according to lookback and horizon, which will be introduced in detail in another guide.
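
As a quick preview, that rolling step might look like the following minimal sketch, where lookback=24 and horizon=1 are arbitrary example values:

[ ]:
# roll the preprocessed data into (lookback, horizon) windows;
# 24 and 1 are arbitrary example values
for tsdata in [tsdata_train, tsdata_test]:
    tsdata.roll(lookback=24, horizon=1)

# retrieve the rolled numpy arrays for training
x_train, y_train = tsdata_train.to_numpy()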