Speed up inference of forecaster through ONNXRuntime
Speeding up the inference process is often desirable, and one way to do this is to use an accelerator such as ONNXRuntime. In Chronos, ONNXRuntime acceleration is easy to use: simply call predict_with_onnx (and optionally build_onnx). In this guide, we demonstrate in detail how to speed up inference of a forecaster through ONNXRuntime.
We will take TCNForecaster and the nyc_taxi dataset as an example in this guide.
Before we begin, we need to install Chronos if it isn't already available. We choose to use PyTorch as the deep learning backend.
!pip install --pre --upgrade bigdl-chronos[pytorch]

# install ONNXRuntime
!pip install onnx
!pip install onnxruntime

# uninstall torchtext to avoid version conflict
!pip uninstall -y torchtext
Although Chronos supports inferencing on a cluster, the speed-up method described here can only be used when the forecaster is a non-distributed version.

Only PyTorch-backend deep learning forecasters support ONNXRuntime acceleration.
Before the inferencing process, a forecaster should be created and trained. The training process is introduced in detail in the previous guide Train forecaster on single node, so here we directly create and train a TCNForecaster on the nyc_taxi dataset.
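The get_data and get_trained_forecaster functions used later in this guide are helper functions, not Chronos APIs. A minimal sketch of what they might look like, assuming the public nyc_taxi dataset and illustrative lookback/horizon values:

def get_data():
    from bigdl.chronos.data import get_public_dataset
    from sklearn.preprocessing import StandardScaler

    # download and split the built-in nyc_taxi dataset
    tsdata_train, _, tsdata_test = get_public_dataset(name='nyc_taxi')

    # impute missing values and standardize the target column,
    # fitting the scaler on the training split only
    scaler = StandardScaler()
    for tsdata in [tsdata_train, tsdata_test]:
        tsdata.impute()\
              .scale(scaler, fit=tsdata is tsdata_train)

    # wrap the TSDatasets into pytorch dataloaders
    train_data = tsdata_train.to_torch_data_loader(lookback=48, horizon=1)
    test_data = tsdata_test.to_torch_data_loader(lookback=48, horizon=1,
                                                 shuffle=False)
    return train_data, test_data

def get_trained_forecaster(train_data):
    from bigdl.chronos.forecaster.tcn_forecaster import TCNForecaster

    # create a TCNForecaster whose shape matches the rolled data
    forecaster = TCNForecaster(past_seq_len=48,
                               future_seq_len=1,
                               input_feature_num=1,
                               output_feature_num=1)
    # train on the pytorch dataloader
    forecaster.fit(train_data, epochs=3)
    return forecaster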
Speeding up inference
When a trained forecaster is ready and the forecaster is a non-distributed version, we provide the predict_with_onnx method to speed up inference. The method can be called directly without calling build_onnx first, and the forecaster will automatically build an onnxruntime session with default settings.
build_onnx is recommended in the following cases (see the sketch after this list):

To strictly control the number of threads used during inferencing.

To alleviate the cold-start problem when predict_with_onnx is called for the first time.
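For example, the session can be built once ahead of time with an explicit thread count, so that the first predict_with_onnx call no longer pays the session-building cost. A minimal sketch, assuming the trained forecaster and a test batch x from the code later in this guide; thread_num=1 is just an illustrative value:

# build the onnxruntime session ahead of time, restricted to 1 thread
forecaster.build_onnx(thread_num=1)

# subsequent calls reuse the pre-built session
yhat = forecaster.predict_with_onnx(x.numpy())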
Please refer to the API documentation for more information on predict_with_onnx.

The predict_with_onnx method supports data in the following formats (an example follows the list):
numpy ndarray (recommended)
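As a quick illustration, given a trained forecaster as created below, the input ndarray is expected to be 3-dimensional, i.e. (num_samples, past_seq_len, input_feature_num). A sketch, with shape values matching the illustrative settings assumed earlier in this guide:

import numpy as np

# one sample of 48 past steps with 1 feature
dummy_x = np.random.randn(1, 48, 1).astype(np.float32)
yhat = forecaster.predict_with_onnx(dummy_x)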
There are also batch_size and quantize parameters you may want to change. If you are not familiar with manual hyperparameter tuning, just leave batch_size at its default value. Additionally, quantize can be set to True to use the quantized onnx model to predict.
# get data for training and testing
train_data, test_data = get_data()

# get a trained forecaster
forecaster = get_trained_forecaster(train_data)
# speed up inference through ONNXRuntime
for x, y in test_data:
    yhat = forecaster.predict_with_onnx(x.numpy())  # predict
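To use the quantized path mentioned above instead, a quantized model has to exist before predicting. A hedged sketch of one possible flow; the calib_data argument and the 'onnxrt_qlinearops' framework name are assumptions based on the Chronos quantization API:

# quantize the forecaster with an ONNXRuntime backend first
forecaster.quantize(calib_data=train_data, framework='onnxrt_qlinearops')

# then predict with the quantized onnx model
yhat = forecaster.predict_with_onnx(x.numpy(), quantize=True)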
Let's see the acceleration performance of predict_with_onnx.

The predict latencies without an accelerator and with ONNXRuntime are given below. In the results, "p50" means the 50th-percentile latency over multiple predictions; the acceleration is significant.
from bigdl.chronos.metric.forecast_metrics import Evaluator

# take one batch of input from the test dataloader
x, y = next(iter(test_data))

def func_original():
    forecaster.predict(x.numpy())  # without accelerator

def func_onnxruntime():
    forecaster.predict_with_onnx(x.numpy())  # with ONNXRuntime

print("original predict runtime (ms):", Evaluator.get_latency(func_original))
print("predict runtime with ONNXRuntime (ms):", Evaluator.get_latency(func_onnxruntime))