View the runnable example on GitHub

OpenVINO Asynchronous Inference using Nano API#

You can use the async_predict method of the OpenVINOModel class in Nano to run asynchronous inference on an OpenVINO model. It takes only a few lines.

To run asynchronous inference on an OpenVINO model with Nano, the following dependencies need to be installed first:

[ ]:
# for BigDL-Nano
!pip install --pre --upgrade bigdl-nano # install the nightly-built version
!source bigdl-nano-init

# for OpenVINO
!pip install openvino-dev

📝 Note

We recommend running the commands above, especially source bigdl-nano-init, before the Jupyter kernel is started, or some of the optimizations may not take effect.

Let’s take a resnet18-xnor-binary-onnx-0001 model pretrained on the ImageNet dataset from the Open Model Zoo as an example. First, we download the model using omz_downloader:

[ ]:
!omz_downloader --name resnet18-xnor-binary-onnx-0001 -o ./model

Then, load the model using the OpenVINOModel class.

[ ]:
from bigdl.nano.openvino import OpenVINOModel

ov_model = OpenVINOModel("model/intel/resnet18-xnor-binary-onnx-0001/FP16-INT1/resnet18-xnor-binary-onnx-0001.xml")

To run asynchronous inference on an OpenVINO model, the only change you need to make is to prepare a list of input data and call ov_model.async_predict(input_data, num_requests):

[ ]:
import numpy as np

input_data = [np.random.randn(1, 3, 224, 224) for i in range(5)]
async_results = ov_model.async_predict(input_data=input_data, num_requests=5)
for res in async_results:
    predictions = res.argmax(axis=1)
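
Each element of async_results is the raw output of one infer request, an array of shape (batch_size, num_classes), and argmax over axis 1 picks the most likely class for each sample. Below is a minimal numpy-only sketch of this postprocessing step; the logits array is synthetic, not real model output:

```python
import numpy as np

# synthetic "logits" mimicking one infer-request result: 1 sample, 1000 classes
logits = np.zeros((1, 1000))
logits[0, 42] = 10.0  # make class 42 the highest score

predictions = logits.argmax(axis=1)
print(predictions)  # [42]
```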

📝 Note

async_predict accepts multiple groups of input data in a list; each group is processed by a separate asynchronous infer request, and a list containing the results of all requests is returned. If you have multiple groups of input data to infer, async_predict will achieve better performance than synchronous inference using ov_model(x).

You can specify the number of asynchronous infer requests through num_requests. If num_requests is set to 0, the value will be chosen automatically as the optimal number.

In the code above, we have 5 groups of input data and create 5 asynchronous infer requests. When async_predict is called, the asynchronous infer requests run inference in a parallel pipeline.
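
Conceptually, this is similar to submitting each group of input data to a pool of concurrent workers and collecting the results in order. The stdlib-only sketch below illustrates the pattern with a hypothetical fake_infer function standing in for a real OpenVINO infer request; it is an analogy, not how async_predict is implemented internally:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def fake_infer(x):
    # stand-in for one infer request; a real request would run the model
    return x.mean(axis=(1, 2, 3))

inputs = [np.random.randn(1, 3, 224, 224) for _ in range(5)]

# 5 workers ~ num_requests=5: each input group is processed concurrently,
# and results come back in the same order as the inputs
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fake_infer, inputs))

print(len(results))  # 5
```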