Accelerate Inference on Intel GPUs Using OpenVINO

You can call InferenceOptimizer.trace(..., accelerator='openvino', device='GPU') to enable OpenVINO acceleration for inference on Intel GPUs, both integrated and discrete. BigDL-Nano also supports quantization with the OpenVINO accelerator on Intel GPUs through InferenceOptimizer.quantize(..., accelerator='openvino', device='GPU', precision='fp16'/'int8'). Either way, it only takes a few lines of code.

Before starting this guide, you can use the code below to list the Intel GPU devices available on your machine. You can run inference on any of them.

from openvino.runtime import Core
core = Core()
print(core.available_devices)

The available_devices property returns a list of device names. Intel GPUs appear under the following names:

output	corresponding GPU device(s)
`GPU`	alias for `GPU.0`, integrated GPU
`GPU.X`	enumeration of GPUs, `X` - id of the GPU device
`GPU.X.Y`	specific tile in a multi-tile architecture, `X` - id of the GPU device, `Y` - id of the tile within device `X`

For more information on the OpenVINO device naming convention, refer to this page.
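The naming scheme in the table can also be unpacked programmatically. The sketch below is a minimal illustration, assuming device names follow exactly the `GPU`, `GPU.X` and `GPU.X.Y` patterns listed above. `GpuDevice` and `parse_gpu_device` are hypothetical helpers, not part of OpenVINO or BigDL-Nano.

```python
import re
from typing import NamedTuple, Optional

# Matches 'GPU', 'GPU.X' or 'GPU.X.Y' where X and Y are ASCII digits.
_GPU_NAME = re.compile(r"GPU(?:\.(\d+)(?:\.(\d+))?)?", re.ASCII)


class GpuDevice(NamedTuple):
    """Parsed form of a GPU device name (hypothetical helper type)."""
    device_id: int
    tile_id: Optional[int]


def parse_gpu_device(name: str) -> Optional[GpuDevice]:
    """Parse a GPU device name; return None for any non-GPU name."""
    match = _GPU_NAME.fullmatch(name)
    if match is None:
        return None
    device, tile = match.groups()
    # A bare 'GPU' is an alias for 'GPU.0'.
    device_id = int(device) if device is not None else 0
    tile_id = int(tile) if tile is not None else None
    return GpuDevice(device_id, tile_id)


# Example list; on a real machine it would come from Core().available_devices.
for name in ["CPU", "GPU", "GPU.1", "GPU.1.0"]:
    print(name, "->", parse_gpu_device(name))
```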

PyTorch example

Let’s take a ResNet-18 model pretrained on the ImageNet dataset as an example. Note that you don’t have to move the model to the GPU or set it to evaluation mode, since InferenceOptimizer handles both automatically.

from torchvision.models import resnet18

pt_model = resnet18(pretrained=True)
_, train_dataset, val_dataset = finetune_pet_dataset(pt_model)

The full definition of the function finetune_pet_dataset can be found in the runnable example.

To enable OpenVINO acceleration for your PyTorch inference pipeline on Intel GPUs, the only change you need to make is to import the BigDL-Nano InferenceOptimizer and trace your PyTorch model with device='GPU', which converts it into an OpenVINO-accelerated module for inference.

📝 Note

Setting device to 'GPU' runs inference on the default Intel GPU device. You can specify another device ('GPU.X' or 'GPU.X.Y') instead.
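If the same script has to run on machines with different hardware, it can help to choose the device string at runtime. The following is a minimal sketch, assuming the list of names comes from `Core().available_devices` as shown earlier. `pick_device` is a hypothetical helper, and falling back to `'CPU'` is an illustrative choice rather than documented BigDL-Nano behavior.

```python
from typing import Sequence


def pick_device(available: Sequence[str], preferred: str = "GPU") -> str:
    """Return `preferred` if it is available, otherwise fall back to 'CPU'."""
    if preferred in available:
        return preferred
    # 'GPU' is an alias for 'GPU.0', so accept it when 'GPU.0' is listed.
    if preferred == "GPU" and "GPU.0" in available:
        return preferred
    return "CPU"


print(pick_device(["CPU", "GPU"]))                      # GPU
print(pick_device(["CPU", "GPU.0", "GPU.1"], "GPU.1"))  # GPU.1
print(pick_device(["CPU"]))                             # CPU
```

The returned string could then be passed as the device argument of InferenceOptimizer.trace, assuming the chosen name is one the accelerator accepts.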

import torch
from bigdl.nano.pytorch import InferenceOptimizer

ov_model = InferenceOptimizer.trace(pt_model,
                                    accelerator="openvino",
                                    input_sample=torch.rand(1, 3, 224, 224),
                                    device='GPU')

📝 Note

input_sample tells the OpenVINO accelerator the shape of the model input, so neither its batch size nor its specific values matter. For example, if the test dataset consists of images of \(224 \times 224\) pixels, you can use torch.rand(1, 3, 224, 224) as input_sample.

Please refer to the API documentation for more information on InferenceOptimizer.trace.

If you want to quantize your model with the OpenVINO Post-training Optimization Tool (POT), you can call InferenceOptimizer.quantize.

For FP16 quantization:

import torch
from bigdl.nano.pytorch import InferenceOptimizer

ov_model = InferenceOptimizer.quantize(pt_model,
                                       accelerator='openvino',
                                       input_sample=torch.rand(1, 3, 224, 224),
                                       device='GPU',
                                       precision='fp16')

For INT8 quantization:

import torch
from torch.utils.data import DataLoader
from bigdl.nano.pytorch import InferenceOptimizer

# calib_data supplies the samples used for static INT8 calibration
ov_model = InferenceOptimizer.quantize(pt_model,
                                       accelerator='openvino',
                                       input_sample=torch.rand(1, 3, 224, 224),
                                       device='GPU',
                                       precision='int8',
                                       calib_data=DataLoader(train_dataset, batch_size=32))

📝 Note

For INT8 quantization, we adopt the Post-training Optimization Tool (POT) provided by the OpenVINO toolkit, which only supports static post-training quantization, so calib_data (calibration data) is always required when accelerator='openvino'. The batch size is not important here, since calibration is set to read 100 samples. The calibration data does not need to contain labels.

Please refer to the API documentation for more information on InferenceOptimizer.quantize.

You can then run inference as usual with the model optimized by OpenVINO:

with InferenceOptimizer.get_context(ov_model):
    x = torch.rand(2, 3, 224, 224)
    # use the optimized model here
    y_hat = ov_model(x)
    predictions = y_hat.argmax(dim=1)
    print(predictions)
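To check whether the optimized model is actually faster on your hardware, you can time it against the original model. The sketch below is a minimal, framework-agnostic timing helper. `average_latency` is a hypothetical name, and the warm-up and run counts are arbitrary defaults. A stand-in callable is used so the snippet runs without a GPU; in practice you would pass pt_model or ov_model instead.

```python
import time
from typing import Any, Callable


def average_latency(model: Callable[[Any], Any], x: Any,
                    warmup: int = 3, runs: int = 10) -> float:
    """Return the mean wall-clock time in seconds of one `model(x)` call."""
    # Warm-up calls are not timed, since first calls often include setup cost.
    for _ in range(warmup):
        model(x)
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - start) / runs


# Stand-in for a real model so the example runs anywhere.
def dummy_model(x):
    return [v * 2 for v in x]


latency = average_latency(dummy_model, list(range(1000)))
print(f"{latency * 1e3:.3f} ms per call")
```

For the PyTorch models above, the timed calls for ov_model would go inside the InferenceOptimizer.get_context(ov_model) block, using the same input tensor for both models.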

TensorFlow example

Let’s take MobileNetV2 as an example. Note that you don’t have to move the model to the GPU at this step, since InferenceOptimizer handles this automatically.

import tensorflow as tf
from tensorflow.keras.applications.mobilenet_v2 import MobileNetV2
import numpy as np

tf_model = MobileNetV2(weights=None, input_shape=[40, 40, 3], classes=10)

train_examples = np.random.random((100, 40, 40, 3))
train_labels = np.random.randint(0, 10, size=(100,))
train_dataset = tf.data.Dataset.from_tensor_slices((train_examples, train_labels))

To enable OpenVINO acceleration for your TensorFlow inference pipeline on Intel GPUs, the only change you need to make is to import the BigDL-Nano InferenceOptimizer and trace your TensorFlow model with device='GPU', which converts it into an OpenVINO-accelerated module for inference.

📝 Note

Setting device to 'GPU' runs inference on the default Intel GPU device. You can specify another device ('GPU.X' or 'GPU.X.Y') instead.

from bigdl.nano.tf.keras import InferenceOptimizer

ov_model = InferenceOptimizer.trace(tf_model,
                                    accelerator="openvino",
                                    device='GPU')

If you want to quantize your model with the OpenVINO Post-training Optimization Tool (POT), you can call InferenceOptimizer.quantize.

For FP16 quantization:

from bigdl.nano.tf.keras import InferenceOptimizer

ov_model = InferenceOptimizer.quantize(tf_model,
                                       accelerator='openvino',
                                       device='GPU',
                                       precision='fp16')

For INT8 quantization:

from bigdl.nano.tf.keras import InferenceOptimizer

ov_model = InferenceOptimizer.quantize(tf_model,
                                       accelerator='openvino',
                                       device='GPU',
                                       precision='int8',
                                       x=train_dataset)

📝 Note

For INT8 quantization, we adopt the Post-training Optimization Tool (POT) provided by the OpenVINO toolkit, which only supports static post-training quantization, so x (which serves as calibration data) is always required when accelerator='openvino'. The calibration data does not need to contain labels.

Please refer to the API documentation for more information on InferenceOptimizer.quantize.

You can then run inference as usual with the model optimized by OpenVINO:

x = tf.random.normal(shape=(100, 40, 40, 3))
# use the optimized model here
y_hat = ov_model(x)
predictions = tf.argmax(y_hat, axis=1)
print(predictions)

📚 Related Readings

How to install BigDL-Nano

How to install BigDL-Nano in Google Colab