View the runnable example on GitHub
Quantize PyTorch Model in INT8 for Inference using OpenVINO Post-training Optimization Tools
Since Post-training Optimization Tools (POT) is provided by the OpenVINO toolkit, OpenVINO acceleration is enabled at the same time when you use POT for INT8 quantization. You can call the InferenceOptimizer.quantize API with accelerator='openvino' (and precision='int8') to apply POT to your PyTorch nn.Module. It only takes a few lines.
Let’s take a ResNet-18 model pretrained on the ImageNet dataset and finetuned on the OxfordIIITPet dataset as an example:
[ ]:
from torchvision.models import resnet18

model = resnet18(pretrained=True)
# finetune_pet_dataset adapts the classifier head to the 37 pet classes and
# finetunes the model; it is defined in the runnable example
_, train_dataset, val_dataset = finetune_pet_dataset(model)
The full definition of the function finetune_pet_dataset can be found in the runnable example.
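To give a rough idea of what such a helper involves, here is a minimal sketch; the transforms, split ratio, and function body below are assumptions for illustration, not the exact code from the runnable example:

# Hypothetical sketch of a helper like finetune_pet_dataset: adapt the
# classifier head to the 37 Oxford-IIIT Pet classes, build the datasets,
# and finetune the model. Details here are assumptions, not the exact
# code from the runnable example.
import torch
from torch.utils.data import random_split
from torchvision import transforms
from torchvision.datasets import OxfordIIITPet

def finetune_pet_dataset_sketch(model):
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    dataset = OxfordIIITPet(root="data", transform=transform, download=True)
    # replace the ImageNet classifier head with a 37-class head
    model.fc = torch.nn.Linear(model.fc.in_features, 37)
    train_size = int(0.8 * len(dataset))
    train_dataset, val_dataset = random_split(
        dataset, [train_size, len(dataset) - train_size])
    # ... finetune `model` on train_dataset here ...
    return model, train_dataset, val_dataset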
To enable INT8 quantization using POT for inference, you can simply import BigDL-Nano's InferenceOptimizer, and use it to quantize your PyTorch model with accelerator='openvino':
[ ]:
from torch.utils.data import DataLoader
from bigdl.nano.pytorch import InferenceOptimizer

q_model = InferenceOptimizer.quantize(model,
                                      accelerator='openvino',
                                      calib_data=DataLoader(train_dataset, batch_size=32))
📝 Note
The InferenceOptimizer.quantize function has a precision parameter to specify the precision for quantization. It defaults to 'int8', so we omit the precision parameter here for INT8 quantization.

For INT8 quantization using POT, only static post-training quantization is supported, so calib_data (the calibration data) is always required when accelerator='openvino'.

For calib_data, the batch size is not important, as POT intends to read 100 samples in total. The calibration data may also contain no labels (see the sketch below).

Please refer to the API documentation for more information on InferenceOptimizer.quantize.
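For example, since calibration needs only inputs, a DataLoader over an unlabeled dataset also works as calib_data. The UnlabeledPets wrapper below is purely illustrative and not part of the Nano API:

# Illustration only: calib_data does not need labels, so a dataset that
# yields bare input tensors can serve as calibration data as well.
from torch.utils.data import Dataset, DataLoader

class UnlabeledPets(Dataset):
    """Hypothetical wrapper that drops labels from a labeled dataset."""
    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        x, _ = self.dataset[idx]  # keep the image, discard the label
        return x

calib_loader = DataLoader(UnlabeledPets(train_dataset), batch_size=1)
q_model = InferenceOptimizer.quantize(model,
                                      accelerator='openvino',
                                      calib_data=calib_loader)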
You can then run the normal inference steps with the quantized model under the context manager provided by Nano:
[ ]:
import torch

with InferenceOptimizer.get_context(q_model):
    x = torch.stack([val_dataset[0][0], val_dataset[1][0]])
    # use the quantized model here
    y_hat = q_model(x)
    predictions = y_hat.argmax(dim=1)
    print(predictions)
📝 Note
For all models optimized by InferenceOptimizer.quantize, you need to wrap the inference steps with an automatic context manager InferenceOptimizer.get_context(model=...) provided by Nano. You can refer to here for more detailed usage of the context manager.
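As one possible usage of this context manager, you could compare the latency of the original and the quantized model. This is a rough sketch under the assumption that the FP32 model is timed with plain torch.no_grad; actual numbers will vary by machine:

# Rough latency comparison between the original FP32 model and the
# INT8 OpenVINO model (illustrative only; results vary by machine).
import time
import torch

def avg_latency(forward, x, runs=100):
    forward(x)  # warm-up run
    start = time.perf_counter()
    for _ in range(runs):
        forward(x)
    return (time.perf_counter() - start) / runs

x = torch.stack([val_dataset[0][0], val_dataset[1][0]])

model.eval()
with torch.no_grad():
    fp32_latency = avg_latency(model, x)
with InferenceOptimizer.get_context(q_model):
    int8_latency = avg_latency(q_model, x)

print(f"FP32: {fp32_latency * 1000:.2f} ms/batch")
print(f"INT8: {int8_latency * 1000:.2f} ms/batch")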