View the runnable example on GitHub
Quantize PyTorch Model in INT8 for Inference using Intel Neural Compressor
With Intel Neural Compressor (INC) as the quantization engine, you can apply the InferenceOptimizer.quantize API to realize INT8 post-training quantization on your PyTorch nn.Module. InferenceOptimizer.quantize also supports ONNXRuntime acceleration at the same time through specifying accelerator='onnxruntime'. All of this acceleration takes only a few lines.
Let’s take a ResNet-18 model pretrained on the ImageNet dataset and finetuned on the OxfordIIITPet dataset as an example:
[ ]:
from torchvision.models import resnet18
model = resnet18(pretrained=True)
_, train_dataset, val_dataset = finetune_pet_dataset(model)
The full definition of the function finetune_pet_dataset can be found in the runnable example.
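For reference, below is a minimal sketch of what such a helper could look like. The dataset splits, transforms, training loop, and return values are our assumptions (the sketch is named finetune_pet_dataset_sketch to avoid confusion); the real implementation in the runnable example may differ.

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import OxfordIIITPet

def finetune_pet_dataset_sketch(model, data_root="data", epochs=1):
    # Standard ImageNet-style preprocessing (assumed).
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),
    ])
    train_dataset = OxfordIIITPet(root=data_root, split="trainval",
                                  transform=transform, download=True)
    val_dataset = OxfordIIITPet(root=data_root, split="test",
                                transform=transform, download=True)

    # Replace the ImageNet head with a 37-class head (OxfordIIITPet has 37 breeds).
    model.fc = nn.Linear(model.fc.in_features, 37)

    # A very short finetuning loop, only to make the sketch self-contained.
    loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
    optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    model.eval()

    return model, train_dataset, val_dataset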
To enable INT8 quantization using INC for inference, you can simply import BigDL-Nano InferenceOptimizer and use it to quantize your PyTorch model:
[ ]:
from torch.utils.data import DataLoader
from bigdl.nano.pytorch import InferenceOptimizer

q_model = InferenceOptimizer.quantize(model,
                                      calib_data=DataLoader(train_dataset, batch_size=32))
If you want to enable ONNXRuntime acceleration at the same time, you just need to specify the accelerator parameter:
[ ]:
from torch.utils.data import DataLoader
from bigdl.nano.pytorch import InferenceOptimizer

q_model = InferenceOptimizer.quantize(model,
                                      accelerator='onnxruntime',
                                      calib_data=DataLoader(train_dataset, batch_size=32))
📝 Note
The InferenceOptimizer.quantize function has a precision parameter to specify the precision for quantization. It defaults to 'int8', so we omit the precision parameter here for INT8 quantization.

During INT8 quantization using INC, InferenceOptimizer will by default quantize your PyTorch nn.Module through static post-training quantization. In this case, calib_data (calibration data) is required. Batch size is not important for calib_data, as the quantization process only intends to read 100 samples from it, and the calibration data does not need labels.

If you would like to implement dynamic post-training quantization, you can set the parameter approach='dynamic'; in this case, calib_data should be None (a short sketch follows this note). Compared to dynamic quantization, static quantization could lead to faster inference as it eliminates the data conversion costs between layers.

Please refer to the API documentation for more information on InferenceOptimizer.quantize.
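For instance, based on the approach parameter described in the note above, a dynamic post-training quantization call could look like the following sketch (the variable name q_model_dynamic is ours):

from bigdl.nano.pytorch import InferenceOptimizer

# Dynamic post-training quantization: no calibration data is needed.
q_model_dynamic = InferenceOptimizer.quantize(model,
                                              approach='dynamic')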
You can then do the normal inference steps with the quantized model under the context manager provided by Nano:
[ ]:
import torch

with InferenceOptimizer.get_context(q_model):
    x = torch.stack([val_dataset[0][0], val_dataset[1][0]])
    # use the quantized model here
    y_hat = q_model(x)
    predictions = y_hat.argmax(dim=1)
    print(predictions)
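As an optional sanity check (not part of the original example), you could run the same batch through the original finetuned FP32 model and confirm that its class predictions usually agree with the quantized ones:

model.eval()
with torch.no_grad():
    y_fp32 = model(x)          # original finetuned FP32 model
print(y_fp32.argmax(dim=1))    # typically matches `predictions` from above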
📝 Note
For all models optimized by Nano's InferenceOptimizer.quantize, you need to wrap the inference steps with an automatic context manager InferenceOptimizer.get_context(model=...) provided by Nano. You can refer to here for more detailed usage of the context manager.