View the runnable example on GitHub
Quantize PyTorch Model for Inference using Intel Neural Compressor
With Intel Neural Compressor (INC) as the quantization engine, you can apply the InferenceOptimizer.quantize API to perform post-training quantization on your PyTorch nn.Module. InferenceOptimizer.quantize also supports ONNXRuntime acceleration at the same time by specifying accelerator='onnxruntime'. Either way, acceleration takes only a few lines of code.
Let’s take a ResNet-18 model pretrained on the ImageNet dataset and finetuned on the OxfordIIITPet dataset as an example:
[ ]:
from torchvision.models import resnet18
model = resnet18(pretrained=True)
_, train_dataset, val_dataset = finetune_pet_dataset(model)
The full definition of the function finetune_pet_dataset can be found in the runnable example.
Then we set the model in evaluation mode:
[ ]:
model.eval()
To enable quantization using INC for inference, simply import BigDL-Nano InferenceOptimizer and use it to quantize your PyTorch model:
[ ]:
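from torch.utils.data import DataLoader
from bigdl.nano.pytorch import InferenceOptimizer
# Static post-training quantization: the dataloader supplies calibration data
q_model = InferenceOptimizer.quantize(model,
                                      calib_dataloader=DataLoader(train_dataset, batch_size=32))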
If you want to enable ONNXRuntime acceleration at the same time, just specify the accelerator parameter:
[ ]:
from torch.utils.data import DataLoader
from bigdl.nano.pytorch import InferenceOptimizer
q_model = InferenceOptimizer.quantize(model,
                                      accelerator='onnxruntime',
                                      calib_dataloader=DataLoader(train_dataset, batch_size=32))
📝 Note
InferenceOptimizer will by default quantize your PyTorch nn.Module through static post-training quantization. In this case, calib_dataloader (for calibration data) is required. The batch size of calib_dataloader is not important, as calibration is intended to read 100 samples. The calibration data does not need to contain labels.
If you would like to use dynamic post-training quantization instead, set the parameter approach='dynamic'. In this case, calib_dataloader should be None. Compared to dynamic quantization, static quantization can lead to faster inference, as it eliminates the data conversion costs between layers.
Please refer to the API documentation for more information on InferenceOptimizer.quantize.
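To make the difference between the two approaches in the note more concrete, here is a minimal pure-Python sketch. It is a conceptual illustration only and is not how INC or BigDL-Nano implement quantization: the affine int8 scheme, the sample values, and all function names (affine_params, quantize, static_quantize, dynamic_quantize) are assumptions made for this example. The static path derives its scale and zero point once from calibration data, while the dynamic path recomputes them for every input at inference time.

```python
from typing import List, Sequence, Tuple

QMIN, QMAX = -128, 127  # signed 8-bit integer range (assumed for this sketch)


def affine_params(values: Sequence[float]) -> Tuple[float, int]:
    """Derive a scale and zero point from the observed value range."""
    lo = min(min(values), 0.0)  # keep 0.0 inside the range so it maps to an exact integer
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (QMAX - QMIN) or 1.0  # fall back to 1.0 for an all-zero input
    zero_point = round(QMIN - lo / scale)
    return scale, zero_point


def quantize(values: Sequence[float], scale: float, zero_point: int) -> List[int]:
    """Map floats to int8 values, clamping anything outside the range."""
    return [max(QMIN, min(QMAX, round(v / scale) + zero_point)) for v in values]


# Static: parameters are computed once, ahead of time, from calibration samples.
calibration_batches = [[-1.0, 0.5, 2.0], [0.0, 1.5, 3.0]]  # hypothetical activations
static_scale, static_zp = affine_params(
    [v for batch in calibration_batches for v in batch]
)


def static_quantize(x: Sequence[float]) -> List[int]:
    """Reuse the precomputed calibration parameters for every input."""
    return quantize(x, static_scale, static_zp)


def dynamic_quantize(x: Sequence[float]) -> Tuple[List[int], float, int]:
    """Recompute the parameters from each input, which costs extra work per call."""
    scale, zero_point = affine_params(x)
    return quantize(x, scale, zero_point), scale, zero_point


print(static_quantize([-1.0, 0.0, 3.0]))
print(dynamic_quantize([0.0, 1.0]))
```

In this sketch the static path does no range computation at inference time, which mirrors why calibration data is needed up front, whereas the dynamic path adapts its scale to each input at the price of the extra pass over the values.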
You can then run the usual inference steps with the quantized model:
[ ]:
import torch
# build a small batch from the first two validation images
x = torch.stack([val_dataset[0][0], val_dataset[1][0]])
# use the quantized model here
y_hat = q_model(x)
predictions = y_hat.argmax(dim=1)
print(predictions)
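Since quantization can change some predictions, it may be worth comparing the quantized model's labels against those of the original model on the same inputs. The helper below is a small sketch of such a check and is not part of BigDL-Nano: the function name agreement_rate and the label lists are hypothetical. It works on plain lists of class indices, which could be obtained from tensors like predictions above via .tolist().

```python
from typing import Sequence


def agreement_rate(reference: Sequence[int], candidate: Sequence[int]) -> float:
    """Return the fraction of positions where both label sequences agree."""
    if len(reference) != len(candidate):
        raise ValueError("label sequences must have the same length")
    if len(reference) == 0:
        raise ValueError("label sequences must not be empty")
    matches = sum(1 for r, c in zip(reference, candidate) if r == c)
    return matches / len(reference)


# Hypothetical class indices predicted by the original and the quantized model
fp32_labels = [3, 7, 7, 1]
int8_labels = [3, 7, 2, 1]
print(agreement_rate(fp32_labels, int8_labels))  # 0.75
```

A rate well below 1.0 on a representative validation set would suggest inspecting the calibration data or the quantization approach.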
📚 Related Readings