Quantize PyTorch Model in INT8 for Inference using Intel Neural Compressor

With Intel Neural Compressor (INC) as the quantization engine, you can use the InferenceOptimizer.quantize API to apply INT8 post-training quantization to your PyTorch nn.Module. InferenceOptimizer.quantize also supports ONNXRuntime acceleration at the same time if you specify accelerator='onnxruntime'. Either way, it takes only a few lines of code.

Let’s take a ResNet-18 model pretrained on the ImageNet dataset and finetuned on the OxfordIIITPet dataset as an example:

[ ]:

from torchvision.models import resnet18

model = resnet18(pretrained=True)
_, train_dataset, val_dataset = finetune_pet_dataset(model)

The full definition of the function finetune_pet_dataset can be found in the runnable example.

To enable INT8 quantization using INC for inference, import BigDL-Nano InferenceOptimizer and use it to quantize your PyTorch model:

[ ]:

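from torch.utils.data import DataLoader
from bigdl.nano.pytorch import InferenceOptimizer

# static post-training quantization; calib_data supplies the calibration samples
q_model = InferenceOptimizer.quantize(model,
                                      calib_data=DataLoader(train_dataset, batch_size=32))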

If you also want to enable ONNXRuntime acceleration, just specify the accelerator parameter:

[ ]:

from torch.utils.data import DataLoader
from bigdl.nano.pytorch import InferenceOptimizer

q_model = InferenceOptimizer.quantize(model,
                                      accelerator='onnxruntime',
                                      calib_data=DataLoader(train_dataset, batch_size=32))

📝 Note

The InferenceOptimizer.quantize function has a precision parameter that specifies the precision for quantization. It defaults to 'int8', so we omit it here.

During INT8 quantization using INC, InferenceOptimizer by default quantizes your PyTorch nn.Module through static post-training quantization. In this case, calib_data (the calibration data) is required. The batch size of calib_data is not important, since calibration is intended to read only 100 samples. The calibration data does not need to contain labels.
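To make the role of calibration data more concrete, the following is a minimal sketch of affine INT8 quantization for a single list of values. It is not INC's implementation: the function names and the simple min/max range estimate are assumptions chosen for illustration, and the code uses only the Python standard library. Note that labels are never needed, because only the observed value range is used.

```python
# Simplified illustration of static post-training quantization for one tensor.
# NOT INC's actual algorithm; all names below are hypothetical.

def calibrate(samples, qmin=-128, qmax=127):
    """Derive a scale and zero point from the value range seen during calibration."""
    lo = min(min(s) for s in samples)
    hi = max(max(s) for s in samples)
    # Make sure 0.0 is inside the range so it maps exactly onto an integer.
    lo, hi = min(lo, 0.0), max(hi, 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for an all-zero range
    zero_point = round(qmin - lo / scale)
    return scale, zero_point


def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Map floats to integers in [qmin, qmax], clamping values outside the calibrated range."""
    return [max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values]


def dequantize(q_values, scale, zero_point):
    """Map quantized integers back to approximate float values."""
    return [(q - zero_point) * scale for q in q_values]


# Calibration: unlabeled samples only determine the range (here -1.0 to 3.0).
calib_samples = [[-1.0, 0.25, 0.5], [0.0, 1.5, 3.0]]
scale, zero_point = calibrate(calib_samples)

# Inference: the scale and zero point are fixed; 10.0 lies outside the range and is clamped.
q = quantize([0.0, 1.0, 3.0, 10.0], scale, zero_point)
restored = dequantize(q, scale, zero_point)
```

Because the scale and zero point are fixed after calibration, the calibration samples should be representative of the inputs seen at inference time; values outside the calibrated range are clamped, as the last input above shows.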

If you would like to use dynamic post-training quantization instead, set the parameter approach='dynamic'. In this case, calib_data should be None. Compared to dynamic quantization, static quantization can lead to faster inference, as it eliminates the data conversion costs between layers.
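The rules in this note can be summarized in a small helper that assembles the keyword arguments for InferenceOptimizer.quantize. This is only a sketch: build_quantize_kwargs is a hypothetical function, not part of BigDL-Nano, and the value 'static' for the default approach is an assumption (the text above only names 'dynamic' explicitly).

```python
def build_quantize_kwargs(approach="static", calib_data=None, accelerator=None):
    """Hypothetical helper: build keyword arguments for InferenceOptimizer.quantize."""
    # "static" as the name of the default approach is an assumption.
    if approach not in ("static", "dynamic"):
        raise ValueError(f"unknown approach: {approach!r}")
    if approach == "static" and calib_data is None:
        raise ValueError("static quantization requires calib_data")
    if approach == "dynamic" and calib_data is not None:
        raise ValueError("dynamic quantization expects calib_data to be None")

    kwargs = {"precision": "int8", "approach": approach, "calib_data": calib_data}
    if accelerator is not None:
        kwargs["accelerator"] = accelerator
    return kwargs


dynamic_kwargs = build_quantize_kwargs(approach="dynamic")
# With BigDL-Nano installed, the result could be passed on as:
# q_model = InferenceOptimizer.quantize(model, **dynamic_kwargs)
```

Validating the combination up front gives a clearer error message than a failure in the middle of quantization, but calling InferenceOptimizer.quantize(model, approach='dynamic') directly is equivalent.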

Please refer to the API documentation for more information on InferenceOptimizer.quantize.

You can then run the normal inference steps with the quantized model under the context manager provided by Nano:

[ ]:

import torch

with InferenceOptimizer.get_context(q_model):
    # build a batch of two validation images
    x = torch.stack([val_dataset[0][0], val_dataset[1][0]])
    # use the quantized model here
    y_hat = q_model(x)
    predictions = y_hat.argmax(dim=1)
    print(predictions)
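Whether quantization actually speeds up inference depends on your model and hardware, so it can be worth measuring. The following is a minimal sketch of a timing helper; mean_latency is a hypothetical name, not a BigDL-Nano API, and a stand-in workload is used so that the snippet runs on its own.

```python
import time


def mean_latency(fn, n_warmup=3, n_runs=10):
    """Hypothetical helper: average wall-clock seconds per call of fn()."""
    # Warm-up calls are excluded from timing (caches, lazy initialization).
    for _ in range(n_warmup):
        fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) / n_runs


# Stand-in workload so the sketch is self-contained.
latency = mean_latency(lambda: sum(range(1000)))
print(f"mean latency: {latency * 1e3:.3f} ms")
```

In practice you would pass something like `lambda: q_model(x)` while inside `InferenceOptimizer.get_context(q_model)`, and compare the result with the same measurement on the original model.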

📝 Note

For all models optimized by InferenceOptimizer.quantize, you need to wrap the inference steps in the automatic context manager InferenceOptimizer.get_context(model=...) provided by Nano. For more detailed usage of the context manager, see the context management guide listed under Related Readings.

📚 Related Readings

How to install BigDL-Nano

How to enable automatic context management for PyTorch inference on Nano optimized models