
# Automatic Inference Context Management by get_context

You can use the InferenceOptimizer.get_context(model=...) API to enable automatic context management for PyTorch inference. With only one line of code change, BigDL-Nano automatically provides suitable context management for each accelerated model produced by InferenceOptimizer.trace/quantize/optimize. The returned context usually combines some or all of the following four context managers:

1. torch.inference_mode(True) to disable gradients, which is used for all models. When torch <= 1.12, torch.no_grad() replaces torch.inference_mode(True) for PyTorch mixed precision inference.

2. torch.cpu.amp.autocast(dtype=torch.bfloat16) to run in mixed precision, which is provided for bf16 models.

3. torch.set_num_threads() to control the number of threads, which is used only if you specify thread_num when calling InferenceOptimizer.trace/quantize/optimize.

4. torch.jit.enable_onednn_fusion(True) to enable oneDNN fusion for JIT, which is used when jit is the accelerator.
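To make the list above concrete, here is a minimal sketch of what one might write by hand in plain PyTorch when get_context is not available. The helper name manual_inference_context and its parameters are hypothetical and not part of BigDL-Nano. It covers only the first three items and uses torch.autocast(device_type="cpu", ...) in place of torch.cpu.amp.autocast. Nano's actual implementation may differ.

```python
import contextlib

import torch
from torch import nn


@contextlib.contextmanager
def manual_inference_context(thread_num=None, bf16=False):
    """Hypothetical helper (not a BigDL-Nano API): hand-written inference context."""
    previous_threads = torch.get_num_threads()
    with contextlib.ExitStack() as stack:
        # 1. Disable gradient tracking for everything run inside the block.
        stack.enter_context(torch.inference_mode())
        # 2. Optionally run eligible ops in bfloat16 on CPU.
        if bf16:
            stack.enter_context(
                torch.autocast(device_type="cpu", dtype=torch.bfloat16)
            )
        # 3. Optionally limit the thread count; it is restored on exit.
        if thread_num is not None:
            torch.set_num_threads(thread_num)
        try:
            yield
        finally:
            torch.set_num_threads(previous_threads)


layer = nn.Linear(8, 2)
with manual_inference_context(thread_num=2, bf16=True):
    out = layer(torch.rand(4, 8))
    print(out.dtype, out.requires_grad, torch.get_num_threads())
```

With get_context, this bookkeeping is derived from the accelerated model itself, so the calling code does not need to know which options were used when the model was optimized.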

Here we take a pretrained ResNet18 model as an example.

[ ]:

import torch
from torchvision.models import resnet18

model = resnet18(pretrained=True)


## InferenceOptimizer.trace

For a model accelerated by InferenceOptimizer.trace, usage looks like the code below. Here we take ipex as an example.

[3]:

from bigdl.nano.pytorch import InferenceOptimizer

ipex_model = InferenceOptimizer.trace(model,
                                      use_ipex=True,
                                      thread_num=4)  # matches the thread count asserted below

input_sample = torch.rand(1, 3, 224, 224)

with InferenceOptimizer.get_context(ipex_model):
    output = ipex_model(input_sample)
    assert torch.get_num_threads() == 4  # Nano has applied thread control automatically


## InferenceOptimizer.quantize

For a model accelerated by InferenceOptimizer.quantize, usage looks like the code below. Here we take bf16 + channels_last as an example.

[5]:

from bigdl.nano.pytorch import InferenceOptimizer

bf16_model = InferenceOptimizer.quantize(model,
                                         precision='bf16',
                                         channels_last=True,
                                         thread_num=4)  # matches the thread count asserted below

input_sample = torch.rand(1, 3, 224, 224)

with InferenceOptimizer.get_context(bf16_model):
    output = bf16_model(input_sample)
    assert torch.get_num_threads() == 4  # Nano has applied thread control automatically
    assert output.dtype == torch.bfloat16  # Nano has applied the autocast context manager automatically


## InferenceOptimizer.optimize

Calling optimize() produces a number of accelerated models at the same time. You can then obtain the model you want with InferenceOptimizer.get_model or InferenceOptimizer.get_best_model. Usage looks like the code below. Here we take openvino as an example.

[ ]:

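from bigdl.nano.pytorch import InferenceOptimizer

optimizer = InferenceOptimizer()
# To obtain the latency of a single sample, use a batch size of 1.
# Assumption: the training_data and thread_num arguments were lost from the
# original cell; a single random sample and 4 threads are used here to match
# the assertions in the following cells.
optimizer.optimize(model=model,
                   training_data=torch.rand(1, 3, 224, 224),
                   thread_num=4,
                   latency_sample_num=30)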

[9]:

openvino_model = optimizer.get_model("openvino_fp32")
input_sample = torch.rand(1, 3, 224, 224)

with InferenceOptimizer.get_context(openvino_model):
    output = openvino_model(input_sample)
    assert torch.get_num_threads() == 4  # Nano has applied thread control automatically

[10]:

accelerated_model, option = optimizer.get_best_model()
input_sample = torch.rand(1, 3, 224, 224)

with InferenceOptimizer.get_context(accelerated_model):
    output = accelerated_model(input_sample)
    assert torch.get_num_threads() == 4  # Nano has applied thread control automatically


InferenceOptimizer.get_context(model=...) can also be used with multiple models. If you have a model pipeline, you can get a common context manager by passing all of the models to get_context.

📝 Note

Conflicts between multiple context managers are resolved by the following rules:

1. If two context managers have different precisions (bf16 and non-bf16), we return AutocastContextManager()

2. If only one context manager has thread_num, we set thread_num to that value

3. If two context managers have different thread_num values, we set thread_num to the larger one
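The rules above can be sketched in plain Python as follows. ContextSpec and merge_specs are hypothetical names introduced only for illustration. They are not BigDL-Nano classes or functions, and the real merging logic may differ in details that the rules do not cover, for example when both models are bf16.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContextSpec:
    """Hypothetical summary of one model's context (not a BigDL-Nano class)."""
    bf16: bool = False
    thread_num: Optional[int] = None


def merge_specs(a: ContextSpec, b: ContextSpec) -> ContextSpec:
    # Rule 1: if the precisions differ (bf16 vs non-bf16), the merged
    # context uses autocast. Assumption: two bf16 specs also stay bf16.
    bf16 = a.bf16 or b.bf16
    # Rule 2: if only one spec sets thread_num, use that value.
    # Rule 3: if both set it, use the larger value.
    candidates = [t for t in (a.thread_num, b.thread_num) if t is not None]
    thread_num = max(candidates) if candidates else None
    return ContextSpec(bf16=bf16, thread_num=thread_num)


backbone = ContextSpec(bf16=True, thread_num=2)
head = ContextSpec(bf16=False, thread_num=4)
print(merge_specs(backbone, head))  # ContextSpec(bf16=True, thread_num=4)
```

Under this reading, a plain nn.Module with no thread setting never overrides the settings of an accelerated model it is combined with.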

Here is a simple example that shows the usage for a pipeline:

[17]:

from torch import nn

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1000, 1)

    def forward(self, x):
        return self.linear(x)

classifier = Classifier()

with InferenceOptimizer.get_context(ipex_model, classifier):
    # a pipeline that consists of a backbone and a classifier
    x = ipex_model(input_sample)
    output = classifier(x)
    assert torch.get_num_threads() == 4  # Nano has applied thread control automatically