View the runnable example on GitHub

Automatic Inference Context Management by get_context#

You can use the InferenceOptimizer.get_context(model=...) API to enable automatic context management for PyTorch inference. With only one line of code change, BigDL-Nano will automatically provide a suitable context manager for each model accelerated by InferenceOptimizer.trace/quantize/optimize. It usually contains part or all of the following four types of context managers (a rough sketch of how they combine follows the list):

  1. torch.inference_mode(True) to disable gradients, which will be used for all models. For torch <= 1.12, torch.no_grad() will be used for PyTorch mixed precision inference as a replacement for torch.inference_mode(True)

  2. torch.cpu.amp.autocast(dtype=torch.bfloat16) to run in mixed precision, which will be provided for bf16-related models

  3. torch.set_num_threads() to control the number of threads, which will be used only if you specify thread_num when applying InferenceOptimizer.trace/quantize/optimize

  4. torch.jit.enable_onednn_fusion(True) to support oneDNN fusion for jit when using jit as the accelerator
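
For intuition, the combined context manager returned by get_context behaves roughly like nesting these standard PyTorch contexts by hand. The cell below is only an illustration (manual_inference_context is a hypothetical helper, not a BigDL-Nano API), assuming a bf16 model accelerated with jit and thread_num=4; the exact combination depends on how your model was accelerated.

[ ]:
import contextlib
import torch

@contextlib.contextmanager
def manual_inference_context(thread_num=4):
    # Hypothetical helper for illustration only -- roughly what get_context
    # assembles for a bf16 + jit model with thread_num=4.
    torch.set_num_threads(thread_num)                       # 3. thread control
    torch.jit.enable_onednn_fusion(True)                    # 4. oneDNN fusion for jit
    with torch.inference_mode(True):                        # 1. disable gradients
        with torch.cpu.amp.autocast(dtype=torch.bfloat16):  # 2. bf16 autocast
            yield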

Here we take a pretrained ResNet18 model as an example.

[ ]:
import torch
from torchvision.models import resnet18

model = resnet18(pretrained=True)

InferenceOptimizer.trace#

For a model accelerated by InferenceOptimizer.trace, the usage now looks like the code below; here we just take ipex as an example.

[3]:
from bigdl.nano.pytorch import InferenceOptimizer
ipex_model = InferenceOptimizer.trace(model,
                                      use_ipex=True,
                                      thread_num=4)
input_sample = torch.rand(1, 3, 224, 224)

with InferenceOptimizer.get_context(ipex_model):
    output = ipex_model(input_sample)
    assert torch.get_num_threads() == 4  # this line just to let you know Nano has provided thread control automatically : )
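
Besides the thread count, you can also check inside the context that gradient tracking has been disabled (item 1 in the list above). The cell below is an extra sanity check added for illustration; it is not part of the original example.

[ ]:
with InferenceOptimizer.get_context(ipex_model):
    output = ipex_model(input_sample)
    # inference_mode / no_grad is active inside the context, so autograd is off
    assert not torch.is_grad_enabled()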

InferenceOptimizer.quantize#

For a model accelerated by InferenceOptimizer.quantize, the usage now looks like the code below; here we just take bf16 + channels_last as an example.

[5]:
from bigdl.nano.pytorch import InferenceOptimizer
bf16_model = InferenceOptimizer.quantize(model,
                                         precision='bf16',
                                         channels_last=True,
                                         thread_num=4)
input_sample = torch.rand(1, 3, 224, 224)

with InferenceOptimizer.get_context(bf16_model):
    output = bf16_model(input_sample)
    assert torch.get_num_threads() == 4  # this line just to let you know Nano has provided thread control automatically : )
    assert output.dtype == torch.bfloat16  # this line just to let you know Nano has provided autocast context manager automatically : )

InferenceOptimizer.optimize#

By calling optimize(), you will get a bunch of accelerated models at the same time; you can then obtain the model you want through InferenceOptimizer.get_model or InferenceOptimizer.get_best_model. The usage looks like the code below; here we just take openvino as an example.

[ ]:
# To obtain the latency of a single sample, set batch_size=1
from torch.utils.data import DataLoader

train_dataloader = DataLoader(train_dataset, batch_size=1)  # train_dataset is defined in the full runnable example
val_dataloader = DataLoader(val_dataset)

from bigdl.nano.pytorch import InferenceOptimizer
optimizer = InferenceOptimizer()
optimizer.optimize(model=model,
                   training_data=train_dataloader,
                   thread_num=4,
                   latency_sample_num=30)
[9]:
openvino_model = optimizer.get_model("openvino_fp32")
input_sample = torch.rand(1, 3, 224, 224)

with InferenceOptimizer.get_context(openvino_model):
    output = openvino_model(input_sample)
    assert torch.get_num_threads() == 4  # this line just to let you know Nano has provided thread control automatically : )
[10]:
accelerated_model, option = optimizer.get_best_model()
input_sample = torch.rand(1, 3, 224, 224)

with InferenceOptimizer.get_context(accelerated_model):
    output = accelerated_model(input_sample)
    assert torch.get_num_threads() == 4  # this line just to let you know Nano has provided thread control automatically : )

Advanced Usage: Multiple Models#

InferenceOptimizer.get_context(model=...) can also be used with multiple models. If you have a model pipeline, you can get a common context manager by passing multiple models to get_context.

📝 Note

Here are the rules we use to resolve conflicts between multiple context managers:

  1. If two context managers have different precisions (bf16 and non-bf16), we will return AutocastContextManager()

  2. If only one context manager has thread_num, we will set thread_num to that value

  3. If two context managers have different thread_num values, we will set thread_num to the larger one

Here is a simple example just to illustrate the usage for a pipeline:

[17]:
from torch import nn

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1000, 1)

    def forward(self, x):
        return self.linear(x)

classifier = Classifier()

with InferenceOptimizer.get_context(ipex_model, classifier):
    # a pipeline consisting of a backbone and a classifier
    x = ipex_model(input_sample)
    output = classifier(x)
    assert torch.get_num_threads() == 4  # this line just to let you know Nano has provided thread control automatically : )
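
To see rule 1 in action, you can also combine the bf16 model from the quantize example above with the same classifier. Since only one of the two carries bf16 precision, the common context keeps the autocast manager, so the backbone output stays in bfloat16. The cell below is a sketch added for illustration, assuming bf16_model and classifier from the earlier cells are still available.

[ ]:
with InferenceOptimizer.get_context(bf16_model, classifier):
    x = bf16_model(input_sample)
    output = classifier(x)
    assert torch.get_num_threads() == 4   # thread_num is still controlled (rules 2 and 3)
    assert x.dtype == torch.bfloat16      # autocast is kept for the mixed pipeline (rule 1)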

📚 Related Readings