View the runnable example on GitHub
Automatic Inference Context Management by get_context#
You can use the InferenceOptimizer.get_context(model=...) API to enable automatic context management for PyTorch inference. With only one line of code change, BigDL-Nano will automatically provide a suitable context manager for each model accelerated by InferenceOptimizer.trace/quantize/optimize. It usually contains some or all of the following four types of context managers (a rough hand-written equivalent is sketched after this list):
- torch.inference_mode(True) to disable gradients, which will be used for all models. For torch <= 1.12, torch.no_grad() will be used for PyTorch mixed precision inference as a replacement for torch.inference_mode(True)
- torch.cpu.amp.autocast(dtype=torch.bfloat16) to run in mixed precision, which will be provided for bf16-related models
- torch.set_num_threads() to control the thread number, which will be used only if you specify thread_num when applying InferenceOptimizer.trace/quantize/optimize
- torch.jit.enable_onednn_fusion(True) to enable oneDNN fusion for JIT, which will be used only when jit is the accelerator
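To make this concrete, below is a rough, hand-written equivalent of what the combined context may look like for a bf16 model created with thread_num=4. This is only an illustrative sketch, not Nano's actual implementation: the real composition is chosen by BigDL-Nano based on how the model was accelerated, and toy_model here is just a stand-in module.

import torch
from torch import nn

# toy_model stands in for any accelerated model in this sketch
toy_model = nn.Linear(8, 2)
x = torch.rand(1, 8)

torch.set_num_threads(4)                                # thread control
with torch.inference_mode(True):                        # disable gradients
    with torch.cpu.amp.autocast(dtype=torch.bfloat16):  # bf16 mixed precision
        y = toy_model(x)
assert y.dtype == torch.bfloat16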
Here we take a pretrained ResNet18 model as an example.
[ ]:
import torch
from torchvision.models import resnet18
model = resnet18(pretrained=True)
InferenceOptimizer.trace#
For a model accelerated by InferenceOptimizer.trace, usage now looks like the code below. Here we just take ipex for example.
[3]:
from bigdl.nano.pytorch import InferenceOptimizer
ipex_model = InferenceOptimizer.trace(model,
                                      use_ipex=True,
                                      thread_num=4)

input_sample = torch.rand(1, 3, 224, 224)

with InferenceOptimizer.get_context(ipex_model):
    output = ipex_model(input_sample)
    assert torch.get_num_threads() == 4  # this line just to let you know Nano has provided thread control automatically : )
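The fourth context type listed above (torch.jit.enable_onednn_fusion(True)) only shows up for JIT-accelerated models. As a sketch, assuming InferenceOptimizer.trace also accepts accelerator='jit' together with an input_sample (as in the BigDL-Nano API reference), the usage pattern with get_context stays exactly the same:

# Sketch only: accelerator='jit' and input_sample are assumed here,
# following the BigDL-Nano trace API rather than this example's cells
jit_model = InferenceOptimizer.trace(model,
                                     accelerator='jit',
                                     input_sample=torch.rand(1, 3, 224, 224),
                                     thread_num=4)

with InferenceOptimizer.get_context(jit_model):
    output = jit_model(torch.rand(1, 3, 224, 224))
    assert torch.get_num_threads() == 4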
InferenceOptimizer.quantize#
For a model accelerated by InferenceOptimizer.quantize, usage now looks like the code below. Here we just take bf16 + channels_last for example.
[5]:
from bigdl.nano.pytorch import InferenceOptimizer
bf16_model = InferenceOptimizer.quantize(model,
                                         precision='bf16',
                                         channels_last=True,
                                         thread_num=4)

input_sample = torch.rand(1, 3, 224, 224)

with InferenceOptimizer.get_context(bf16_model):
    output = bf16_model(input_sample)
    assert torch.get_num_threads() == 4  # this line just to let you know Nano has provided thread control automatically : )
    assert output.dtype == torch.bfloat16  # this line just to let you know Nano has provided autocast context manager automatically : )
InferenceOptimizer.optimize#
By calling optimize(), you will get a bunch of accelerated models at the same time; you can then obtain the model you want with InferenceOptimizer.get_model or InferenceOptimizer.get_best_model. Usage looks like the code below. Here we just take openvino for example.
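The cell below assumes that train_dataset and val_dataset are already defined, as they are in the full runnable example. If you want to follow along standalone, one possible (purely illustrative) way to prepare them is with a stand-in dataset such as torchvision's FakeData:

from torchvision import datasets, transforms

# FakeData is only a stand-in here; any (image, label) dataset that yields
# 3 x 224 x 224 tensors works for the ResNet18 example
transform = transforms.ToTensor()
train_dataset = datasets.FakeData(size=100, image_size=(3, 224, 224),
                                  num_classes=10, transform=transform)
val_dataset = datasets.FakeData(size=20, image_size=(3, 224, 224),
                                num_classes=10, transform=transform)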
[ ]:
from torch.utils.data import DataLoader

# To obtain the latency of a single sample, set batch_size=1
train_dataloader = DataLoader(train_dataset, batch_size=1)
val_dataloader = DataLoader(val_dataset)

from bigdl.nano.pytorch import InferenceOptimizer

optimizer = InferenceOptimizer()
optimizer.optimize(model=model,
                   training_data=train_dataloader,
                   thread_num=4,
                   latency_sample_num=30)
[9]:
openvino_model = optimizer.get_model("openvino_fp32")

input_sample = torch.rand(1, 3, 224, 224)

with InferenceOptimizer.get_context(openvino_model):
    output = openvino_model(input_sample)
    assert torch.get_num_threads() == 4  # this line just to let you know Nano has provided thread control automatically : )
[10]:
accelerated_model, option = optimizer.get_best_model()

input_sample = torch.rand(1, 3, 224, 224)

with InferenceOptimizer.get_context(accelerated_model):
    output = accelerated_model(input_sample)
    assert torch.get_num_threads() == 4  # this line just to let you know Nano has provided thread control automatically : )
Advanced Usage: Multiple Models#
InferenceOptimizer.get_context(model=...) can also be used with multiple models. If you have a model pipeline, you can get a common context manager by passing multiple models to get_context.
📝 Note
Here are the rules we use to resolve conflicts between multiple context managers (a short sketch illustrating them follows):
- If two context managers have different precisions (bf16 and non-bf16), we will return an AutocastContextManager()
- If only one context manager has thread_num, we will set thread_num to that value
- If two context managers have different thread_num, we will set thread_num to the larger one
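As a minimal illustration of the precision rule, the sketch below combines the earlier bf16_model (autocast) with the non-bf16 ipex_model. Assuming the rules above apply as stated, the merged context keeps bf16 autocast and the shared thread_num of 4; this is only a sketch of the expected behaviour, not additional API.

with InferenceOptimizer.get_context(bf16_model, ipex_model):
    # different precisions -> autocast is kept; both models used thread_num=4
    output = bf16_model(input_sample)
    assert output.dtype == torch.bfloat16
    assert torch.get_num_threads() == 4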
Here is a simple example just to explain the usage for a pipeline:
[17]:
from torch import nn

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1000, 1)

    def forward(self, x):
        return self.linear(x)

classifier = Classifier()

with InferenceOptimizer.get_context(ipex_model, classifier):
    # a pipeline consisting of a backbone and a classifier
    x = ipex_model(input_sample)
    output = classifier(x)
    assert torch.get_num_threads() == 4  # this line just to let you know Nano has provided thread control automatically : )
📚 Related Readings