View the runnable example on GitHub

Use BFloat16 Mixed Precision for PyTorch Training#

Brain Floating Point Format (BFloat16) is a custom 16-bit floating point format designed for machine learning. BFloat16 consists of 1 sign bit, 8 exponent bits, and 7 mantissa bits. Since it keeps the same number of exponent bits, BFloat16 covers the same dynamic range as FP32 while using only half the memory.
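
To see the dynamic-range claim concretely, you can compare the numeric limits of the two dtypes in plain PyTorch (nothing BigDL-Nano specific here): both cover roughly the same range, but BFloat16 has a much coarser machine epsilon.

[ ]:
import torch

# BFloat16 and FP32 share the same exponent width, so their max values are
# almost identical (~3.4e38); the difference is in precision (eps).
print(torch.finfo(torch.bfloat16))  # max ≈ 3.39e38, eps ≈ 7.8e-03
print(torch.finfo(torch.float32))   # max ≈ 3.40e38, eps ≈ 1.2e-07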

BFloat16 Mixed Precision combines BFloat16 and FP32 during training, which can improve performance and reduce memory usage. Compared to FP16 mixed precision, BFloat16 mixed precision has better numerical stability.
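
To illustrate what "mixed precision" means here, below is a minimal plain-PyTorch sketch (this is not the BigDL-Nano API; TorchNano applies the equivalent machinery for you): operations inside an autocast region run in BFloat16 where it is safe, while the model parameters stay in FP32.

[ ]:
import torch

model = torch.nn.Linear(16, 4)  # parameters are kept in FP32
x = torch.randn(8, 16)

# eligible ops inside the autocast region run in BFloat16 (requires torch >= 1.12 on CPU)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    y = model(x)

print(model.weight.dtype)  # torch.float32
print(y.dtype)             # torch.bfloat16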

By using TorchNano (bigdl.nano.pytorch.TorchNano), you need only a few code changes to use BFloat16 mixed precision for training. Here we provide two ways to achieve this: A) subclass TorchNano or B) use the @nano decorator. You can choose whichever fits your (preferred) code structure.

📝 Note

Before starting your PyTorch application, it is highly recommended to run source bigdl-nano-init to set several environment variables based on your current hardware. Empirically, these variables greatly improve performance for most PyTorch training workloads.

⚠️ Warning

Using BFloat16 precision with torch < 1.12 may result in extremely slow training.

A) Subclass TorchNano#

In general, two steps are required if you choose to subclass TorchNano:

  1. import and subclass TorchNano, and override its train() method

  2. instantiate it with precision='bf16', then call the train() method

For step 1, you can refer to this page (for consistency, we use the same model and dataset as an example). Supposing that you already have a well-defined subclass MyNano, the line below instantiates it with BFloat16 mixed precision enabled and trains your model.

[ ]:
MyNano(precision='bf16').train()

The detailed definition of MyNano can be found in the runnable example.
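
For quick reference, a condensed sketch of what such a subclass could look like is shown below; the model, dataset, and loss are simple placeholders rather than the ones used in the runnable example, and the TorchNano-specific parts are the self.setup(...) call and self.backward(loss):

[ ]:
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from bigdl.nano.pytorch import TorchNano

class MyNano(TorchNano):
    def train(self):
        # placeholder model, optimizer, data and loss; the runnable example
        # defines its own here
        model = nn.Linear(32, 2)
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        train_loader = DataLoader(
            TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))),
            batch_size=8)
        loss_func = nn.CrossEntropyLoss()

        # let TorchNano wrap the model, optimizer and dataloader
        model, optimizer, train_loader = self.setup(model, optimizer, train_loader)

        model.train()
        for data, target in train_loader:
            optimizer.zero_grad()
            loss = loss_func(model(data), target)
            self.backward(loss)  # use self.backward(...) instead of loss.backward()
            optimizer.step()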

However, using BF16 precision on a CPU without native BF16 instruction support may hurt training efficiency. You can set use_ipex=True and precision='bf16' simultaneously to enable IPEX (Intel® Extension for PyTorch*), which adopts AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and other optimizations for BFloat16 mixed precision training to gain more acceleration:

[ ]:
MyNano(use_ipex=True, precision='bf16').train()

B) Use @nano decorator#

The @nano decorator is very convenient: if you have already defined a PyTorch training function with a model, optimizers, and dataloaders as parameters, you only need to add two new lines (import the decorator and wrap the training function) to enjoy the features brought by BigDL-Nano. You can learn its usage and notes from here. The only difference when using BFloat16 mixed precision for training is that you should specify the decorator as @nano(precision='bf16').

[ ]:
from tqdm import tqdm
from bigdl.nano.pytorch import nano  # import the nano decorator

@nano(precision='bf16')  # apply the decorator to the training loop
def training_loop(model, optimizer, train_loader, num_epochs, loss_func):

    for epoch in range(num_epochs):

        model.train()
        train_loss, num = 0, 0
        with tqdm(train_loader, unit="batch") as tepoch:
            for data, target in tepoch:
                tepoch.set_description(f"Epoch {epoch}")
                optimizer.zero_grad()
                output = model(data)
                loss = loss_func(output, target)
                loss.backward()
                optimizer.step()
                # detach to a Python float so the autograd graph
                # is not kept alive across iterations
                loss_value = loss.sum().item()
                train_loss += loss_value
                num += 1
                tepoch.set_postfix(loss=loss_value)
            print(f'Train Epoch: {epoch}, avg_loss: {train_loss / num}')

A runnable example including this training_loop can be found here.
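
To run the decorated function, just pass in your own model, optimizer, dataloader, and loss function; with hypothetical placeholders standing in for those defined in the example, an invocation could look like this:

[ ]:
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# placeholder model and data, standing in for those defined in the example
model = nn.Linear(32, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
train_loader = DataLoader(
    TensorDataset(torch.randn(64, 32), torch.randint(0, 2, (64,))),
    batch_size=8)

training_loop(model, optimizer, train_loader, num_epochs=2,
              loss_func=nn.CrossEntropyLoss())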

However, using BF16 precision on a CPU without native BF16 instruction support may hurt training efficiency. You can set use_ipex=True and precision='bf16' simultaneously to enable IPEX (Intel® Extension for PyTorch*), which adopts AVX-512 Vector Neural Network Instructions (AVX512 VNNI) and other optimizations for BFloat16 mixed precision training to gain more acceleration.
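
That is, keeping the same training_loop body as above, you only need to change the decorator line:

[ ]:
@nano(use_ipex=True, precision='bf16')  # enable IPEX together with BFloat16 mixed precision
def training_loop(model, optimizer, train_loader, num_epochs, loss_func):
    ...  # same training loop body as shown above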