Accelerate PyTorch Training using Multiple Instances

TorchNano (bigdl.nano.pytorch.TorchNano) supports multi-instance training, which can make full use of hardware with multiple CPU cores or sockets (especially when the number of cores is large). There are 2 ways to achieve this: A) subclass TorchNano or B) use the @nano decorator. Choose whichever fits your (preferred) code structure.

📝 Note

Before starting your PyTorch application, it is highly recommended to run source bigdl-nano-init to set several environment variables based on your current hardware. Empirically, these variables greatly improve training performance for most PyTorch applications.

A) Subclass `TorchNano`

In general, two steps are required if you choose to subclass TorchNano:

1. import and subclass TorchNano, and override its train() method
2. instantiate it with num_processes set, then call the train() method

For step 1, you can refer to this page (for consistency, we use the same model and dataset as an example). Assuming you have already defined a subclass MyNano, the line below instantiates it and trains your model with 2 processes.

MyNano(num_processes=2).train()

The detailed definition of MyNano can be found in the runnable example.

📝 Note

When num_processes is set, CPU cores are automatically and evenly distributed among the specified number of processes to avoid conflicts and maximize training throughput. To choose the CPU cores used by each process yourself, set cpu_for_each_process to a list of length num_processes, in which each item is a list of CPU indices.
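To make the expected shape of cpu_for_each_process concrete, here is a minimal sketch that builds such a list by splitting a set of core indices into near-equal contiguous groups. The helper name split_cores is hypothetical and is not part of BigDL-Nano; it only illustrates the "list of length num_processes, each item a list of CPU indices" structure described above, not how Nano distributes cores internally.

```python
def split_cores(cpu_ids, num_processes):
    """Split cpu_ids into num_processes contiguous, near-equal groups.

    Hypothetical helper for illustration only (not a BigDL-Nano API).
    """
    cpu_ids = list(cpu_ids)
    if num_processes < 1 or num_processes > len(cpu_ids):
        raise ValueError("num_processes must be between 1 and the number of cores")
    base, extra = divmod(len(cpu_ids), num_processes)
    groups, start = [], 0
    for rank in range(num_processes):
        # The first `extra` processes receive one additional core.
        size = base + (1 if rank < extra else 0)
        groups.append(cpu_ids[start:start + size])
        start += size
    return groups


# 8 cores shared by 2 processes
cpu_for_each_process = split_cores(range(8), 2)
print(cpu_for_each_process)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

A list like this could then be supplied as cpu_for_each_process together with num_processes=2.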

Currently, ‘subprocess’ (default), ‘spawn’ and ‘ray’ are supported as distributed_backend for TorchNano.

Also note that with data-parallel training, the effective batch size becomes num_processes times larger. To compensate and achieve the same effect as single-instance training, a learning rate warm-up strategy gradually increases the learning rate to num_processes times its original value. Nano enables this strategy by default through auto_lr=True.
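The warm-up idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the ramp is taken to be linear and the warmup_steps parameter is hypothetical, since this page does not state the exact schedule that auto_lr=True applies.

```python
def warmup_lr(base_lr, num_processes, step, warmup_steps):
    """Ramp linearly from base_lr to base_lr * num_processes.

    Illustrative only: the linear shape and warmup_steps are assumptions,
    not the documented behavior of auto_lr.
    """
    if warmup_steps <= 0:
        return base_lr * num_processes
    # Fraction of the warm-up completed, capped at 1.0 once it is over.
    progress = min(step, warmup_steps) / warmup_steps
    return base_lr * (1 + (num_processes - 1) * progress)


# Learning rate at the start, midway, at the end, and after the warm-up
for step in (0, 50, 100, 200):
    print(step, warmup_lr(0.01, num_processes=2, step=step, warmup_steps=100))
```

With 2 processes the rate starts at the single-instance value and settles at twice that value once the warm-up is complete.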

B) Use `@nano` decorator

The @nano decorator is convenient: if you have already defined a PyTorch training function that takes a model, optimizers, and dataloaders as parameters, you only need to add 2 new lines (import the decorator and wrap the training function) to enjoy the features brought by BigDL-Nano. You can learn its usage and caveats from here. The only difference for multi-instance training is that you specify the decorator as @nano(num_processes=n), where n is the desired number of processes.

from tqdm import tqdm
from bigdl.nano.pytorch import nano # import nano decorator

@nano(num_processes=2) # apply the decorator to the training loop
def training_loop(model, optimizer, train_loader, num_epochs, loss_func):

    for epoch in range(num_epochs):

        model.train()
        train_loss, num = 0, 0
        with tqdm(train_loader, unit="batch") as tepoch:
            for data, target in tepoch:
                tepoch.set_description(f"Epoch {epoch}")
                optimizer.zero_grad()
                output = model(data)
                loss = loss_func(output, target)
                loss.backward()
                optimizer.step()
                loss_value = loss.item()  # convert the scalar loss to a Python float for logging
                train_loss += loss_value
                num += 1
                tepoch.set_postfix(loss=loss_value)
            print(f'Train Epoch: {epoch}, avg_loss: {train_loss / num}')

A runnable example that includes this training_loop can be found here.

📝 Note

When num_processes is set, CPU cores are automatically and evenly distributed among the specified number of processes to avoid conflicts and maximize training throughput. To choose the CPU cores used by each process yourself, set cpu_for_each_process to a list of length num_processes, in which each item is a list of CPU indices.

Currently, ‘subprocess’ (default) and ‘ray’ are supported as distributed_backend for the @nano decorator (‘spawn’ is not supported by @nano).

Also note that with data-parallel training, the effective batch size becomes num_processes times larger. To compensate and achieve the same effect as single-instance training, a learning rate warm-up strategy gradually increases the learning rate to num_processes times its original value. Nano enables this strategy by default through auto_lr=True.
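To see why the batch size effectively grows, the sketch below simulates how a dataset might be sharded across processes. It assumes a strided split by rank, similar to what PyTorch's DistributedSampler does when shuffling is off; how Nano actually partitions the data is not stated on this page, and the helper names are hypothetical.

```python
def shard_indices(num_samples, num_processes, rank):
    """Sample indices seen by one process under a strided split (assumed)."""
    return list(range(rank, num_samples, num_processes))


def first_batches(num_samples, num_processes, batch_size):
    """First batch drawn by every process in a single optimization step."""
    return [
        shard_indices(num_samples, num_processes, rank)[:batch_size]
        for rank in range(num_processes)
    ]


batches = first_batches(num_samples=16, num_processes=2, batch_size=4)
print(batches)  # [[0, 2, 4, 6], [1, 3, 5, 7]]

# Each process handles 4 samples, so one step consumes 2 * 4 = 8 samples.
print(sum(len(batch) for batch in batches))  # 8
```

Each process still uses the batch size configured in its dataloader, but one synchronized step covers num_processes such batches.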

📚 Related Readings

How to install BigDL-Nano

How to convert your PyTorch training loop to use TorchNano for acceleration

How to accelerate your PyTorch training loop with @nano decorator

How to choose the number of processes for multi-instance training
