View the runnable example on GitHub

Accelerate PyTorch Lightning Training using Multiple Instances

The bigdl.nano.pytorch.Trainer API extends the PyTorch Lightning Trainer with multiple integrated optimizations. You can instantiate a BigDL-Nano Trainer with num_processes specified to benefit from multi-instance training on a server with multiple CPU cores or sockets, so that the workload can make full use of all available CPU cores.

📝 Note

Before starting your PyTorch Lightning application, it is highly recommended to run source bigdl-nano-init to set several environment variables based on your current hardware. Empirically, these variables bring a significant performance improvement for most PyTorch Lightning training workloads.

Let’s take a self-defined LightningModule (based on a ResNet-18 model pretrained on the ImageNet dataset) and dataloaders for finetuning the model on the OxfordIIITPet dataset as an example:

[ ]:
model = MyLightningModule()
train_loader, val_loader = create_dataloaders()

      The definition of MyLightningModule and create_dataloaders can be found in the runnable example.
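For reference, below is a minimal sketch of what such a LightningModule and dataloader factory could look like. It is not the exact code from the runnable example; the torchvision dataset splits, transforms, image size, batch size and learning rate here are assumptions made purely for illustration:

import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import transforms
from torchvision.datasets import OxfordIIITPet
from torchvision.models import resnet18
import pytorch_lightning as pl


class MyLightningModule(pl.LightningModule):
    def __init__(self, num_classes=37, lr=1e-3):
        super().__init__()
        # Load a ResNet-18 pretrained on ImageNet and replace its
        # classification head with one for the 37 pet breeds
        self.model = resnet18(weights="IMAGENET1K_V1")
        self.model.fc = nn.Linear(self.model.fc.in_features, num_classes)
        self.criterion = nn.CrossEntropyLoss()
        self.lr = lr

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = self.criterion(self(x), y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        loss = self.criterion(self(x), y)
        self.log("val_loss", loss)

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=self.lr, momentum=0.9)


def create_dataloaders(data_dir="data", batch_size=32):
    # Resize images so they match the input size expected by ResNet-18
    transform = transforms.Compose([
        transforms.Resize((224, 224)),
        transforms.ToTensor(),
    ])
    train_set = OxfordIIITPet(root=data_dir, split="trainval",
                              transform=transform, download=True)
    val_set = OxfordIIITPet(root=data_dir, split="test",
                            transform=transform, download=True)
    train_loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=batch_size)
    return train_loader, val_loader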

To enable multi-instance training, you could simply import the BigDL-Nano Trainer and set num_processes to an integer larger than 1:

[ ]:
from bigdl.nano.pytorch import Trainer

trainer = Trainer(max_epochs=5, num_processes=2)

📝 Note

By setting num_processes, Nano will launch the specified number of processes to perform data-parallel training. By default, CPU cores will be automatically and evenly distributed among processes to avoid conflicts and maximize training throughput. If you would like to specify the CPU cores used by each process, you could set cpu_for_each_process to a list of length num_processes, in which each item is a list of CPU indices.
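For instance, on a machine with at least 8 physical cores, an explicit pinning could look like the sketch below (the core indices are chosen purely for illustration, not as a recommendation):

from bigdl.nano.pytorch import Trainer

# Launch 2 processes and pin each one to 4 specific CPU cores
trainer = Trainer(max_epochs=5,
                  num_processes=2,
                  cpu_for_each_process=[[0, 1, 2, 3], [4, 5, 6, 7]])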

During multi-instance training, the effective batch size is the batch_size (in dataloader) \(\times\) num_processes, which reduces the number of iterations in each epoch by a factor of num_processes. A common practice to compensate for this is to gradually increase the learning rate to num_processes times its original value. You could find more details of this trick in this paper published by Facebook.
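As a rough sketch of this linear scaling rule (the base learning rate value and the lr argument of MyLightningModule are illustrative assumptions; adapt them to your own module):

num_processes = 2
base_lr = 0.01  # learning rate tuned for single-process training (assumed value)

# Linear scaling rule: multiply the learning rate by num_processes to
# compensate for the num_processes-times larger effective batch size.
# The referenced paper also combines this with a gradual warmup from
# base_lr to the scaled value over the first few epochs.
scaled_lr = base_lr * num_processes

model = MyLightningModule(lr=scaled_lr)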

You could then perform multi-instance training and evaluation as normal:

[ ]:
trainer.fit(model, train_dataloaders=train_loader)
trainer.validate(model, dataloaders=val_loader)

📚 Related Readings