View the runnable example on GitHub

Accelerate TensorFlow Keras Training using Multiple Instances

BigDL-Nano provides its own Model and Sequential classes, which extend tf.keras.Model and tf.keras.Sequential respectively with various optimizations. To use multi-instance training on a server with multiple CPU cores or sockets, replace tf.keras.Model/Sequential in your code with the BigDL-Nano counterparts, and call fit with num_processes specified.

📝 Note

Before starting your TensorFlow Keras application, it is highly recommended to run source bigdl-nano-init, which sets several environment variables based on your current hardware. Empirically, these variables bring a significant performance increase for most TensorFlow Keras training workloads.
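As a quick sanity check after sourcing the script, a sketch like the following can report which threading-related variables are visible to the Python process. The variable names listed here are an assumption (commonly tuned OpenMP/TensorFlow settings), not a confirmed list of what bigdl-nano-init exports, and report_env is a hypothetical helper.

```python
import os

# Assumption: these are commonly tuned OpenMP/TensorFlow variables; the exact
# set exported by bigdl-nano-init may differ on your machine.
CANDIDATE_VARS = (
    "OMP_NUM_THREADS",
    "KMP_AFFINITY",
    "KMP_BLOCKTIME",
    "TF_ENABLE_ONEDNN_OPTS",
)


def report_env(environ=os.environ, names=CANDIDATE_VARS):
    """Return {name: value or None} for each candidate variable."""
    return {name: environ.get(name) for name in names}


if __name__ == "__main__":
    for name, value in report_env().items():
        shown = value if value is not None else "<not set>"
        print(f"{name} = {shown}")
```

If every variable shows up as not set, the script was probably sourced in a different shell session than the one running Python.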

First, import Model or Sequential from BigDL-Nano instead of tf.keras. Let’s take the Model class here as an example:

[ ]:
# from tf.keras import Model
from bigdl.nano.tf.keras import Model

Suppose we would like to train a ResNet50 model (pretrained on the ImageNet dataset) on the imagenette dataset. We need to create the corresponding train/test datasets and define the model:

[ ]:
# create train/test datasets
train_ds, test_ds, ds_info = create_datasets(img_size=224, batch_size=32)

# Model creation steps are the same as using tf.keras.Model
inputs, outputs = define_model_inputs_outputs(num_classes=ds_info.features['label'].num_classes,
                                              img_size=224)

model = Model(inputs=inputs, outputs=outputs)
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=['accuracy'])

The definitions of create_datasets and define_model_inputs_outputs can be found in the runnable example.

You can then call the fit method with num_processes set to an integer larger than 1 to launch the specified number of processes for data-parallel training:

[ ]:
model.fit(train_ds,
          steps_per_epoch=(ds_info.splits['train'].num_examples // 32),
          num_processes=2)  # illustrative value; use any integer larger than 1

📝 Note

When num_processes is set, CPU cores are automatically and evenly distributed among the processes to avoid conflicts and maximize training throughput.
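To make "evenly distributed" concrete, here is a minimal sketch of one way to partition core IDs into contiguous groups, one group per process. split_cores is a hypothetical helper written for illustration only; it is not BigDL-Nano's actual allocation logic, which may also account for sockets and hyper-threading.

```python
def split_cores(num_cores, num_processes):
    """Partition core IDs 0..num_cores-1 into contiguous, near-equal groups."""
    if num_processes < 1 or num_processes > num_cores:
        raise ValueError("num_processes must be between 1 and num_cores")
    # Every group gets `base` cores; the first `extra` groups get one more.
    base, extra = divmod(num_cores, num_processes)
    groups, start = [], 0
    for rank in range(num_processes):
        size = base + (1 if rank < extra else 0)
        groups.append(list(range(start, start + size)))
        start += size
    return groups


if __name__ == "__main__":
    for rank, cores in enumerate(split_cores(num_cores=8, num_processes=2)):
        print(f"process {rank}: cores {cores}")
```

Because the groups do not overlap, no two processes compete for the same core, which is the conflict the note refers to.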

During Nano TensorFlow Keras multi-instance training, the effective batch size is still the batch_size specified in the datasets (32 in this example). This is because we chose to match the semantics of TensorFlow distributed training (MultiWorkerMirroredStrategy), which splits each batch into sub-batches for the different workers.
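The arithmetic behind this can be sketched as follows, assuming the batch divides evenly among processes. Both helper functions and the example count of 9000 training samples are hypothetical, chosen only to illustrate the calculation.

```python
def per_process_batch(global_batch_size, num_processes):
    """Sub-batch size each process sees when the global batch is split evenly."""
    if global_batch_size % num_processes != 0:
        raise ValueError("this sketch assumes the batch divides evenly")
    return global_batch_size // num_processes


def steps_per_epoch(num_examples, global_batch_size):
    """Steps per epoch depend only on the global (effective) batch size."""
    return num_examples // global_batch_size


if __name__ == "__main__":
    batch_size = 32      # batch_size used when creating the datasets
    num_examples = 9000  # hypothetical training-set size
    for num_processes in (1, 2, 4):
        sub_batch = per_process_batch(batch_size, num_processes)
        steps = steps_per_epoch(num_examples, batch_size)
        print(f"num_processes={num_processes}: "
              f"sub-batch={sub_batch}, steps_per_epoch={steps}")
```

Since the effective batch size does not change with num_processes, steps_per_epoch stays at num_examples // 32 however many processes are launched; only the per-process sub-batch shrinks.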