Accelerate TensorFlow Keras Customized Training Loop Using Multiple Instances

BigDL-Nano provides the nano decorator (used together with nano_multiprocessing and, for custom losses, nano_multiprocessing_loss) to run a Keras model that has a customized training loop on multiple instances.

To use multiple instances for TensorFlow Keras training, you need to install BigDL-Nano for TensorFlow (or for Intel-TensorFlow):

[ ]:
# install the nightly-built version of bigdl-nano for tensorflow;
!pip install --pre --upgrade bigdl-nano[stock_tensorflow_29,inference]
!source bigdl-nano-init  # set environment variables

📝 Note

Before starting your TensorFlow Keras application, it is highly recommended to run source bigdl-nano-init to set several environment variables based on your current hardware. Empirically, these variables bring a significant performance improvement for most TensorFlow Keras training workloads.

⚠️ Warning

For Jupyter Notebook users, we recommend running the commands above, especially source bigdl-nano-init, before the Jupyter kernel is started; otherwise some of the optimizations may not take effect.

⚠️ Warning

Some of the optimized malloc implementations applied by source bigdl-nano-init have been found to cause memory leaks. If this happens, it can be avoided by running unset LD_PRELOAD and unset MALLOC_CONF.

First, define a dummy dataset and model for the example.

[ ]:
from bigdl.nano.tf.keras import nano_multiprocessing, nano
import tensorflow as tf

global_batch_size = 32

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
optimizer = tf.keras.optimizers.SGD()
loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=True)

dataset = tf.data.Dataset.from_tensor_slices(([1.], [1.])).repeat(128).batch(global_batch_size)

Basic usage for multi-process training on customized loop

For customized training, users define their own train_step (typically a tf.function) with its own gradient calculation and weight update logic, as well as a training loop (e.g., train_whole_data in the following code block) that iterates over the full dataset. For details, refer to the TensorFlow tutorial on customized training loops.

To make them run in a multi-process way, you only need to add two lines of code.

  • Add @nano_multiprocessing to the train_step function, which computes and applies the gradients.

  • Add @nano(num_processes=...) to the training loop function, which iterates over the full dataset.

[ ]:
@nano_multiprocessing  # <-- Just remove this line to run on 1 process
def train_step(inputs, model, loss_object, optimizer):
    features, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(features, training=True)
        loss = loss_object(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

@nano(num_processes=2)  # <-- Just remove this line to run on 1 process
def train_whole_data(model, dataset, loss_object, optimizer, train_step):
    for inputs in dataset:
        print(train_step(inputs, model, loss_object, optimizer))

Then call your training loop function as usual. It will run collaboratively on several processes (2 in this case).

[ ]:
train_whole_data(model, dataset, loss_object, optimizer, train_step)
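The two decorators above are applied in different forms: nano_multiprocessing is used bare, while nano is called with an argument and returns the actual decorator. The following pure-Python sketch illustrates only this calling pattern. toy_multiprocessing and toy_nano are hypothetical stand-ins, not BigDL-Nano's implementation, and they do not start any processes.

```python
import functools


def toy_multiprocessing(func):
    """Hypothetical bare decorator: wraps a step function directly."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def toy_nano(num_processes):
    """Hypothetical decorator factory: takes a setting, returns a decorator."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # A real implementation would dispatch work to `num_processes`
            # workers here; this toy version just calls the function once.
            return func(*args, **kwargs)
        wrapper.num_processes = num_processes  # record the setting
        return wrapper
    return decorator


@toy_multiprocessing  # bare form, no parentheses
def step(value):
    return value * 2


@toy_nano(num_processes=2)  # factory form, called with an argument
def loop(values):
    return [step(v) for v in values]


result = loop([1, 2, 3])
```

Removing either decorator leaves a plain function with the same signature, which is why the training code can fall back to a single process by deleting one line.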

📝 Note

When num_processes is set, CPU cores are automatically and evenly distributed among the processes to avoid conflicts and maximize training throughput.

During Nano TensorFlow Keras multi-instance training, the effective batch size is still the batch_size specified in the dataset (32 in this example). This matches the semantics of TensorFlow distributed training (MultiWorkerMirroredStrategy), which splits each batch into sub-batches for the different workers. With 2 processes, each process therefore handles 16 samples per step.
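As a rough illustration of the note above, the sketch below divides the global batch (and a machine's cores) evenly across processes. split_evenly is a hypothetical helper and the core count of 8 is an assumed value; how BigDL-Nano or TensorFlow handles remainders internally is not covered here.

```python
def split_evenly(total, num_parts):
    """Split `total` into `num_parts` shares whose sizes differ by at most 1."""
    base, extra = divmod(total, num_parts)
    return [base + (1 if i < extra else 0) for i in range(num_parts)]


global_batch_size = 32  # batch size specified on the dataset
num_processes = 2       # value passed to @nano(num_processes=...)

# Each process receives one sub-batch; together they form the global batch.
sub_batches = split_evenly(global_batch_size, num_processes)

# Assumed core count, purely for illustration.
assumed_cores = 8
cores_per_process = split_evenly(assumed_cores, num_processes)

print(sub_batches, cores_per_process)
```

The sub-batch sizes always sum to the global batch size, so increasing num_processes changes the work per process, not the effective batch size.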

Advanced usage for customized loss

Sometimes users define their own loss function rather than using a predefined Keras loss. We provide a nano_multiprocessing_loss decorator to support such customized losses.

[ ]:
from tensorflow.keras import backend
from bigdl.nano.tf.keras import nano_multiprocessing_loss

@nano_multiprocessing_loss()  # wrap the custom loss for multi-process training
def loss_object(x, pred):
    res = backend.mean(tf.math.squared_difference(x, pred), axis=-1)
    return res

[ ]:
train_whole_data(model, dataset, loss_object, optimizer, train_step)
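One general reason custom losses need care in multi-process training is scaling: each process sees only a sub-batch, so per-example losses are usually divided by the global batch size, not the local one, so that the summed result matches single-process training. The pure-Python sketch below illustrates that general rule with hypothetical helper names; it is not a description of what nano_multiprocessing_loss does internally.

```python
def per_example_squared_error(labels, preds):
    """Squared error for each (label, prediction) pair."""
    return [(y - p) ** 2 for y, p in zip(labels, preds)]


def scaled_replica_loss(labels, preds, global_batch_size):
    """Sum of per-example losses on one replica, scaled by the GLOBAL batch size."""
    return sum(per_example_squared_error(labels, preds)) / global_batch_size


labels = [1.0, 2.0, 3.0, 4.0]
preds = [0.5, 2.5, 2.0, 5.0]
global_batch_size = len(labels)

# Reference: mean loss computed in a single process over the whole batch.
single_process_loss = (
    sum(per_example_squared_error(labels, preds)) / global_batch_size
)

# Two replicas, each holding half of the batch.
replica_losses = [
    scaled_replica_loss(labels[:2], preds[:2], global_batch_size),
    scaled_replica_loss(labels[2:], preds[2:], global_batch_size),
]

# Summing the scaled replica losses recovers the single-process mean.
combined_loss = sum(replica_losses)
print(single_process_loss, combined_loss)
```

Dividing by the local sub-batch size instead would make the summed loss grow with the number of processes.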

Advanced usage for data generator

Data generators are frequently used when users need to generate data in real time or read a large number of files. In this case, define the generator as a tf.data.Dataset with from_generator, and additionally call dataset._GeneratorState = dataset._GeneratorState(generator).

[ ]:
def dummy_data_generator():
    for i in range(128):
        yield tf.constant([i]), tf.constant([i])

dataset = tf.data.Dataset.from_generator(dummy_data_generator,
                                         output_signature=(tf.TensorSpec(shape=(1,), dtype=tf.float32),
                                                           tf.TensorSpec(shape=(1,), dtype=tf.float32)))

# necessary to initiate dataset._GeneratorState
dataset._GeneratorState = dataset._GeneratorState(dummy_data_generator)

[ ]:
train_whole_data(model, dataset, loss_object, optimizer, train_step)