TensorFlow Training#

BigDL-Nano can be used to accelerate TensorFlow Keras applications on training workloads. The optimizations in BigDL-Nano are delivered through

  • BigDL-Nano’s Model and Sequential classes, which have the same APIs as tf.keras.Model and tf.keras.Sequential but provide an enhanced fit method.

  • BigDL-Nano’s nano decorator (potentially with the help of nano_multiprocessing and nano_multiprocessing_loss), which handles Keras models with customized training loops.

We will briefly describe here the major features in BigDL-Nano for TensorFlow training.
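
Both entry points live in the bigdl.nano.tf.keras module; a minimal import sketch (assuming bigdl-nano[tensorflow] is installed) looks like this:

# Drop-in replacements for tf.keras.Model and tf.keras.Sequential
from bigdl.nano.tf.keras import Model, Sequential
# Decorators for Keras models with customized training loops
from bigdl.nano.tf.keras import nano, nano_multiprocessing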

Best Known Configurations#

When you install BigDL-Nano with pip install bigdl-nano[tensorflow], intel-tensorflow is installed in your environment with Intel’s oneDNN optimizations enabled by default. When you run source bigdl-nano-init, it exports a few environment variables, such as OMP_NUM_THREADS and KMP_AFFINITY, according to your current hardware. Empirically, these settings work best for most TensorFlow applications. Once they are set, you can run your applications as usual (python app.py); no additional changes are required.
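
If you want to check that bigdl-nano-init has taken effect in your current shell, one quick (purely illustrative) way is to inspect the exported variables from Python:

import os

# These variables are exported by `source bigdl-nano-init`; their exact values
# depend on your hardware, so treat this only as a sanity check.
for var in ("OMP_NUM_THREADS", "KMP_AFFINITY"):
    print(var, "=", os.environ.get(var, "<not set>"))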

Multi-Instance Training#

When training on a server with dozens of CPU cores, it is often beneficial to run multiple training instances in a data-parallel fashion to make full use of those cores. However:

  • Naively using TensorFlow’s MultiWorkerMirroredStrategy can cause conflicts over CPU cores and often fails to deliver performance benefits.

  • Customized training loops can be hard to use together with MultiWorkerMirroredStrategy.

BigDL-Nano makes it very easy to conduct multi-instance training correctly for models with either default or customized training loops.

Keras Model with default training loop#

You can simply set the num_processes parameter of the fit method on your Model or Sequential object, and BigDL-Nano will launch that number of processes to perform data-parallel training. Each process is automatically pinned to a different subset of CPU cores to avoid contention and maximize training throughput.

import tensorflow as tf
from tensorflow.keras import layers
from bigdl.nano.tf.keras import Sequential

# Placeholder values and synthetic datasets so the example is self-contained;
# replace them with your own image size, class count and tf.data pipelines.
img_height, img_width, num_classes = 180, 180, 5
images = tf.random.uniform((64, img_height, img_width, 3))
labels = tf.random.uniform((64,), maxval=num_classes, dtype=tf.int32)
train_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(32)
val_ds = tf.data.Dataset.from_tensor_slices((images, labels)).batch(32)

model = Sequential([
    layers.Rescaling(1. / 255, input_shape=(img_height, img_width, 3)),
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(num_classes)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(train_ds, epochs=3, validation_data=val_ds, num_processes=2)

Keras Model with customized training loop#

To run a customized training loop in a multi-process way, you only need to add two lines of code:

  • Add the nano_multiprocessing decorator to the train_step function that calculates and applies the gradients.

  • Add the @nano(num_processes=...) decorator to the training-loop function that iterates over the full dataset.

from bigdl.nano.tf.keras import nano_multiprocessing, nano
import tensorflow as tf

tf.random.set_seed(0)
global_batch_size = 32

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
optimizer = tf.keras.optimizers.SGD()
loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=True)

dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(128).batch(
    global_batch_size)

@nano_multiprocessing  # <-- Just remove this line to run on 1 process
@tf.function
def train_step(inputs, model, loss_object, optimizer):
    features, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(features, training=True)
        loss = loss_object(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

@nano(num_processes=4)  # <-- Just remove this line to run on 1 process
def train_whole_data(model, dataset, loss_object, optimizer, train_step):
    for inputs in dataset:
        print(train_step(inputs, model, loss_object, optimizer))

# Launch the (multi-process) training loop
train_whole_data(model, dataset, loss_object, optimizer, train_step)

Note that, unlike the convention in BigDL-Nano PyTorch multi-instance training, the effective batch size does not change in TensorFlow multi-instance training: it is still the batch size you specify in your dataset. This is because TensorFlow’s MultiWorkerMirroredStrategy splits each batch into sub-batches for the different workers. We chose this behavior to match the semantics of TensorFlow distributed training.
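
For example, with the dataset batch size of 32 and @nano(num_processes=4) from the snippet above, the effective batch size stays 32 while each worker processes a sub-batch per step; a small illustration of that arithmetic (not additional API):

global_batch_size = 32   # the batch size set on the tf.data.Dataset (effective batch size)
num_processes = 4        # workers launched by @nano(num_processes=4)
per_worker_batch_size = global_batch_size // num_processes
print(per_worker_batch_size)  # 8 samples per worker per step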

When you do want to increase the effective batch size, you can do so by changing it directly in your dataset definition; in that case you may also want to scale the learning rate linearly with the batch size (warming it up gradually), as described in this paper published by Facebook.
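
A minimal sketch of that linear scaling rule (the numbers are placeholders, not recommendations):

import tensorflow as tf

base_lr = 0.01            # learning rate tuned for the original batch size
base_batch_size = 32      # batch size the learning rate was tuned for
new_batch_size = 128      # larger effective batch size set in the dataset definition

# Linear scaling rule: scale the learning rate by the same factor as the batch size
scaled_lr = base_lr * new_batch_size / base_batch_size
optimizer = tf.keras.optimizers.SGD(learning_rate=scaled_lr)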