TensorFlow Training#
BigDL-Nano can be used to accelerate TensorFlow Keras applications on training workloads. The optimizations in BigDL-Nano are delivered through
BigDL-Nano’s
Model
andSequential
classes, which have identical APIs withtf.keras.Model
andtf.keras.Sequential
with an enhancedfit
method.BigDL-Nano’s decorator
nano
(potentially with the help ofnano_multiprocessing
andnano_multiprocessing_loss
) to handle keras model with customized training loop.
We will briefly describe here the major features in BigDL-Nano for TensorFlow training.
Best Known Configurations#
When you install BigDL-Nano by pip install bigdl-nano[tensorflow]
, intel-tensorflow
will be installed in your environment, which has intel’s oneDNN optimizations enabled by default; and when you run source bigdl-nano-init
, it will export a few environment variables, such as OMP_NUM_THREADS
and KMP_AFFINITY
, according to your current hardware. Empirically, these environment variables work best for most TensorFlow applications. After setting these environment variables, you can just run your applications as usual (python app.py
) and no additional changes are required.
Multi-Instance Training#
When training on a server with dozens of CPU cores, it is often beneficial to use multiple training instances in a data-parallel fashion to make full use of the CPU cores. However
Naively using TensorFlow’s
MultiWorkerMirroredStrategy
can cause conflict in CPU cores and often cannot provide performance benefits.Customized training loop could be hard to use together with
MultiWorkerMirroredStrategy
BigDL-Nano makes it very easy to conduct multi-instance training correctly for default/customized training loop models.
Keras Model with default training loop#
You can just set the num_processes
parameter in the fit
method in your Model
or Sequential
object and BigDL-Nano will launch the specific number of processes to perform data-parallel training. Each process will be automatically pinned to a different subset of CPU cores to avoid conflict and maximize training throughput.
import tensorflow as tf
from tensorflow.keras import layers
from bigdl.nano.tf.keras import Sequential
model = Sequential([
layers.Rescaling(1. / 255, input_shape=(img_height, img_width, 3)),
layers.Conv2D(16, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Conv2D(32, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Conv2D(64, 3, padding='same', activation='relu'),
layers.MaxPooling2D(),
layers.Flatten(),
layers.Dense(128, activation='relu'),
layers.Dense(num_classes)
])
model.compile(optimizer='adam',
loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
metrics=['accuracy'])
model.fit(train_ds, epochs=3, validation_data=val_ds, num_processes=2)
Keras Model with customized training loop#
To make them run in a multi-process way, you may only add 2 lines of code.
add
nano_multiprocessing
to thetrain_step
function with gradient calculation and applying process.add
@nano(num_processes=...)
to the training loop function with iteration over full dataset.
from bigdl.nano.tf.keras import nano_multiprocessing, nano
import tensorflow as tf
tf.random.set_seed(0)
global_batch_size = 32
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
optimizer = tf.keras.optimizers.SGD()
loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=True)
dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(128).batch(
global_batch_size)
@nano_multiprocessing # <-- Just remove this line to run on 1 process
@tf.function
def train_step(inputs, model, loss_object, optimizer):
features, labels = inputs
with tf.GradientTape() as tape:
predictions = model(features, training=True)
loss = loss_object(labels, predictions)
gradients = tape.gradient(loss, model.trainable_variables)
optimizer.apply_gradients(zip(gradients, model.trainable_variables))
return loss
@nano(num_processes=4) # <-- Just remove this line to run on 1 process
def train_whole_data(model, dataset, loss_object, optimizer, train_step):
for inputs in dataset:
print(train_step(inputs, model, loss_object, optimizer))
Note that, different from the conventions in BigDL-Nano PyTorch multi-instance training, the effective batch size will not change in TensorFlow multi-instance training, which means it is still the batch size you specify in your dataset. This is because TensorFlow’s MultiWorkerMirroredStrategy
will try to split the batch into multiple sub-batches for different workers. We chose this behavior to match the semantics of TensorFlow distributed training.
When you do want to increase your effective batch_size
, you can do so by directly changing it in your dataset definition and you may also want to gradually increase the learning rate linearly to the batch_size
, as described in this paper published by Facebook.