View the runnable example on GitHub
Accelerate TensorFlow Keras Customized Training Loop Using Multiple Instances#
BigDL-Nano provides a decorator nano (potentially with the help of nano_multiprocessing and nano_multiprocessing_loss) to handle multi-instance training of Keras models with customized training loops.
To use multiple instances for TensorFlow Keras training, you need to install BigDL-Nano for TensorFlow (or Intel-TensorFlow):
[ ]:
# install the nightly-built version of bigdl-nano for tensorflow;
!pip install --pre --upgrade bigdl-nano[stock_tensorflow_29,inference]
!source bigdl-nano-init # set environment variables
📝 Note
Before starting your TensorFlow Keras application, it is highly recommended to run
source bigdl-nano-init
to set several environment variables based on your current hardware. Empirically, these variables will bring a big performance increase for most TensorFlow Keras applications on training workloads.
⚠️ Warning
For Jupyter Notebook users, we recommend running the commands above, especially
source bigdl-nano-init
, before the Jupyter kernel is started, or some of the optimizations may not take effect.
⚠️ Warning
It has been found that some of the optimized malloc implementations applied by
source bigdl-nano-init
may cause a memory leak. It can be avoided by unset LD_PRELOAD
and unset MALLOC_CONF
.
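If you have already started a session with these variables set and run into this issue, a minimal workaround sketch (assuming you only need to clear the two variables named above for processes spawned afterwards) is:
[ ]:
import os

# Sketch only: drop the malloc-related variables mentioned above from this
# process's environment so that newly spawned processes do not inherit them.
# Libraries already preloaded into the current process are unaffected, so
# restarting the kernel/shell after `unset` is the more reliable fix.
for var in ("LD_PRELOAD", "MALLOC_CONF"):
    os.environ.pop(var, None)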
We first define a dummy dataset and model for the example.
[ ]:
from bigdl.nano.tf.keras import nano_multiprocessing, nano
import tensorflow as tf
tf.random.set_seed(0)
global_batch_size = 32
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
optimizer = tf.keras.optimizers.SGD()
loss_object = tf.keras.losses.BinaryCrossentropy(from_logits=True)
dataset = tf.data.Dataset.from_tensors(([1.], [1.])).repeat(128).batch(
    global_batch_size)
Basic usage for multi-process training on customized loop#
For customized training, users will define a personalized train_step (typically a tf.function) with their own gradient calculation and weight updating methods, as well as a training loop (e.g., train_whole_data in the following code block) to iterate over the full dataset. For detailed information, you may refer to the TensorFlow tutorial for customized training loops.
To make them run in a multi-process way, you only need to add 2 lines of code:
- Add nano_multiprocessing to the train_step function that carries out the gradient calculation and weight updating process.
- Add @nano(num_processes=...) to the training loop function that iterates over the full dataset.
[ ]:
@nano_multiprocessing # <-- Just remove this line to run on 1 process
@tf.function
def train_step(inputs, model, loss_object, optimizer):
    features, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(features, training=True)
        loss = loss_object(labels, predictions)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss
@nano(num_processes=2) # <-- Just remove this line to run on 1 process
def train_whole_data(model, dataset, loss_object, optimizer, train_step):
    for inputs in dataset:
        print(train_step(inputs, model, loss_object, optimizer))
Then run your training loop function as normal; the workload will magically run on several (e.g., 2 in this case) processes collaboratively.
[ ]:
train_whole_data(model, dataset, loss_object, optimizer, train_step)
📝 Note
By setting num_processes, CPU cores will be automatically and evenly distributed among processes to avoid conflicts and maximize training throughput.
During Nano TensorFlow Keras multi-instance training, the effective batch size is still the batch_size specified in the datasets (32 in this example), because we choose to match the semantics of TensorFlow distributed training (MultiWorkerMirroredStrategy), which intends to split the batch into multiple sub-batches for different workers.
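As a rough illustration of this splitting semantics (a sketch only, not part of the Nano API), the sub-batch size each process sees can be derived from the global batch size:
[ ]:
# Illustration only: with MultiWorkerMirroredStrategy-like semantics, each of
# the `num_processes` workers handles a sub-batch per step, while gradients
# are aggregated across workers, so the effective batch size stays
# `global_batch_size`.
global_batch_size = 32
num_processes = 2  # the value passed to @nano(num_processes=...)
per_worker_batch_size = global_batch_size // num_processes
print(per_worker_batch_size)  # 16 samples per process per step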
Advanced usage for customized loss#
Sometimes users may define their own loss function rather than using a pre-defined Keras loss. We provide a nano_multiprocessing_loss decorator to support such customized loss functions.
[ ]:
from tensorflow.keras import backend
from bigdl.nano.tf.keras import nano_multiprocessing_loss
@nano_multiprocessing_loss()
def loss_object(x, pred):
    res = backend.mean(tf.math.squared_difference(x, pred), axis=-1)
    return res
[ ]:
train_whole_data(model, dataset, loss_object, optimizer, train_step)
Advanced Usage for Data Generator#
Data generators are frequently used by users who need to carry out real-time data generation or read a large number of files. In this case, users should define the dataset as a TF dataset via from_generator and additionally call dataset._GeneratorState = dataset._GeneratorState(generator).
[ ]:
def dummy_data_generator():
    for i in range(128):
        # yield float32 tensors to match the output_signature declared below
        yield tf.constant([i], dtype=tf.float32), tf.constant([i], dtype=tf.float32)

dataset = tf.data.Dataset.from_generator(dummy_data_generator,
                                         output_signature=(tf.TensorSpec(shape=(1,), dtype=tf.float32),
                                                           tf.TensorSpec(shape=(1,), dtype=tf.float32)))

# necessary to initiate dataset._GeneratorState
dataset._GeneratorState = dataset._GeneratorState(dummy_data_generator)
[ ]:
train_whole_data(model, dataset, loss_object, optimizer, train_step)