BigDL-Nano TensorFlow Training Overview¶
BigDL-Nano can be used to accelerate TensorFlow Keras applications on training workloads. The optimizations in BigDL-Nano are delivered through BigDL-Nano’s
Sequential class, which has an identical API to
tf.keras.Sequential. In most cases, you can simply replace your
tf.keras.Sequential with bigdl.nano.tf.keras.Sequential to benefit from BigDL-Nano.
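For example, switching an existing Keras model over typically only requires changing the import; the rest of the code stays the same (a minimal sketch, the layer definitions below are placeholders):

import tensorflow as tf
from tensorflow.keras import layers

# from tensorflow.keras import Sequential        # before
from bigdl.nano.tf.keras import Sequential       # after: drop-in replacement

model = Sequential([
    layers.Dense(64, activation='relu'),
    layers.Dense(10)
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])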
Here we briefly describe the major features in BigDL-Nano for TensorFlow training. You can find complete examples here (links to be added).
Best Known Configurations¶
When you install BigDL-Nano with
pip install bigdl-nano[tensorflow], intel-tensorflow, which has Intel’s oneDNN optimizations enabled by default, will be installed in your environment. When you run
source bigdl-nano-init, it exports a few environment variables, such as OMP_NUM_THREADS and KMP_AFFINITY, according to your current hardware. Empirically, these environment variables work best for most TensorFlow applications. After setting them, you can run your applications as usual (
python app.py) and no additional changes are required.
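If you want to confirm that the init script took effect in your current shell, you can inspect the exported variables from within Python (a minimal sketch using only the standard library; the actual values depend on your hardware):

import os

# These variables are exported by `source bigdl-nano-init`; values vary by machine.
for var in ("OMP_NUM_THREADS", "KMP_AFFINITY"):
    print(var, "=", os.environ.get(var, "<not set>"))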
When training on a server with dozens of CPU cores, it is often beneficial to run multiple training instances in a data-parallel fashion to make full use of the CPU cores. However, naively using TensorFlow’s
MultiWorkerMirroredStrategy can cause conflicts over CPU cores and often fails to provide performance benefits.
BigDL-Nano makes it very easy to conduct multi-instance training correctly. You can just set the
num_processes parameter in the
fit method of your
Sequential object, and BigDL-Nano will launch the specified number of processes to perform data-parallel training, as shown in the example below. Each process is automatically pinned to a different subset of CPU cores to avoid conflicts and maximize training throughput.
import tensorflow as tf
from tensorflow.keras import layers
from bigdl.nano.tf.keras import Sequential

model = Sequential([
    layers.Rescaling(1. / 255, input_shape=(img_height, img_width, 3)),
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(num_classes)
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

model.fit(train_ds, epochs=3, validation_data=val_ds, num_processes=2)
Note that, unlike the convention in PyTorch, the effective batch size does not change in TensorFlow multi-instance training: it remains the batch size you specify in your dataset. This is because TensorFlow’s
MultiWorkerMirroredStrategy splits each batch into sub-batches for the different workers. We chose this behavior to match the semantics of TensorFlow distributed training.
When you do want to increase your effective batch size, you can do so by changing it directly in your dataset definition; you may also want to scale the learning rate linearly with the batch size (possibly with a gradual warmup), as described in the Facebook paper. A sketch of this is shown below.
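The following is a minimal sketch of this idea, assuming a hypothetical base batch size and base learning rate; the dataset here is a random stand-in for your real training data:

import tensorflow as tf

# Hypothetical base settings for illustration.
base_batch_size = 32
base_lr = 1e-3
num_processes = 2

# Stand-in for your real, unbatched training dataset.
raw_train_ds = tf.data.Dataset.from_tensor_slices(
    (tf.random.uniform([256, 8]),
     tf.random.uniform([256], maxval=10, dtype=tf.int32)))

# Increase the effective batch size directly in the dataset definition...
global_batch_size = base_batch_size * num_processes
train_ds = raw_train_ds.batch(global_batch_size)

# ...and scale the learning rate linearly with the batch size (linear scaling rule).
scaled_lr = base_lr * (global_batch_size / base_batch_size)
optimizer = tf.keras.optimizers.Adam(learning_rate=scaled_lr)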