View the runnable example on GitHub

Apply SparseAdam Optimizer for Large Embeddings

Embedding layers are often used to encode categorical items in deep learning applications. However, in applications such as recommendation systems, the embedding size may become huge due to the large number of items or users, leading to significant computational and memory costs.

For large embeddings, the batch size is often orders of magnitude smaller than the embedding matrix, so the gradients of the embedding matrix in each batch are sparse. Taking advantage of this, BigDL-Nano provides bigdl.nano.tf.keras.layers.Embedding and bigdl.nano.tf.optimizers.SparseAdam to accelerate large embeddings. bigdl.nano.tf.optimizers.SparseAdam is a variant of Adam that handles updates of sparse tensors more efficiently. bigdl.nano.tf.keras.layers.Embedding avoids applying the regularizer function directly to the embedding matrix, which would otherwise make the sparse gradient dense.

📝 Note

Before starting your TensorFlow Keras application, it is highly recommended to run source bigdl-nano-init to set several environment variables based on your current hardware. Empirically, these variables bring a large performance increase for most TensorFlow Keras training workloads.

To optimize your model for large embedding, you need to import Nano’s Embedding and SparseAdam first:

[ ]:
import tensorflow as tf

from bigdl.nano.tf.keras.layers import Embedding
from bigdl.nano.tf.optimizers import SparseAdam

# instead of: from tensorflow.keras import Model
from bigdl.nano.tf.keras import Model

📝 Note

You could import Model/Sequential from bigdl.nano.tf.keras instead of tf.keras to gain more optimizations from Nano. Please refer to the API documentation for more information.

Let’s take the imdb_reviews dataset as an example, and suppose we would like to train a model to classify movie reviews as positive/negative. Assuming that the vocabulary size of the reviews is \(20000\) and we want each word vector to have a length of \(128\), we would have a large embedding matrix of size \(20000 \times 128\).

To prepare the data for training, we need to process the samples as sequences of positive integers:

[ ]:
train_ds, val_ds, test_ds = create_datasets()

      The definition of create_datasets can be found in the runnable example.
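
As a rough illustration only, a minimal create_datasets could look like the sketch below. It assumes tensorflow_datasets is available and uses a TextVectorization layer capped at \(20000\) tokens; the split ratios, sequence length, and batch size are illustrative assumptions, not the values used in the runnable example:

import tensorflow as tf
import tensorflow_datasets as tfds

def create_datasets(vocab_size=20000, sequence_length=500, batch_size=32):
    # Load raw imdb_reviews text; split the training set into train/validation
    raw_train, raw_val, raw_test = tfds.load(
        "imdb_reviews",
        split=["train[:80%]", "train[80%:]", "test"],
        as_supervised=True,
    )
    # Map each review to a fixed-length sequence of positive integer token ids
    vectorizer = tf.keras.layers.TextVectorization(
        max_tokens=vocab_size,
        output_mode="int",
        output_sequence_length=sequence_length,
    )
    vectorizer.adapt(raw_train.map(lambda text, label: text).batch(256))

    def vectorize(text, label):
        return vectorizer(text), label

    train_ds = raw_train.batch(batch_size).map(vectorize)
    val_ds = raw_val.batch(batch_size).map(vectorize)
    test_ds = raw_test.batch(batch_size).map(vectorize)
    return train_ds, val_ds, test_ds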

We could then define the model. Just as with tf.keras.layers.Embedding, you can instantiate Nano’s Embedding layer as the first layer in the model:

[ ]:
inputs = tf.keras.Input(shape=(None,), dtype="int64")

# 20000 is the vocabulary size,
# 128 is the embedding dimension
x = Embedding(input_dim=20000, output_dim=128)(inputs)

📝 Note

If you would like to apply a regularizer function to the embedding matrix by setting embeddings_regularizer, Nano will instead apply the regularizer to the output tensors of the embedding layer, which avoids making the sparse gradient dense (provided activity_regularizer=None).

Please refer to the API documentation for more information on bigdl.nano.tf.keras.layers.Embedding.
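
For instance, a hypothetical configuration that attaches an L2 regularizer might look like the snippet below; with Nano’s Embedding, the penalty would be applied to the layer’s output tensors rather than to the embedding matrix itself (the 1e-5 weight is an arbitrary illustrative value):

from tensorflow.keras import regularizers

# The L2 weight here is only for illustration
x = Embedding(input_dim=20000, output_dim=128,
              embeddings_regularizer=regularizers.l2(1e-5))(inputs)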

Next, you could define the remaining parts of the model, and configure the model for training with the SparseAdam optimizer:

[ ]:
# define the remaining layers of the model
predictions = make_backbone()(x)
model = Model(inputs, predictions)

# Configure the model with Nano's SparseAdam optimizer
model.compile(loss="binary_crossentropy", optimizer=SparseAdam(), metrics=["accuracy"])

      The definition of make_backbone can be found in the runnable example.
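
Its exact definition lives in the runnable example; one plausible sketch, modeled on a typical Keras text-classification head and offered here purely as an assumption, is:

def make_backbone():
    # Convolutional text-classification head ending in a sigmoid for binary labels
    return tf.keras.Sequential([
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Conv1D(128, 7, strides=3, padding="valid", activation="relu"),
        tf.keras.layers.Conv1D(128, 7, strides=3, padding="valid", activation="relu"),
        tf.keras.layers.GlobalMaxPooling1D(),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])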

📝 Note

The SparseAdam optimizer is a variant of tf.keras.optimizers.Adam. It only updates the moments that appear in the gradient, and applies only those portions of the gradient to the trainable variables.

Please refer to the API documentation for more information on bigdl.nano.tf.optimizers.SparseAdam.
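
Since SparseAdam is described as a variant of Adam, it is reasonable to expect that common Adam hyperparameters can be passed to it as well; treat the exact constructor signature below as an assumption and verify it against the API documentation:

# Assumption: SparseAdam accepts Adam-style arguments such as learning_rate
optimizer = SparseAdam(learning_rate=1e-3)
model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"])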

You could then train and evaluate your model as normal:

[ ]:
model.fit(train_ds, validation_data=val_ds, epochs=10)
model.evaluate(test_ds)

📚 Related Readings