View the runnable example on GitHub
Apply SparseAdam Optimizer for Large Embeddings
Embedding layers are often used to encode categorical items in deep learning applications. However, in applications such as recommendation systems, the embedding size may become huge due to the large number of items or users, leading to high computational and memory costs.
For large embeddings, the batch size can be orders of magnitude smaller than the embedding matrix, so the gradients to the embedding matrix in each batch are typically sparse. Taking advantage of this, BigDL-Nano provides bigdl.nano.tf.keras.layers.Embedding and bigdl.nano.tf.optimizers.SparseAdam to accelerate large embeddings. bigdl.nano.tf.optimizers.SparseAdam is a variant of Adam that handles updates of sparse tensors more efficiently. bigdl.nano.tf.keras.layers.Embedding avoids applying the regularizer function directly to the embedding matrix, which would otherwise make the sparse gradient dense.
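The idea behind a sparse Adam update can be illustrated in plain Python: a batch looks up only a few rows of the embedding matrix, so only those rows' moment estimates and parameters need updating. Below is a minimal, framework-free sketch of that row-wise update; it is not Nano's actual implementation, and the function name sparse_adam_step and the dict-of-rows gradient format are purely illustrative.

```python
import math

def sparse_adam_step(params, m, v, grads, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-7):
    """Apply one Adam step only to the rows that received gradients.

    params, m, v: lists of row vectors (the embedding matrix and its
    first/second moment estimates); grads: {row_index: gradient_vector}.
    Rows absent from grads are left untouched, which is the key saving
    when a batch touches only a few rows of a huge embedding matrix.
    """
    for i, g in grads.items():
        m[i] = [b1 * mi + (1 - b1) * gi for mi, gi in zip(m[i], g)]
        v[i] = [b2 * vi + (1 - b2) * gi * gi for vi, gi in zip(v[i], g)]
        m_hat = [mi / (1 - b1 ** t) for mi in m[i]]     # bias correction
        v_hat = [vi / (1 - b2 ** t) for vi in v[i]]
        params[i] = [p - lr * mh / (math.sqrt(vh) + eps)
                     for p, mh, vh in zip(params[i], m_hat, v_hat)]

# A toy 4-row "embedding matrix" of dimension 2; the batch only touches rows 1 and 3.
params = [[0.5, 0.5] for _ in range(4)]
m = [[0.0, 0.0] for _ in range(4)]
v = [[0.0, 0.0] for _ in range(4)]
sparse_adam_step(params, m, v, grads={1: [0.1, -0.2], 3: [0.3, 0.0]}, t=1)
print(params[0])  # [0.5, 0.5] -- a row outside the batch keeps its original values
```

A dense Adam step would instead loop over all rows, updating moments even where the gradient is zero; for a 20000-row matrix and a small batch, skipping untouched rows is where the savings come from.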
📝 Note
Before starting your TensorFlow Keras application, it is highly recommended to run
source bigdl-nano-init
to set several environment variables based on your current hardware. Empirically, these variables bring a big performance boost for most TensorFlow Keras training workloads.
To optimize your model for large embeddings, you need to import Nano's Embedding and SparseAdam first:
[ ]:
from bigdl.nano.tf.keras.layers import Embedding
from bigdl.nano.tf.optimizers import SparseAdam
# from tf.keras import Model
from bigdl.nano.tf.keras import Model
📝 Note
You could import Model/Sequential from bigdl.nano.tf.keras instead of tf.keras to gain more optimizations from Nano. Please refer to the API documentation for more information.
Let’s take the imdb_reviews dataset as an example, and suppose we would like to train a model to classify movie reviews as positive or negative. Assuming the vocabulary size of the reviews is \(20000\) and we fix the word vector length to \(128\), we would have a big embedding matrix of size \(20000 \times 128\).
To prepare the data for training, we need to process the samples as sequences of positive integers:
[ ]:
train_ds, val_ds, test_ds = create_datasets()
The definition of create_datasets can be found in the runnable example.
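Conceptually, preparing the data means mapping each tokenized review to a fixed-length sequence of positive word indices. As a framework-free sketch of that truncation/padding step (the helper name pad_sequence and its parameters are illustrative, not part of the example's create_datasets):

```python
def pad_sequence(ids, maxlen, pad_value=0):
    """Truncate or right-pad a list of word indices to a fixed length."""
    return ids[:maxlen] + [pad_value] * max(0, maxlen - len(ids))

# Two tokenized reviews of different lengths, padded/truncated to length 6.
batch = [pad_sequence(ids, maxlen=6)
         for ids in ([4, 17, 9], [2, 8, 5, 1, 3, 7, 11])]
print(batch)  # [[4, 17, 9, 0, 0, 0], [2, 8, 5, 1, 3, 7]]
```

In practice the example's datasets would be built with TensorFlow utilities, but the resulting batches have this same shape: integer sequences of uniform length, ready to feed the embedding layer.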
We could then define the model. Just as with tf.keras.layers.Embedding, you could instantiate Nano's Embedding layer as the first layer in the model:
[ ]:
inputs = tf.keras.Input(shape=(None,), dtype="int64")
# 20000 is the vocabulary size,
# 128 is the embedding dimension
x = Embedding(input_dim=20000, output_dim=128)(inputs)
📝 Note
If you would like to apply a regularizer function to the embedding matrix by setting embeddings_regularizer, Nano will instead apply the regularizer to the output tensors of the embedding layer, to avoid making the sparse gradient dense (if activity_regularizer=None). Please refer to the API document for more information on bigdl.nano.tf.keras.layers.Embedding.
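The note above can be illustrated with a toy L2 penalty: regularizing the whole matrix makes every row contribute to the regularization gradient, while regularizing only the rows looked up in the batch keeps the gradient sparse. A framework-free sketch (the function names are illustrative):

```python
def l2_on_matrix(matrix):
    # Penalty over every row: the regularization gradient is dense,
    # touching all rows of the embedding matrix.
    return sum(w * w for row in matrix for w in row)

def l2_on_outputs(matrix, batch_indices):
    # Penalty over the looked-up rows only: the regularization gradient
    # stays confined to the rows the batch actually used.
    return sum(w * w for i in batch_indices for w in matrix[i])

matrix = [[0.1] * 4 for _ in range(1000)]    # toy 1000 x 4 embedding matrix
full = l2_on_matrix(matrix)                  # involves all 1000 rows
batch = l2_on_outputs(matrix, [3, 42, 7])    # involves only the 3 batch rows
print(full > batch)  # True
```

This is why moving the penalty from the matrix to the layer outputs preserves the sparsity that SparseAdam exploits.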
Next, you could define the remaining parts of the model, and configure the model for training with the SparseAdam optimizer:
[ ]:
# define the remaining layers of the model
predictions = make_backbone()(x)
model = Model(inputs, predictions)
# Configure the model with Nano's SparseAdam optimizer
model.compile(loss="binary_crossentropy", optimizer=SparseAdam(), metrics=["accuracy"])
The definition of make_backbone can be found in the runnable example.
📝 Note
SparseAdam is a variant of tf.keras.optimizers.Adam. This method only updates moments that show up in the gradient, and applies only those portions of the gradient to the trainable variables. Please refer to the API document for more information on bigdl.nano.tf.optimizers.SparseAdam.
You could then train and evaluate your model as usual:
[ ]:
model.fit(train_ds, validation_data=val_ds, epochs=10)
model.evaluate(test_ds)