The Orca Library¶
1. Overview¶
Most AI projects start as a Python notebook running on a single laptop; however, scaling it up to handle larger datasets in a distributed fashion usually involves a great deal of pain. The Orca library seamlessly scales out your single-node Python notebook across large clusters, so as to process distributed Big Data.
2. Install¶
We recommend using conda to prepare the Python environment.
conda create -n py37 python=3.7 # "py37" is conda environment name, you can use any name you like.
conda activate py37
pip install bigdl-orca
When installing bigdl-orca with pip, you can specify the extras key [ray] to install the additional dependencies required for running RayOnSpark:
pip install bigdl-orca[ray]
You can install the nightly release of bigdl-orca using
pip install --pre --upgrade bigdl-orca
pip install --pre --upgrade bigdl-orca[ray]
3. Run¶
This section uses TensorFlow 1.15; install it before running the example:
pip install tensorflow==1.15
First, initialize Orca Context:
from bigdl.orca import init_orca_context, OrcaContext
# cluster_mode can be "local", "k8s" or "yarn"
sc = init_orca_context(cluster_mode="local", cores=4, memory="10g", num_nodes=1)
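The same script can run unchanged on a cluster by switching cluster_mode; a minimal sketch for YARN mode (the resource numbers are illustrative, and a configured Hadoop environment is assumed). It is also good practice to stop the context when the program finishes:

```python
from bigdl.orca import init_orca_context, stop_orca_context

# Run the same code on a YARN cluster simply by changing cluster_mode;
# the resource numbers below are illustrative.
sc = init_orca_context(cluster_mode="yarn", cores=4, memory="10g", num_nodes=2)

# ... your data processing and training code ...

stop_orca_context()  # release the cluster resources at the end
```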
Next, perform data-parallel processing in Orca (supporting standard Spark DataFrames, TensorFlow Dataset, PyTorch DataLoader, Pandas, etc.):
from pyspark.sql.functions import array
spark = OrcaContext.get_spark_session()
df = spark.read.parquet(file_path)  # file_path points to your Parquet dataset
df = df.withColumn('user', array('user')) \
.withColumn('item', array('item'))
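The withColumn(..., array(...)) calls above wrap each scalar column into a length-1 array so that it matches a model input of shape [1]. A framework-free sketch of the same transformation on plain Python dicts (the row values here are made up for illustration):

```python
# Illustrative rows; in the real pipeline these come from the Parquet file.
rows = [{"user": 3, "item": 7, "label": 1},
        {"user": 5, "item": 2, "label": 0}]

# Wrap each scalar feature into a length-1 list, mirroring
# withColumn('user', array('user')) in the Spark code above.
wrapped = [dict(r, user=[r["user"]], item=[r["item"]]) for r in rows]

print(wrapped[0]["user"])  # [3]
```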
Finally, use sklearn-style Estimator APIs in Orca to perform distributed TensorFlow, PyTorch, Keras and BigDL training and inference:
from tensorflow import keras
from bigdl.orca.learn.tf.estimator import Estimator
user = keras.layers.Input(shape=[1])
item = keras.layers.Input(shape=[1])
feat = keras.layers.concatenate([user, item], axis=1)
predictions = keras.layers.Dense(2, activation='softmax')(feat)
model = keras.models.Model(inputs=[user, item], outputs=predictions)
model.compile(optimizer='rmsprop',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
est = Estimator.from_keras(keras_model=model)
est.fit(data=df,
batch_size=64,
epochs=4,
feature_cols=['user', 'item'],
label_cols=['label'])
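As a sanity check on the loss choice: with sparse_categorical_crossentropy, the labels in label_cols are plain integer class indices (0 or 1 for the two-unit softmax head above), not one-hot vectors. A hand computation with made-up logits illustrates what the loss measures:

```python
import math

# Example pre-softmax outputs of the Dense(2) head; values are made up.
logits = [2.0, 0.5]
exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]   # softmax over the two classes

label = 0                           # integer class index, as in label_cols
loss = -math.log(probs[label])      # sparse categorical cross-entropy ~ 0.2014
```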
Get Started¶
See the TensorFlow and PyTorch quickstarts for more details.