Databricks User Guide


You can run BigDL programs on a Databricks cluster as follows.

1. Create a Databricks cluster

  • Create either an AWS Databricks workspace or an Azure Databricks workspace.

  • Create a Databricks cluster using the UI and choose a Databricks runtime version. This guide was tested on Runtime 7.3 (includes Apache Spark 3.0.1, Scala 2.12).

2. Install BigDL Python libraries

In the left pane, click Clusters and select your cluster.

../../_images/cluster.png

Install the BigDL DLLib Python environment using the prebuilt release wheel package. Click Libraries > Install New > Upload > Python Whl. Download the BigDL DLLib prebuilt wheel here, choosing a timestamped wheel built for the same Spark version and platform as your Databricks runtime, and drop it on Databricks.

../../_images/dllib-whl.png

Install the BigDL Orca Python environment using the prebuilt release wheel package. Click Libraries > Install New > Upload > Python Whl. Download the BigDL Orca prebuilt wheel here, choosing a timestamped wheel built for the same Spark version and platform as your Databricks runtime, and drop it on Databricks.

../../_images/orca-whl.png

If you want to use other BigDL libraries (Friesian, Chronos, Nano, Serving, etc.), download their prebuilt release wheel packages from here and install them on the cluster in the same way.
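Once the wheels are installed, you can sanity-check the Python environment from a notebook cell. Below is a minimal sketch, assuming the wheels were published under the package names bigdl-dllib and bigdl-orca:

# Verify that the BigDL Python packages are importable and report their versions
import pkg_resources
import bigdl.dllib
import bigdl.orca
print(pkg_resources.get_distribution("bigdl-dllib").version)
print(pkg_resources.get_distribution("bigdl-orca").version)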

3. Install BigDL Java libraries

Install the BigDL DLLib prebuilt jar package. Click Libraries > Install New > Upload > Jar. Download the BigDL DLLib prebuilt package from the Release Page; note that you should choose the package built for the same Spark version as your Databricks runtime. Find the jar named “bigdl-dllib-spark_*-jar-with-dependencies.jar” in the lib directory and drop it on Databricks.

../../_images/dllib-jar.png

Install the BigDL Orca prebuilt jar package. Click Libraries > Install New > Upload > Jar. Download the BigDL Orca prebuilt package from the Release Page; note that you should choose the package built for the same Spark version as your Databricks runtime. Find the jar named “bigdl-orca-spark_*-jar-with-dependencies.jar” in the lib directory and drop it on Databricks.

../../_images/orca-jar.png

If you want to use other BigDL libraries (Friesian, Chronos, Nano, Serving, etc.), download their prebuilt jar packages from the Release Page and install them on the cluster in the same way.
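To confirm that the jars are visible on the cluster's classpath, you can try loading a BigDL class through the JVM gateway in a notebook cell. This is only a sketch: the class path below is an assumption based on BigDL 2.x package naming, so substitute a class you know is in the jar if it differs:

# Try to resolve a BigDL class from the installed jar (the class path is an assumption)
sc._jvm.java.lang.Class.forName("com.intel.analytics.bigdl.dllib.utils.Engine")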

Make sure the jar files and wheel files are installed on all clusters. In the Libraries tab of your cluster, check the installed libraries and enable the “Install automatically on all clusters” option in Admin Settings.

../../_images/apply-all.png

4. Set the Spark configuration

On the cluster configuration page, click the Advanced Options toggle, then click the Spark tab. You can provide custom Spark configuration properties in the cluster configuration; set them according to your cluster resources and program needs.

../../_images/Databricks5.PNG

See below for an example of the Spark configuration needed by BigDL. Here it sets 2 cores per executor. Note that “spark.cores.max” needs to be set properly as well; see the explanation below the configuration.

spark.shuffle.reduceLocality.enabled false
spark.serializer org.apache.spark.serializer.JavaSerializer
spark.shuffle.blockTransferService nio
spark.databricks.delta.preview.enabled true
spark.executor.cores 2
spark.speculation false
spark.scheduler.minRegisteredResourcesRatio 1.0
spark.scheduler.maxRegisteredResourcesWaitingTime 3600s
spark.cores.max 4
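
“spark.cores.max” caps the total number of cores the application can use across the cluster, so together with “spark.executor.cores” it determines the number of executors; in the example above, 4 total cores at 2 cores per executor yields 4 / 2 = 2 executors. As a quick sanity check (assuming the usual Databricks notebook where the SparkContext is predefined as sc), you can read the effective values back at runtime:

# Read back the effective Spark configuration from the predefined SparkContext
print(sc.getConf().get("spark.executor.cores"))
print(sc.getConf().get("spark.cores.max"))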

5. Run BigDL on Databricks

Open a new notebook, and call init_orca_context at the beginning of your code (with cluster_mode set to “spark-submit”).

from bigdl.orca import init_orca_context, stop_orca_context

# "spark-submit" mode lets Orca run on top of the cluster's existing Spark configuration
init_orca_context(cluster_mode="spark-submit")

Output on Databricks:

../../_images/spark-context.png
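
For reference, a complete notebook cell might look like the sketch below. The RDD computation is only an illustrative sanity check, and using the return value of init_orca_context as a SparkContext is an assumption about its default behavior:

from bigdl.orca import init_orca_context, stop_orca_context

# Initialize Orca; the returned object is used here as the SparkContext
sc = init_orca_context(cluster_mode="spark-submit")

# Illustrative sanity check: run a trivial Spark job on the Databricks cluster
print(sc.parallelize(range(100)).sum())

# Release BigDL resources when the program finishes
stop_orca_context()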

6. Install other third-party libraries on Databricks if necessary

If you want to use other third-party libraries, check the related Databricks documentation: libraries for AWS Databricks and libraries for Azure Databricks.
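
For Python libraries that only need to be available to a single notebook, Databricks also supports notebook-scoped installation with the %pip magic (available on Databricks Runtime 7.1 and above); the package name below is just an example:

%pip install pandas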