Hadoop/YARN User Guide#
Hadoop version: Apache Hadoop >= 2.7 (3.X included) or CDH 5.X. CDH 6.X has not been tested and is thus currently not supported.
For Scala users, please see Scala User Guide for how to run BigDL on Hadoop/YARN clusters.
For Python users, you can run BigDL programs on standard Hadoop/YARN clusters without any changes to the cluster (i.e., no need to pre-install BigDL or other Python libraries on all nodes in the cluster).
1. Prepare Python Environment#
You need to first use conda to prepare the Python environment on the local machine where you submit your application. Create a conda environment and install BigDL and all the needed Python libraries in it:

```bash
conda create -n bigdl python=3.7  # "bigdl" is the conda environment name; you can use any name you like.
conda activate bigdl
pip install bigdl

# Use conda or pip to install all the needed Python dependencies in the created conda environment.
```
View the Python User Guide for more details on BigDL installation.
You need to download and install a JDK in the environment, and properly set the environment variable `JAVA_HOME`, which is required by Spark. JDK 8 is highly recommended.

You may take the following commands as a reference for installing OpenJDK:

```bash
# For Ubuntu
sudo apt-get install openjdk-8-jre
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/

# For CentOS
su -c "yum install java-1.8.0-openjdk"
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.282.b08-1.el7_9.x86_64/jre

export PATH=$PATH:$JAVA_HOME/bin
java -version  # Verify the version of JDK.
```
Check the Hadoop setup and configurations of your cluster. Make sure you properly set the environment variable `HADOOP_CONF_DIR`, which is needed to initialize Spark on YARN:

```bash
export HADOOP_CONF_DIR=/path/to/hadoop/conf  # the directory containing the Hadoop and YARN configuration files
```
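As a quick sanity check, you can verify that `HADOOP_CONF_DIR` actually points at a configuration directory. This is a sketch assuming the standard Hadoop configuration file names (`core-site.xml` and `yarn-site.xml`); adjust if your distribution differs:

```shell
# Check that HADOOP_CONF_DIR contains the standard Hadoop/YARN config files.
if [ -f "${HADOOP_CONF_DIR}/core-site.xml" ] && [ -f "${HADOOP_CONF_DIR}/yarn-site.xml" ]; then
    echo "HADOOP_CONF_DIR looks valid: ${HADOOP_CONF_DIR}"
else
    echo "WARNING: ${HADOOP_CONF_DIR} does not contain core-site.xml / yarn-site.xml" >&2
fi
```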
For CDH users:

If Spark is already installed on your CDH cluster, CDH's Spark might conflict with the pyspark installed by pip, which is required by BigDL. Thus before running BigDL applications, you should unset all the Spark related environment variables. You can use `env | grep SPARK` to find all the existing Spark environment variables.

Also, a CDH cluster's `HADOOP_CONF_DIR` should be `/etc/hadoop/conf` by default.
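For example, you might list and clear the reported variables like this. The variable names below are common examples only, not an exhaustive list; unset whatever `env | grep SPARK` actually shows on your machine:

```shell
# List the Spark-related variables currently set in your shell.
env | grep SPARK || true

# Unset each variable reported above before running BigDL applications;
# these names are illustrative examples.
unset SPARK_HOME
unset SPARK_CONF_DIR
unset SPARK_DIST_CLASSPATH
```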
2. Run on YARN with built-in function#
This is the easiest and most recommended way to run BigDL on YARN, as you don't need to worry about environment preparation or Spark-related commands. In this way, you can easily switch your job between local (for test) and YARN (for production) by simply changing the `cluster_mode` argument.
Call `init_orca_context` at the very beginning of your code to initiate and run BigDL on standard Hadoop/YARN clusters:

```python
from bigdl.orca import init_orca_context

sc = init_orca_context(cluster_mode="yarn-client", cores=4, memory="10g", num_nodes=2)
```

`init_orca_context` would automatically prepare the runtime Python environment, detect the current Hadoop configurations from `HADOOP_CONF_DIR` and initiate the distributed execution engine on the underlying YARN cluster. View Orca Context for more details.

By specifying `cluster_mode` to be `yarn-client` or `yarn-cluster`, `init_orca_context` will submit the job to YARN with client or cluster mode respectively.

The difference between `yarn-client` and `yarn-cluster` is where you run your Spark driver. For `yarn-client`, the Spark driver will run on the node where you start Python, while for `yarn-cluster` the Spark driver will run on a random node in the YARN cluster. So if you are running with `yarn-cluster`, you should change the application's data loading from local files to a network file system (e.g. HDFS).

You can then simply run your BigDL program in a Jupyter notebook. Note that Jupyter cannot run on `yarn-cluster`, as the driver is not running on the local node.

```bash
jupyter notebook --notebook-dir=./ --ip=* --no-browser
```

Or you can run your BigDL program as a normal Python script (e.g. `script.py`), in which case both `yarn-client` and `yarn-cluster` are supported:

```bash
python script.py
```
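Since only `cluster_mode` changes between local testing and YARN production, one convenient pattern is to build the `init_orca_context` arguments from an environment variable. The sketch below is our own convention, not part of the BigDL API; the `BIGDL_CLUSTER_MODE` variable name and the resource values are illustrative assumptions:

```python
import os

def orca_context_kwargs(cluster_mode=None):
    """Build keyword arguments for init_orca_context, switching between
    local (for test) and YARN (for production). The BIGDL_CLUSTER_MODE
    environment variable and the resource values are illustrative
    assumptions, not part of the BigDL API."""
    mode = cluster_mode or os.environ.get("BIGDL_CLUSTER_MODE", "local")
    kwargs = {"cluster_mode": mode, "cores": 4, "memory": "10g"}
    if mode in ("yarn-client", "yarn-cluster"):
        # num_nodes is only meaningful when running on the YARN cluster.
        kwargs["num_nodes"] = 2
    return kwargs

# In your actual program you would then call, for example:
# from bigdl.orca import init_orca_context
# sc = init_orca_context(**orca_context_kwargs())
```

You could then run the same script locally by default, or against the cluster with `BIGDL_CLUSTER_MODE=yarn-client python script.py`.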
3. Run on YARN with spark-submit#
Follow the steps below if you need to run BigDL with `spark-submit`.
Pack the current active conda environment to `environment.tar.gz` (you can use any name you like) in the current working directory:

```bash
conda pack -o environment.tar.gz
```
You need to write your BigDL program as a Python script. In the script, call `init_orca_context` at the very beginning of your code and specify `cluster_mode` to be `spark-submit`:

```python
from bigdl.orca import init_orca_context

sc = init_orca_context(cluster_mode="spark-submit")
```
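Putting it together, a minimal `script.py` might look like the sketch below. The `main` wrapper and the sample computation are illustrative assumptions, not a required structure; `stop_orca_context` is called at the end to release the cluster resources. Add a call to `main()` at the bottom of the file when you submit it:

```python
# script.py -- a minimal skeleton for submission via spark-submit (a sketch).

def main():
    # Imported inside main() so this file can be inspected on a machine
    # without BigDL installed.
    from bigdl.orca import init_orca_context, stop_orca_context

    # With cluster_mode="spark-submit", the cluster resources (memory, cores,
    # number of executors, archives, ...) are taken from the spark-submit
    # command line instead of from init_orca_context arguments.
    sc = init_orca_context(cluster_mode="spark-submit")
    try:
        # Your distributed BigDL/Spark code goes here; the line below is just
        # an illustrative sanity computation on the returned SparkContext.
        print("sum of 0..99 =", sc.range(100).sum())
    finally:
        # Release cluster resources even if the job fails.
        stop_orca_context()
```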
Use `spark-submit` to submit your BigDL program (e.g. `script.py`). You can adjust the configurations according to your cluster settings. Note that if `environment.tar.gz` is not under the same directory as `script.py`, you may need to modify its path in `--archives` in the running command below.

Set up the environment variables:

```bash
export SPARK_HOME=/path/to/spark  # the folder path where you extract the Spark package
export SPARK_VERSION="downloaded spark version"
export BIGDL_HOME=/path/to/unzipped_BigDL
export BIGDL_VERSION="downloaded BigDL version"
```
For `yarn-cluster` mode:

```bash
${SPARK_HOME}/bin/spark-submit \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
    --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
    --jars ${BIGDL_HOME}/jars/bigdl-assembly-spark_${SPARK_VERSION}-${BIGDL_VERSION}-jar-with-dependencies.jar \
    --master yarn \
    --deploy-mode cluster \
    --executor-memory 10g \
    --driver-memory 10g \
    --executor-cores 8 \
    --num-executors 2 \
    --archives environment.tar.gz#environment \
    script.py
```
Note: For `yarn-cluster`, the Spark driver runs in a YARN container as well, and thus both the driver and the executors will use the Python interpreter in `environment.tar.gz`. If you want to operate HDFS as a certain user, you can add `spark.yarn.appMasterEnv.HADOOP_USER_NAME=username` to the Spark configuration.

For `yarn-client` mode:

```bash
${SPARK_HOME}/bin/spark-submit \
    --conf spark.pyspark.driver.python=/path/to/python \
    --conf spark.pyspark.python=environment/bin/python \
    --jars ${BIGDL_HOME}/jars/bigdl-assembly-spark_${SPARK_VERSION}-${BIGDL_VERSION}-jar-with-dependencies.jar \
    --master yarn \
    --deploy-mode client \
    --executor-memory 10g \
    --driver-memory 10g \
    --executor-cores 8 \
    --num-executors 2 \
    --archives environment.tar.gz#environment \
    script.py
```
Note: For `yarn-client`, the Spark driver runs locally and will use the Python interpreter in the current active conda environment, while the executors will use the Python interpreter in `environment.tar.gz`.
4. Run on YARN with bigdl-submit#
Follow the steps below if you need to run BigDL with `bigdl-submit`.
Pack the current active conda environment to `environment.tar.gz` (you can use any name you like) in the current working directory:

```bash
conda pack -o environment.tar.gz
```
You need to write your BigDL program as a Python script. In the script, call `init_orca_context` at the very beginning of your code and specify `cluster_mode` to be `bigdl-submit`:

```python
from bigdl.orca import init_orca_context

sc = init_orca_context(cluster_mode="bigdl-submit")
```
Use `bigdl-submit` to submit your BigDL program (e.g. `script.py`). You can adjust the configurations according to your cluster settings. Note that if `environment.tar.gz` is not under the same directory as `script.py`, you may need to modify its path in `--archives` in the running command below.

For `yarn-cluster` mode:

```bash
bigdl-submit \
    --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=environment/bin/python \
    --conf spark.executorEnv.PYSPARK_PYTHON=environment/bin/python \
    --master yarn \
    --deploy-mode cluster \
    --executor-memory 10g \
    --driver-memory 10g \
    --executor-cores 8 \
    --num-executors 2 \
    --archives environment.tar.gz#environment \
    script.py
```
Note: For `yarn-cluster`, the Spark driver runs in a YARN container as well, and thus both the driver and the executors will use the Python interpreter in `environment.tar.gz`. If you want to operate HDFS as a certain user, you can add `spark.yarn.appMasterEnv.HADOOP_USER_NAME=username` to the Spark configuration.

For `yarn-client` mode:

```bash
PYSPARK_PYTHON=environment/bin/python bigdl-submit \
    --master yarn \
    --deploy-mode client \
    --executor-memory 10g \
    --driver-memory 10g \
    --executor-cores 8 \
    --num-executors 2 \
    --archives environment.tar.gz#environment \
    script.py
```
Note: For `yarn-client`, the Spark driver runs locally and will use the Python interpreter in the current active conda environment, while the executors will use the Python interpreter in `environment.tar.gz`.
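When running in `yarn-cluster` mode (with either `spark-submit` or `bigdl-submit`), the driver's output appears in the YARN container logs rather than in your local terminal. You can fetch the logs with the standard YARN CLI; the application id below is a placeholder printed during submission:

```shell
# Replace <application_id> with the id printed when the job was submitted,
# e.g. application_XXXXXXXXXXXXX_XXXX.
yarn logs -applicationId <application_id>
```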