I wrote an example.jar
program that uses a Spark context. How can I run it on a cluster that uses Slurm? This is related to https://stackoverflow.com/questions/29308202/running-spark-on-top-of-slurm, but the answers there are not very detailed and are not on Server Fault.
Answer 1
To run an application with a Spark context, you first need to run a Slurm job which starts a master and some workers. There are a few things to watch out for with Slurm:
- Don't start Spark as a daemon
- Make the Spark workers use only as many cores and as much memory as requested for the Slurm job
- In order to run the master and a worker in the same job, you will have to branch somewhere in the script (see the minimal sketch after this list)
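A minimal sketch of that branching, distilled from the full script below: inside the step launched by srun, $SLURM_PROCID decides each task's role (assuming one task per node; <master host> is a placeholder).

# Sketch only: task 0 of the srun step becomes the master, every other task a worker
if [ "$SLURM_PROCID" -eq 0 ]; then
    "$SPARK_ROOT/bin/spark-class" org.apache.spark.deploy.master.Master
else
    "$SPARK_ROOT/bin/spark-class" org.apache.spark.deploy.worker.Worker "spark://<master host>:7077"
fi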
I'm using the Linux binaries, installed to $HOME/spark-1.5.2-bin-hadoop2.6/. Remember to replace <username> and <shared folder> in the script with some valid values.
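If you don't have those binaries yet, something like this fetches and unpacks them into $HOME (assuming the 1.5.2 / Hadoop 2.6 build is still available from the Apache archive):

cd "$HOME"
wget https://archive.apache.org/dist/spark/spark-1.5.2/spark-1.5.2-bin-hadoop2.6.tgz
tar -xzf spark-1.5.2-bin-hadoop2.6.tgz   # creates $HOME/spark-1.5.2-bin-hadoop2.6/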
#!/bin/bash
#start_spark_slurm.sh
#SBATCH --nodes=3
# ntasks per node MUST be one, because multiple workers per node don't
# work well with slurm + spark in this script (they would need increasing
# ports among other things)
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=500
# Beware! $HOME will not be expanded and invalid paths will result in Slurm jobs
# hanging indefinitely with status CG (completing) when calling scancel!
#SBATCH --output="/home/<username>/spark/logs/%j.out"
#SBATCH --error="/home/<username>/spark/logs/%j.err"
#SBATCH --time=01:00:00
# This section will be run when started by sbatch
if [ "$1" != 'srunning' ]; then
this=$0
# I experienced problems with some nodes not finding the script:
# slurmstepd: execve(): /var/spool/slurm/job123/slurm_script:
# No such file or directory
# that's why this script is being copied to a shared location to which
# all nodes have access to:
script=/<shared folder>/${SLURM_JOBID}_$( basename -- "$0" )
cp "$this" "$script"
# This might not be necessary on all clusters
module load scala/2.10.4 java/jdk1.7.0_25 cuda/7.0.28
export sparkLogs=$HOME/spark/logs
export sparkTmp=$HOME/spark/tmp
mkdir -p -- "$sparkLogs" "$sparkTmp"
export SPARK_ROOT=$HOME/spark-1.5.2-bin-hadoop2.6/
export SPARK_WORKER_DIR=$sparkLogs
export SPARK_LOCAL_DIRS=$sparkLogs
export SPARK_MASTER_PORT=7077
export SPARK_MASTER_WEBUI_PORT=8080
export SPARK_WORKER_CORES=$SLURM_CPUS_PER_TASK
export SPARK_DAEMON_MEMORY=$(( $SLURM_MEM_PER_CPU * $SLURM_CPUS_PER_TASK / 2 ))m
export SPARK_MEM=$SPARK_DAEMON_MEMORY
srun "$script" 'srunning'
# If run by srun, then decide by $SLURM_PROCID whether we are master or worker
else
source "$SPARK_ROOT/sbin/spark-config.sh"
source "$SPARK_PREFIX/bin/load-spark-env.sh"
if [ "$SLURM_PROCID" -eq 0 ]; then
export SPARK_MASTER_IP=$( hostname )
MASTER_NODE=$( scontrol show hostname $SLURM_NODELIST | head -n 1 )
# The saved IP address + port is necessary alter for submitting jobs
echo "spark://$SPARK_MASTER_IP:$SPARK_MASTER_PORT" > "$sparkLogs/${SLURM_JOBID}_spark_master"
"$SPARK_ROOT/bin/spark-class" org.apache.spark.deploy.master.Master \
--ip "$SPARK_MASTER_IP" \
--port "$SPARK_MASTER_PORT " \
--webui-port "$SPARK_MASTER_WEBUI_PORT"
else
# $(scontrol show hostname) is used to convert e.g. host20[39-40]
# to host2039 this step assumes that SLURM_PROCID=0 corresponds to
# the first node in SLURM_NODELIST !
MASTER_NODE=spark://$( scontrol show hostname $SLURM_NODELIST | head -n 1 ):7077
"$SPARK_ROOT/bin/spark-class" org.apache.spark.deploy.worker.Worker $MASTER_NODE
fi
fi
Now start the sbatch job, and then submit example.jar:
mkdir -p -- "$HOME/spark/logs"
jobid=$( sbatch ./start_spark_slurm.sh )
jobid=${jobid##Submitted batch job }
MASTER_WEB_UI=''
while [ -z "$MASTER_WEB_UI" ]; do
sleep 1s
if [ -f "$HOME/spark/logs/$jobid.err" ]; then
MASTER_WEB_UI=$( sed -n -r 's|.*Started MasterWebUI at (http://[0-9.:]*)|\1|p' "$HOME/spark/logs/$jobid.err" )
fi
done
MASTER_ADDRESS=$( cat -- "$HOME/spark/logs/${jobid}_spark_master" )
"$HOME/spark-1.5.2-bin-hadoop2.6/bin/spark-submit" --master "$MASTER_ADDRESS" example.jar
firefox "$MASTER_WEB_UI"
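Note that spark-submit only stops your application; the Slurm job running the master and workers keeps the nodes until its time limit. Assuming you have no further applications to submit to this standalone cluster, you can release the nodes explicitly:

# Tear down the Spark master/worker job once you are done with the cluster
scancel "$jobid"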
Answer 2
As maxmlnkn's answer points out, you need a mechanism to set up/launch the appropriate Spark daemons in a Slurm allocation before a Spark jar can be executed via spark-submit.
Several scripts/systems have been developed that do this setup for you. The answer you linked above mentions Magpie @ https://github.com/LLNL/magpie (full disclosure: I'm the developer/maintainer of those scripts). Magpie provides a job submission file (submission-scripts/script-sbatch-srun/magpie.sbatch-srun-spark) that you edit to fill in your cluster details and the job script you want to execute. Once it is configured, you submit it via "sbatch -k ./magpie.sbatch-srun-spark". See doc/README.spark for more details.
I'll mention that there are other scripts/systems that can do this for you as well. I lack experience with them, so I can't comment on them beyond linking them below.