Slurm on AWS returns slurmstepd: error: execve(): No such file or directory

2024-6-19

I set up a burstable, event-driven HPC cluster on AWS using Slurm, following this tutorial.

With this setup, I can burst instances on EC2 from the Slurm environment and run jobs. After running the following batch script:

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1
#SBATCH --constraint=[us-east-1a]

sinfo returns:

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
all*         up   infinite      2   idle ip-10-0-1-[6-7]
gpu          up   infinite      2   idle ip-10-0-1-[6-7]

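When I run a system command such as hostname, I get a response from the nodes, but when I try to run a simple custom executable, such as a helloworld program written in C: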

$ srun --export=ALL -N 2 -n 2 ./helloworld

It returns:

Exited with exit code 2
slurmstepd: error: execve(): /home/centos/./helloworld: No such file or directory.

What do I need to set up to submit my custom job correctly?

Answer 1

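Unlike the clusters I had worked with before, this one does not automatically copy the executable to all nodes. I had to tell srun explicitly to do so: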

srun --export=ALL --bcast=/home/centos/helloworld -N 2 -n 2 helloworld

This copies the executable to the nodes and runs it there. Alternatively, you can use sbcast in a Bash script.
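For the sbcast route, a batch script could look roughly like the sketch below. It reuses the node and task counts from the question. The /tmp/helloworld destination and the stage_and_run helper name are my own assumptions for illustration, so adjust them to your setup. sbcast only works inside a job allocation, so the script checks SLURM_JOB_ID before doing anything.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
#SBATCH --cpus-per-task=1

# Source path on the submit host (taken from the error message in the question).
SRC=/home/centos/helloworld
# Destination on each compute node's local disk (assumed location).
DEST=/tmp/helloworld

# Hypothetical helper: copy the binary to every allocated node, then launch it.
stage_and_run() {
    # sbcast SOURCE DEST broadcasts the file to all nodes of the current job;
    # --force overwrites a copy left behind by an earlier run.
    sbcast --force "$SRC" "$DEST" || return 1
    # Each task now runs the node-local copy.
    srun "$DEST"
}

# Slurm sets SLURM_JOB_ID inside an allocation; skip the work otherwise.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    stage_and_run
fi
```

Saved as, say, run_hello.sh (a made-up name), it would be submitted with sbatch run_hello.sh. The copy happens once per node, regardless of how many tasks each node runs.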
