Ubuntu 16.04: Slurm srun fails when used with Intel MPI?


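I am trying to set up Slurm on a cluster running Ubuntu 16.04.

I am using Intel MPI, installed on the head node under /opt/intel/impi_5.01.

According to the Slurm MPI guide, I need to export a variable pointing to libpmi.so (the guide names it I_MPI_PMI_LIBRARY): https://slurm.schedmd.com/mpi_guide.html#intel_mpi

However, I installed Slurm from the Ubuntu package slurm-llnl: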

sudo apt-get install slurm-llnl

so I did not know where libpmi.so was. I searched and found the file below. Is this the file I need?

/usr/lib/x86_64-linux-gnu/libpmi.so
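
For reference, here is a minimal sketch of the export step from the Slurm guide. It assumes the path found above is the right library and that it exists at the same location on every node. The variable name I_MPI_PMI_LIBRARY comes from the Slurm MPI guide; PMI_LIB is only a helper name used here.

```shell
# Path taken from the question; adjust it if your distribution installs the file elsewhere.
PMI_LIB=/usr/lib/x86_64-linux-gnu/libpmi.so

# Check that the library exists on this host before pointing Intel MPI at it.
if [ -e "$PMI_LIB" ]; then
    echo "found PMI library: $PMI_LIB"
else
    echo "warning: $PMI_LIB not found on this host" >&2
fi

# Variable name as given in the Slurm MPI guide's Intel MPI section.
export I_MPI_PMI_LIBRARY="$PMI_LIB"
```

The export has to be in effect in the shell that calls srun, since srun passes the submitting environment on to the tasks by default.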

In any case, I exported the variable and tried:

srun -p old -N3 -n24 hostname

It returned:

rolly@head:~$ srun -p old -N3 -n24 hostname
node02
node02
node02
node02
node02
node02
node02
node02
node01
node01
head
head
node01
head
head
head
node01
node01
head
node01
head
head
node01
node01

So task placement appears to work.
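
A quick way to confirm the distribution is to count the hostname lines per node. This is a small sketch; count_tasks_per_node is a helper name made up here, and the sample input stands in for real srun output.

```shell
# Count how many tasks reported each hostname (reads hostnames on stdin).
count_tasks_per_node() {
    sort | uniq -c
}

# With Slurm available, the real check would be:
#   srun -p old -N3 -n24 hostname | count_tasks_per_node
# Sample input so the sketch runs without a cluster:
printf 'node02\nnode01\nhead\nnode02\n' | count_tasks_per_node
```

For the 24 lines above this gives 8 tasks on each of head, node01, and node02.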

But when I run an actual job,

srun -p old -N3 -n24 ~/QE530-CPU/espresso-5.3.0/bin/pw.x

it produces these errors:

mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
  1. no mpd is running on this host
  2. an mpd is running but was started without a "console" (-n option)

I believe the error occurs because mpiexec is being run with Intel MPI, when mpirun should be used instead.

How can I fix this?

Thanks!
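
One way to check that guess is to see which launchers come first in PATH and which MPI library pw.x is linked against. This is only a diagnostic sketch: report_tool is a helper name made up here, and the pw.x path is the one from the srun command above.

```shell
# Print where a command resolves in PATH, or say that it is missing.
report_tool() {
    if resolved=$(command -v "$1" 2>/dev/null); then
        echo "$1: $resolved"
    else
        echo "$1: not found in PATH"
    fi
}

for tool in mpiexec mpirun srun; do
    report_tool "$tool"
done

# Binary path taken from the question.
PW_X="$HOME/QE530-CPU/espresso-5.3.0/bin/pw.x"
if [ -x "$PW_X" ] && command -v ldd >/dev/null 2>&1; then
    # List the MPI-related shared libraries the binary resolves to.
    ldd "$PW_X" | grep -i mpi || echo "no MPI libraries listed by ldd"
else
    echo "skipping ldd check: $PW_X not found or not executable"
fi
```

Running the same checks on the compute nodes (for example through srun) would show whether their environment differs from the head node.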

Answer 1

I found a solution.

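1) Install MPICH: sudo apt-get install mpich

2) Launch the job with srun --mpi=pmi2

3) Make sure the MKL and Intel-related environment variables are loaded correctly.

I hope this helps anyone who runs into a similar problem.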
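
Put together, the steps above might look like the sketch below. It reuses the partition, node and task counts, and pw.x path from the question. I_MPI_ROOT and MKLROOT are the variables the Intel MPI and MKL setup scripts normally set, so they serve here as a rough check for step 3. PW_X and MPI_TYPE are helper names made up for this example.

```shell
# Step 3: warn if the Intel MPI / MKL environment does not appear to be loaded.
[ -n "${I_MPI_ROOT:-}" ] || echo "warning: I_MPI_ROOT is not set" >&2
[ -n "${MKLROOT:-}" ] || echo "warning: MKLROOT is not set" >&2

# Binary path and job geometry taken from the question.
PW_X="$HOME/QE530-CPU/espresso-5.3.0/bin/pw.x"
MPI_TYPE=pmi2

if command -v srun >/dev/null 2>&1; then
    # Show which MPI plugin types this Slurm build supports; pmi2 should be listed.
    srun --mpi=list
    # Step 2: launch through Slurm's PMI2 plugin instead of an MPD-based mpiexec.
    srun --mpi="$MPI_TYPE" -p old -N3 -n24 "$PW_X"
else
    echo "srun not found; would run: srun --mpi=$MPI_TYPE -p old -N3 -n24 $PW_X"
fi
```

If pmi2 does not appear in the srun --mpi=list output, the installed Slurm packages probably lack that plugin and this approach would not apply as written.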
