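I am trying to install Slurm on a cluster running Ubuntu 16.04.
I am using Intel MPI, which is installed on the head node under /opt/intel/impi_5.01.
According to the Slurm instructions, an environment variable pointing to libpmi.so must be exported: https://slurm.schedmd.com/mpi_guide.html#intel_mpi
However, I installed Slurm from the Ubuntu slurm-llnl package:
sudo apt-get install slurm-llnl
so I did not know where libpmi.so was. I searched for it and found the file below. Is this the file I am looking for?
/usr/lib/x86_64-linux-gnu/libpmi.so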
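For reference, the export described in the linked Slurm guide can be sketched roughly as follows. The variable name I_MPI_PMI_LIBRARY comes from that guide. The path is the one found by the search above and is only an assumption about where the package puts the library; PMI_LIB is a helper name used just for this illustration.

```shell
# Path found by the search above; it may differ on other systems (assumption).
PMI_LIB=/usr/lib/x86_64-linux-gnu/libpmi.so

# Export the variable only if the library actually exists at that path.
if [ -e "$PMI_LIB" ]; then
    export I_MPI_PMI_LIBRARY="$PMI_LIB"
    echo "I_MPI_PMI_LIBRARY=$I_MPI_PMI_LIBRARY"
else
    echo "libpmi.so not found at $PMI_LIB" >&2
fi
```

The variable has to be set in the shell that launches srun so that the launched tasks inherit it.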
In any case, I exported the variable and tried:
srun -p old -N3 -n24 hostname
It returns:
rolly@head:~$ srun -p old -N3 -n24 hostname
node02
node02
node02
node02
node02
node02
node02
node02
node01
node01
head
head
node01
head
head
head
node01
node01
head
node01
head
head
node01
node01
So it appears to be working.
But when I run an actual job,
srun -p old -N3 -n24 ~/QE530-CPU/espresso-5.3.0/bin/pw.x
it produces these errors:
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node02: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
mpiexec_node01: cannot connect to local mpd (/tmp/mpd2.console_rolly); possible causes:
1. no mpd is running on this host
2. an mpd is running but was started without a "console" (-n option)
I believe the error occurs because mpiexec is being run with Intel MPI, where mpirun should be used instead.
How can I fix this?
Thanks!
Answer 1
I found the solution.
1) sudo apt-get install mpich
2) Run the job with srun --mpi=pmi2
3) Make sure the MKL- and Intel-related environment variables are loaded correctly.
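Step 2 might look like the sketch below when applied to the job from the question. Combining --mpi=pmi2 with the original partition, node count, and binary is an assumption, since the answer only gives the flag itself. MPI_TYPE is a helper name used just for this illustration. The srun --mpi=list call prints the MPI plugin types the local Slurm build supports, which is a quick way to confirm that pmi2 is available before launching.

```shell
# PMI flavour to request from Slurm (taken from step 2 above).
MPI_TYPE=pmi2

# Run only where Slurm is installed; elsewhere just report that srun is missing.
if command -v srun >/dev/null 2>&1; then
    # Show the MPI plugin types this Slurm build supports.
    srun --mpi=list
    # Same job as in the question, with the PMI2 plugin selected explicitly
    # (partition, node count, and binary path are carried over as an assumption).
    srun --mpi="$MPI_TYPE" -p old -N3 -n24 ~/QE530-CPU/espresso-5.3.0/bin/pw.x
else
    echo "srun not found on this machine" >&2
fi
```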
I hope this helps anyone who runs into a similar problem.