I am working with an HPC cluster made up of a master machine and four slave machines (slave1, slave2, slave3 and slave4), and I am trying to run a script across the whole cluster:
mpirun -report-uri - -host master,slave1,slave2,slave3,slave4 --map-by node -np 50 hellompi
but I get this error message:
657129472.0;tcp://10.1.1.1,10.1.2.1,10.1.3.1,10.1.4.1:54761
[charlotte-ProLiant-DL380-Gen10-slave1:07172] [[10027,0],1] tcp_peer_send_blocking: send() to socket 9 failed: Broken pipe (32)
ORTE was unable to reliably start one or more daemons.
This usually is caused by:
* not finding the required libraries and/or binaries on
one or more nodes. Please check your PATH and LD_LIBRARY_PATH
settings, or configure OMPI with --enable-orterun-prefix-by-default
* lack of authority to execute on one or more specified nodes.
Please verify your allocation and authorities.
* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
Please check with your sys admin to determine the correct location to use.
* compilation of the orted with dynamic libraries when static are required
(e.g., on Cray). Please check your configure cmd line and consider using
one of the contrib/platform definitions for your system type.
* an inability to create a connection back to mpirun due to a
lack of common network interfaces and/or no route found between
them. Please check network connectivity (including firewalls
and network routing requirements).
I am working on Ubuntu. The firewall (ufw) is disabled on every machine. SSH login works, including in passwordless mode. The mpirun version is the same on every machine. iptables is enabled.
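If it helps narrow down which of the causes listed above applies, the run can be repeated with the launcher and out-of-band layers set to verbose (a diagnostic sketch only; I have not pasted its output here), and the presence of orted on each slave can be checked over SSH:
mpirun --mca plm_base_verbose 10 --mca oob_base_verbose 10 -host master,slave1,slave2,slave3,slave4 --map-by node -np 50 hellompi
ssh slave1 which orted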
The program I am trying to run is a simple Fortran code:
program hello
include 'mpif.h'
integer rank, size, ierror, nl
character(len=MPI_MAX_PROCESSOR_NAME) :: hostname
call MPI_INIT(ierror)
call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
call MPI_GET_PROCESSOR_NAME(hostname, nl, ierror)
print*, 'node', rank, ' of', size, ' on ', hostname(1:nl), ': Hello world'
call MPI_FINALIZE(ierror)
end
It works if I run on just a couple of the nodes:
mpirun -report-uri - --mca oob_tcp_if_include 10.1.1.0/24 -host master,slave1 --map-by node -np 4 hellompi
4211277824.0;tcp://10.1.1.1:49281
node 0 of 4 on charlotte-ProLiant-DL380-Gen10-master: Hello world
node 2 of 4 on charlotte-ProLiant-DL380-Gen10-master: Hello world
node 1 of 4 on charlotte-ProLiant-DL380-Gen10-slave1: Hello world
node 3 of 4 on charlotte-ProLiant-DL380-Gen10-slave1: Hello world
The same works for master-slave2, master-slave3 and master-slave4.
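Since each pair only works once I restrict oob_tcp_if_include to the subnet that actually links the two machines, a variant I still have to try for the full five-node run (just a guess on my part, not a confirmed fix) is to list all four point-to-point subnets at once:
mpirun -report-uri - --mca oob_tcp_if_include 10.1.1.0/24,10.1.2.0/24,10.1.3.0/24,10.1.4.0/24 -host master,slave1,slave2,slave3,slave4 --map-by node -np 50 hellompi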
On the master, ifconfig shows:
eno1      Link encap:Ethernet  HWaddr 54:80:28:57:0f:7e
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:16
eno2      Link encap:Ethernet  HWaddr 54:80:28:57:0f:7f
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:17
eno3      Link encap:Ethernet  HWaddr 54:80:28:57:0f:80
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:16
eno4      Link encap:Ethernet  HWaddr 54:80:28:57:0f:81
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:17
eno5      Link encap:Ethernet  HWaddr 80:30:e0:31:b1:68
          inet addr:10.1.3.1  Bcast:10.1.3.255  Mask:255.255.255.0
          inet6 addr: fe80::5cfb:416e:a702:7582/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1038 errors:0 dropped:0 overruns:0 frame:0
          TX packets:966 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:186531 (186.5 KB)  TX bytes:106391 (106.3 KB)
          Interrupt:32 Memory:e7800000-e7ffffff
eno6      Link encap:Ethernet  HWaddr 80:30:e0:31:b1:6c
          inet addr:10.1.4.1  Bcast:10.1.4.255  Mask:255.255.255.0
          inet6 addr: fe80::9451:8431:7010:46/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:873 errors:0 dropped:0 overruns:0 frame:0
          TX packets:844 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:86934 (86.9 KB)  TX bytes:72778 (72.7 KB)
          Interrupt:144 Memory:e8800000-e8ffffff
ens2f0    Link encap:Ethernet  HWaddr 20:67:7c:06:5f:a8
          inet addr:10.1.1.1  Bcast:10.1.1.255  Mask:255.255.255.0
          inet6 addr: fe80::39c2:fdd5:930e:c253/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:2195 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1425 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:1332614 (1.3 MB)  TX bytes:200100 (200.1 KB)
          Interrupt:28 Memory:e3000000-e37fffff
ens2f1    Link encap:Ethernet  HWaddr 20:67:7c:06:5f:ac
          inet addr:10.1.2.1  Bcast:10.1.2.255  Mask:255.255.255.0
          inet6 addr: fe80::91f5:53ce:378a:686e/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1644 errors:0 dropped:0 overruns:0 frame:0
          TX packets:1385 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:379968 (379.9 KB)  TX bytes:211904 (211.9 KB)
          Interrupt:123 Memory:e4000000-e47fffff
ens5f0    Link encap:Ethernet  HWaddr 20:67:7c:06:5f:a0
          inet addr:10.0.0.2  Bcast:10.1.0.255  Mask:255.255.255.0
          inet6 addr: fe80::52e5:a943:831d:35f5/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:9821 errors:0 dropped:0 overruns:0 frame:0
          TX packets:9230 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:983759 (983.7 KB)  TX bytes:2599111 (2.5 MB)
          Interrupt:34 Memory:f0000000-f07fffff
ens5f1    Link encap:Ethernet  HWaddr 20:67:7c:06:5f:a4
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)
          Interrupt:165 Memory:f1000000-f17fffff
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:230476 errors:0 dropped:0 overruns:0 frame:0
          TX packets:230476 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:411579801 (411.5 MB)  TX bytes:411579801 (411.5 MB)
The master reaches slave1 through 10.1.1.1 (ens2f0), slave2 through 10.1.2.1 (ens2f1), slave3 through 10.1.3.1 (eno5) and slave4 through 10.1.4.1 (eno6).
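With this wiring, each slave is directly connected only to the master, on its own /24. I have not yet verified that the slaves can reach one another; a quick check would be something like the following (the slave-side addresses such as 10.1.2.2 are an assumption about my addressing scheme, not taken from the output above):
ssh slave1 ping -c 1 10.1.2.2    # slave1 -> slave2, only succeeds if traffic is routed through the master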
Here is the information about my Open MPI installation:
ompi_info
Package: Open MPI buildd@lgw01-57 Distribution
Open MPI: 1.10.2
Open MPI repo revision: v1.10.1-145-g799148f
Open MPI release date: Jan 21, 2016
Open RTE: 1.10.2
Open RTE repo revision: v1.10.1-145-g799148f
Open RTE release date: Jan 21, 2016
OPAL: 1.10.2
OPAL repo revision: v1.10.1-145-g799148f
OPAL release date: Jan 21, 2016
MPI API: 3.0.0
Ident string: 1.10.2
Prefix: /usr
Configured architecture: x86_64-pc-linux-gnu
Configure host: lgw01-57
Configured by: buildd
Configured on: Thu Feb 25 16:33:01 UTC 2016
Configure host: lgw01-57
Built by: buildd
Built on: Thu Feb 25 16:40:59 UTC 2016
Built host: lgw01-57
C bindings: yes
C++ bindings: yes
Fort mpif.h: yes (all)
Fort use mpi: yes (full: ignore TKR)
Fort use mpi size: deprecated-ompi-info-value
Fort use mpi_f08: yes
Fort mpi_f08 compliance: The mpi_f08 module is available, but due to
limitations in the gfortran compiler, does not
support the following: array subsections, direct
passthru (where possible) to underlying Open MPI's
C functionality
Fort mpi_f08 subarrays: no
Java bindings: no
Wrapper compiler rpath: runpath
C compiler: gcc
C compiler absolute: /usr/bin/gcc
C compiler family name: GNU
C compiler version: 5.3.1
C++ compiler: g++
C++ compiler absolute: /usr/bin/g++
Fort compiler: gfortran
Fort compiler abs: /usr/bin/gfortran
Fort ignore TKR: yes (!GCC$ ATTRIBUTES NO_ARG_CHECK ::)
Fort 08 assumed shape: yes
Fort optional args: yes
Fort INTERFACE: yes
Fort ISO_FORTRAN_ENV: yes
Fort STORAGE_SIZE: yes
Fort BIND(C) (all): yes
Fort ISO_C_BINDING: yes
Fort SUBROUTINE BIND(C): yes
Fort TYPE,BIND(C): yes
Fort T,BIND(C,name="a"): yes
Fort PRIVATE: yes
Fort PROTECTED: yes
Fort ABSTRACT: yes
Fort ASYNCHRONOUS: yes
Fort PROCEDURE: yes
Fort USE...ONLY: yes
Fort C_FUNLOC: yes
Fort f08 using wrappers: yes
Fort MPI_SIZEOF: yes
C profiling: yes
C++ profiling: yes
Fort mpif.h profiling: yes
Fort use mpi profiling: yes
Fort use mpi_f08 prof: yes
C++ exceptions: no
Thread support: posix (MPI_THREAD_MULTIPLE: no, OPAL support: yes,
OMPI progress: no, ORTE progress: yes, Event lib:
yes)
Sparse Groups: no
Internal debug support: no
MPI interface warnings: yes
MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
dl support: yes
Heterogeneous support: yes
mpirun default --prefix: no
MPI I/O support: yes
MPI_WTIME support: gettimeofday
Symbol vis. support: yes
Host topology support: yes
MPI extensions:
FT Checkpoint support: no (checkpoint thread: no)
C/R Enabled Debugging: no
VampirTrace support: no
MPI_MAX_PROCESSOR_NAME: 256
MPI_MAX_ERROR_STRING: 256
MPI_MAX_OBJECT_NAME: 64
MPI_MAX_INFO_KEY: 36
MPI_MAX_INFO_VAL: 256
MPI_MAX_PORT_NAME: 1024
MPI_MAX_DATAREP_STRING: 128
MCA backtrace: execinfo (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA compress: gzip (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA compress: bzip (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA crs: none (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA db: print (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA db: hash (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA dl: dlopen (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA event: libevent2021 (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA hwloc: external (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA if: posix_ipv4 (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA if: linux_ipv6 (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA installdirs: env (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA installdirs: config (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA memory: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA pstat: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA sec: basic (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA shmem: posix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA shmem: mmap (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA shmem: sysv (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA timer: linux (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA dfs: app (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA dfs: test (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA dfs: orted (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA errmgr: default_tool (MCA v2.0.0, API v3.0.0, Component
v1.10.2)
MCA errmgr: default_app (MCA v2.0.0, API v3.0.0, Component
v1.10.2)
MCA errmgr: default_orted (MCA v2.0.0, API v3.0.0, Component
v1.10.2)
MCA errmgr: default_hnp (MCA v2.0.0, API v3.0.0, Component
v1.10.2)
MCA ess: singleton (MCA v2.0.0, API v3.0.0, Component
v1.10.2)
MCA ess: slurm (MCA v2.0.0, API v3.0.0, Component v1.10.2)
MCA ess: env (MCA v2.0.0, API v3.0.0, Component v1.10.2)
MCA ess: tool (MCA v2.0.0, API v3.0.0, Component v1.10.2)
MCA ess: hnp (MCA v2.0.0, API v3.0.0, Component v1.10.2)
MCA filem: raw (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA grpcomm: bad (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA iof: tool (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA iof: hnp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA iof: orted (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA iof: mr_hnp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA iof: mr_orted (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA odls: default (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA oob: tcp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA plm: isolated (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA plm: rsh (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA plm: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA ras: gridengine (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA ras: loadleveler (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA ras: slurm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA ras: simulator (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA rmaps: round_robin (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA rmaps: mindist (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA rmaps: seq (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA rmaps: ppr (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA rmaps: rank_file (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA rmaps: staged (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA rmaps: resilient (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA rml: oob (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA routed: radix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA routed: debruijn (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA routed: direct (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA routed: binomial (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA state: orted (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA state: app (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA state: dvm (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA state: staged_hnp (MCA v2.0.0, API v1.0.0, Component
v1.10.2)
MCA state: tool (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA state: hnp (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA state: staged_orted (MCA v2.0.0, API v1.0.0, Component
v1.10.2)
MCA state: novm (MCA v2.0.0, API v1.0.0, Component v1.10.2)
MCA allocator: bucket (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA allocator: basic (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA bcol: basesmuma (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA bcol: ptpcoll (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA bml: r2 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA btl: vader (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA btl: tcp (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA btl: openib (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA btl: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA btl: self (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA coll: tuned (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA coll: self (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA coll: hierarch (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA coll: basic (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA coll: libnbc (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA coll: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA coll: ml (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA coll: inter (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA dpm: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA fbtl: posix (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA fcoll: ylib (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA fcoll: individual (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA fcoll: static (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA fcoll: dynamic (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA fcoll: two_phase (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA fs: ufs (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA io: ompio (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA io: romio (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA mpool: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA mpool: grdma (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA osc: pt2pt (MCA v2.0.0, API v3.0.0, Component v1.10.2)
MCA osc: sm (MCA v2.0.0, API v3.0.0, Component v1.10.2)
MCA pml: v (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA pml: ob1 (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA pml: cm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA pml: bfo (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA pubsub: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA rcache: vma (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA rte: orte (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA sbgp: basesmsocket (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA sbgp: basesmuma (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA sbgp: p2p (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA sharedfp: sm (MCA v2.0.0, API v2.0.0, Component v1.10.2)
MCA sharedfp: individual (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA sharedfp: lockedfile (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
MCA topo: basic (MCA v2.0.0, API v2.1.0, Component v1.10.2)
MCA vprotocol: pessimist (MCA v2.0.0, API v2.0.0, Component
v1.10.2)
Any ideas?
Answer 1
For anyone who runs into a similar problem: I partially solved this with the help of the OMPI team. See https://github.com/open-mpi/ompi/issues/6293 for details. The issue is now locked.
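As a rough sketch of the general direction only (my own summary under the star-wiring assumption described in the question; the linked issue has the authoritative details): because every slave sits on its own point-to-point subnet to the master, processes and daemons on different slaves cannot open TCP connections to each other unless the master forwards between those subnets and each slave has routes to the other subnets, along these lines (slave-side addressing assumed; adapt to your own setup):
# on the master: forward traffic between the 10.1.x.0/24 links
sudo sysctl -w net.ipv4.ip_forward=1
# on slave1: reach the other subnets via the master's address on slave1's link
sudo ip route add 10.1.2.0/24 via 10.1.1.1
sudo ip route add 10.1.3.0/24 via 10.1.1.1
sudo ip route add 10.1.4.0/24 via 10.1.1.1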