英特尔 MPI 问题

英特尔 MPI 问题

我刚刚安装了 intel oneApi,并且 intel MPI 已包含在其中。我现在可以使用 编译我的 fortran 代码mpiifort,但是当我尝试使用 intel 运行它时mpirun,仅使用 `` 它会抛出此错误。

libi40iw-i40iw_vmapped_qp: failed to pin memory for SQ
libi40iw-i40iw_ucreate_qp: failed to map QP
libi40iw-i40iw_vmapped_qp: failed to pin memory for SQ
libi40iw-i40iw_ucreate_qp: failed to map QP
libi40iw-i40iw_vmapped_qp: failed to pin memory for SQ
libi40iw-i40iw_ucreate_qp: failed to map QP
libi40iw-i40iw_vmapped_qp: failed to pin memory for SQ
libi40iw-i40iw_ucreate_qp: failed to map QP
[1628187174.213489] [localhost:38060:0]         select.c:406  UCX  ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy
[1628187174.213511] [localhost:38061:0]         select.c:406  UCX  ERROR no active messages transport to <no debug data>: self/self - Destination is unreachable, rdmacm/sockaddr - no am bcopy
Abort(1091215) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1169)..............:
MPIDI_OFI_mpi_init_hook(1909): OFI get address vector map failed
Abort(1091215) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Init: Other MPI error, error stack:
MPIR_Init_thread(138)........:
MPID_Init(1169)..............:
MPIDI_OFI_mpi_init_hook(1909): OFI get address vector map failed

现在,按照这里的建议https://github.com/openucx/ucx/issues/4742#issuecomment-584059909,我设置了这些环境变量export UCX_TLS=ud,sm,self,现在可执行文件运行但也抛出此错误,

libi40iw-i40iw_vmapped_qp: failed to pin memory for SQ
libi40iw-i40iw_ucreate_qp: failed to map QP
libi40iw-i40iw_vmapped_qp: failed to pin memory for SQ
libi40iw-i40iw_ucreate_qp: failed to map QP
libi40iw-i40iw_vmapped_qp: failed to pin memory for SQ
libi40iw-i40iw_ucreate_qp: failed to map QP
libi40iw-i40iw_vmapped_qp: failed to pin memory for SQ
libi40iw-i40iw_ucreate_qp: failed to map QP
[1628186868.761953] [localhost:35174:0]            sys.c:618  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'
[1628186868.793554] [localhost:35173:0]            sys.c:618  UCX  ERROR shmget(size=2097152 flags=0xfb0) for mm_recv_desc failed: Operation not permitted, please check shared memory limits by 'ipcs -l'

输出为ipcs -l


------ Messages Limits --------
max queues system wide = 32000
max size of message (bytes) = 8192
default max size of queue (bytes) = 16384

------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 18014398509465599
max total shared memory (kbytes) = 18014398442373116
min seg size (bytes) = 1

------ Semaphore Limits --------
max number of arrays = 128
max semaphores per array = 250
max semaphores system wide = 32000
max ops per semop call = 32
semaphore max value = 32767

我不明白问题是什么以及如何解决这个问题。有人可以帮我吗?

相关内容