Running distributed PyTorch with OpenMPI across LAN and virtual LAN nodes

I have two Ubuntu nodes with distributed PyTorch and heterogeneous OpenMPI built from source. Both nodes can reach each other over passwordless SSH and share an NFS directory (home/nvidia/shared) that contains a simple PyTorch script (distmpi.py) to be executed via OpenMPI.

Node-1 (xps): a desktop PC on the LAN, with IP 192.168.201.23 on interface enp4s0.
Node-2 (hetero): a VM in OpenStack, with the virtual IP 11.11.11.21 on the vLAN interface ens3 and the floating IP 192.168.200.151.
(A quick reachability check from node-1 is sketched below.)
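To sanity-check which of node-2's addresses are actually reachable from node-1, something like the following can be run on xps (a minimal sketch; the addresses and the user name nvidia are the ones from the setup above):

# On node-1 (xps): probe both of node-2's addresses
ping -c 2 192.168.200.151              # floating IP assigned by OpenStack (used in -H below)
ping -c 2 11.11.11.21                  # internal vLAN IP (the address Open MPI later tries to connect to)
ssh nvidia@192.168.200.151 hostname    # confirm passwordless SSH over the floating IP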

When launching mpirun from xps to run 2 processes (one on 192.168.201.23 and the other on 192.168.200.151), I get the following error:

(torch) nvidia@xps:~$ mpirun -v -np 2 -H 192.168.201.23:1,192.168.200.151 torch/bin/python shared/distmpi.py
--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected.  This is highly unusual.

The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).

  Local host:          xps
  Local PID:           7113
  Peer hostname:       192.168.200.151 ([[55343,1],1])
  Source IP of socket: 192.168.200.151
  Known IPs of peer:   
    11.11.11.21
--------------------------------------------------------------------------
[xps][[55343,1],0][btl_tcp_endpoint.c:796:mca_btl_tcp_endpoint_complete_connect] connect() to 11.11.11.21 failed: Connection timed out (110)
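For completeness, this is the kind of interface restriction I have been looking at as a possible workaround. It is only a sketch based on Open MPI's btl_tcp_if_include and oob_tcp_if_include MCA parameters; enp4s0 and ens3 are the interface names from the node descriptions above, and I have not confirmed that this resolves the address mismatch:

# restrict both the runtime (oob) and MPI point-to-point (btl) TCP traffic to the named interfaces
mpirun -np 2 -H 192.168.201.23:1,192.168.200.151 \
       --mca oob_tcp_if_include enp4s0,ens3 \
       --mca btl_tcp_if_include enp4s0,ens3 \
       torch/bin/python shared/distmpi.py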

The python script, distmpi.py, is shown below for reference:

#!/usr/bin/env python
import os
import socket
import torch
import torch.distributed as dist
from torch.multiprocessing import Process


def run(rank, size):
    tensor = torch.zeros(size)
    print(f"I am {rank} of {size} with tensor {tensor}")

    # incrementing the old tensor
    tensor += 1

    # sending tensor to next rank
    if rank == size-1:
        dist.send(tensor=tensor, dst=0)
    else:
        dist.send(tensor=tensor, dst=rank+1)

    # receiving tensor from previous rank
    if rank == 0:
        dist.recv(tensor=tensor, src=size-1)
    else:
        dist.recv(tensor=tensor, src=rank-1)

    print('Rank ', rank, ' has data ', tensor[0])


def init_processes(rank, size, hostname, fn, backend='mpi'):
    """ Initialize the distributed environment. """
    dist.init_process_group(backend, rank=rank, world_size=size)
    fn(rank, size)


if __name__ == "__main__":
    world_size = int(os.environ['OMPI_COMM_WORLD_SIZE'])
    world_rank = int(os.environ['OMPI_COMM_WORLD_RANK'])
    hostname = socket.gethostname()
    init_processes(world_rank, world_size, hostname, run, backend='mpi')

Regards.
