Unable to run Linpack on the cluster head node

I recently built my own home cluster out of 4 Raspberry Pis, but I am running into a problem when I try to benchmark all 4 units with Linpack.

One of the nodes is the head node, named rpislave1. It uses its wlan0 interface to connect to the internet and my local wifi network, while its eth0 connects to the cluster's internal LAN.

The other 3 nodes are rpislave2, rpislave3 and rpislave4. Each one connects to the head node, rpislave1, and reaches the internet through it. To keep things simple, these 3 nodes network-boot from a flash drive attached to rpislave1.

All devices are assigned their own IP addresses via DHCP, keyed on their MAC addresses.
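For reference, that kind of static MAC-to-IP mapping is usually one line per host in the DHCP server's config. A minimal sketch, assuming dnsmasq is the DHCP server on rpislave1 (the MAC addresses below are placeholders, not the real ones):

# /etc/dnsmasq.conf on rpislave1 (sketch)
interface=eth0
dhcp-range=192.168.50.10,192.168.50.100,12h
dhcp-host=dc:a6:32:00:00:02,192.168.50.11,rpislave2
dhcp-host=dc:a6:32:00:00:03,192.168.50.12,rpislave3
dhcp-host=dc:a6:32:00:00:04,192.168.50.13,rpislave4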

Here is the head node's /etc/hosts file:

127.0.0.1       localhost
::1             localhost ip6-localhost ip6-loopback
ff02::1         ip6-allnodes
ff02::2         ip6-allrouters

127.0.1.1 cluster

192.168.50.1    rpislave1 cluster
192.168.50.11   rpislave2
192.168.50.12   rpislave3
192.168.50.13   rpislave4

All nodes can be reached from rpislave1 over ssh without a password, and they all share an NFS mount at /sharezone, which is backed by a thumb drive attached to rpislave1.
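For context, a setup like that boils down to one NFS export on rpislave1 plus the head node's public key on every slave. A rough sketch, assuming defaults (only the /sharezone path is taken from the actual setup):

# /etc/exports on rpislave1 (sketch)
/sharezone 192.168.50.0/24(rw,sync,no_subtree_check)

# copy the head node's ssh key to each slave, e.g.
ssh-copy-id user@rpislave2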

I was very happy with the learning experience and decided to benchmark the combined processing power of the cluster (rpislave1, rpislave2, rpislave3 and rpislave4) using HPL, i.e. Linpack: https://www.netlib.org/benchmark/hpl/

I started by installing OpenMPI on the head node, rpislave1.

Running on its own it clocked about 15 GFlops, nothing to brag about of course, but it was fun. I then went on to set up Linpack and OpenMPI on rpislave2 and ran standalone tests on the remaining units (rpislave3 and rpislave4) as well.
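A standalone run like that is just a local mpirun across the 4 cores of one Pi, something along these lines (a sketch, reusing the binary path from the commands further down):

mpirun -np 4 /sharezone/xhpl/bin/xhpl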

So I decided to try running it across 2 nodes, rpislave1 and rpislave2.

Here is the HPL.dat I used for 2 nodes, though I don't think the problem lies in the HPL.dat itself.

HPLinpack benchmark input file
Innovative Computing Laboratory, University of Tennessee
HPL.out      output file name (if any) 
6            device out (6=stdout,7=stderr,file)
1            # of problems sizes (N)
40704         Ns
1            # of NBs
192           NBs
0            PMAP process mapping (0=Row-,1=Column-major)
1            # of process grids (P x Q)
2            Ps
4            Qs
16.0         threshold
1            # of panel fact
2            PFACTs (0=left, 1=Crout, 2=Right)
1            # of recursive stopping criterium
4            NBMINs (>= 1)
1            # of panels in recursion
2            NDIVs
1            # of recursive panel fact.
1            RFACTs (0=left, 1=Crout, 2=Right)
1            # of broadcast
1            BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)
1            # of lookahead depth
1            DEPTHs (>=0)
2            SWAP (0=bin-exch,1=long,2=mix)
64           swapping threshold
0            L1 in (0=transposed,1=no-transposed) form
0            U  in (0=transposed,1=no-transposed) form
1            Equilibration (0=no,1=yes)
8            memory alignment in double (> 0)
##### This line (no. 32) is ignored (it serves as a separator). ######
0                               Number of additional problem sizes for PTRANS
1200 10000 30000                values of N
0                               number of additional blocking sizes for PTRANS
40 9 8 13 13 20 16 32 64        values of NB

I even made a host file to go with it:

user@rpislave1:/sharezone/hpl $ cat host_file
rpislave1 slots=4
rpislave2 slots=4

Here is the command I used:

 time mpirun -hostfile host_file -np 8 /sharezone/xhpl/bin/xhpl

But this is the output I got:

user@rpislave1:/sharezone/hpl $ time mpirun -hostfile host_file -np 8 /sharezone/xhpl/bin/xhpl
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   40704
NB     :     192
PMAP   : Row-major process mapping
P      :       2
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

--------------------------------------------------------------------------
Open MPI detected an inbound MPI TCP connection request from a peer
that appears to be part of this MPI job (i.e., it identified itself as
part of this Open MPI job), but it is from an IP address that is
unexpected.  This is highly unusual.

The inbound connection has been dropped, and the peer should simply
try again with a different IP interface (i.e., the job should
hopefully be able to continue).

  Local host:          rpislave2
  Local PID:           1574
  Peer hostname:       rpislave1 ([[58941,1],2])
  Source IP of socket: 192.168.50.1
  Known IPs of peer:
        169.254.131.47
--------------------------------------------------------------------------

I don't know what is causing this, but I noticed that if I run the Linpack test on rpislave2, rpislave3 or rpislave4, or on any combination of them, it works without any problem.

It is only when the head node, rpislave1, is involved that it refuses to run.

I have already spent a couple of days trying all sorts of things. I suspect Open MPI is picking up the wlan0 interface on the head node, the one that connects to my local wifi network, so I tried "--mca btl_tcp_if_exclude wlan0" and every other kind of mca option (see the sketch after the version listing below), but nothing had any effect. I even went through the GitHub issues, but they all seem to have been fixed already and I should have the latest patches. Here is the OpenMPI version I have:

user@rpislave1:/sharezone/hpl $ sudo apt install openmpi-bin openmpi-common libopenmpi3 libopenmpi-dev
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libopenmpi-dev is already the newest version (4.1.0-10).
libopenmpi3 is already the newest version (4.1.0-10).
openmpi-bin is already the newest version (4.1.0-10).
openmpi-common is already the newest version (4.1.0-10).
0 upgraded, 0 newly installed, 0 to remove and 0 not upgraded.
user@rpislave1:/sharezone/hpl $
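For reference, interface pinning in Open MPI is done through the btl_tcp_if_* and oob_tcp_if_* MCA parameters, so that kind of invocation looks roughly like one of the following (a sketch, not verbatim what I ran, and it did not fix the problem in my case):

mpirun --mca btl_tcp_if_exclude lo,wlan0 -hostfile host_file -np 8 /sharezone/xhpl/bin/xhpl
mpirun --mca btl_tcp_if_include eth0 --mca oob_tcp_if_include eth0 -hostfile host_file -np 8 /sharezone/xhpl/bin/xhpl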

Does anyone know what causes the "Open MPI detected an inbound MPI TCP connection request from a peer that appears to be part of this MPI job" error? I suspect it may be related to the wlan0 interface, because of this part:

 Known IPs of peer:
        169.254.131.47

A traceroute shows this result:

user@rpislave1:/sharezone/hpl $ traceroute 169.254.131.47
traceroute to 169.254.131.47 (169.254.131.47), 30 hops max, 60 byte packets
 1  rpislave1.local (169.254.131.47)  0.192 ms  0.107 ms  0.096 ms
user@rpislave1:/sharezone/hpl $

Here is the ifconfig output for rpislave1, the head node:

user@rpislave1:/sharezone/hpl $ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.50.1  netmask 255.255.255.0  broadcast 192.168.50.255
        inet6 fe80::d314:681c:2e82:d5bc  prefixlen 64  scopeid 0x20<link>
        ether d8:3a:dd:1d:92:15  txqueuelen 1000  (Ethernet)
        RX packets 962575  bytes 911745808 (869.5 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 590397  bytes 382892062 (365.1 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING>  mtu 65536
        inet 127.0.0.1  netmask 255.0.0.0
        inet6 ::1  prefixlen 128  scopeid 0x10<host>
        loop  txqueuelen 1000  (Local Loopback)
        RX packets 3831  bytes 488990 (477.5 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 3831  bytes 488990 (477.5 KiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlan0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.101.15  netmask 255.255.255.0  broadcast 192.168.101.255
        inet6 2001:f40:950:b164:806e:1571:b836:23a4  prefixlen 64  scopeid 0x0<global>
        inet6 fe80::1636:9990:bd05:dd05  prefixlen 64  scopeid 0x20<link>
        ether d8:3a:dd:1d:92:16  txqueuelen 1000  (Ethernet)
        RX packets 44632  bytes 12764596 (12.1 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 74151  bytes 13143926 (12.5 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

user@rpislave1:/sharezone/hpl $

I would greatly appreciate any help with resolving this issue.

Answer 1

I finally found the problem and fixed it.

I found that something was giving eth0 on rpislave1 two IP addresses: 192.168.50.1, which comes from the DHCP server I set up on rpislave1 itself, and 169.254.131.47, which is the address that was tripping up Open MPI during mpirun. The question was: which process was doing this? Since rpislave1 is both the DHCP server for the internal Raspberry Pi network (range 192.168.50.0/24) and the gateway to the outside world, I first suspected net.ipv4.ip_forward=1 and iptables. Some testing showed that neither had any effect.
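A quick way to confirm the double assignment is with iproute2, a sketch:

ip addr show eth0          # lists both 192.168.50.1/24 and the 169.254.131.47 IPv4LL address
ip route show dev eth0     # shows the leftover 169.254.0.0/16 route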

Then I looked into why the Raspberry Pi had 2 IP addresses on the same network interface, and that led me to dhcpcd.service.

user@rpislave1:/etc $ sudo systemctl status  dhcpcd.service
● dhcpcd.service - DHCP Client Daemon
     Loaded: loaded (/lib/systemd/system/dhcpcd.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/dhcpcd.service.d
             └─wait.conf
     Active: active (running) since Fri 2023-07-28 16:22:45 +08; 16min ago
       Docs: man:dhcpcd(8)
    Process: 473 ExecStart=/usr/sbin/dhcpcd -w -q (code=exited, status=0/SUCCESS)
      Tasks: 2 (limit: 8755)
        CPU: 685ms
     CGroup: /system.slice/dhcpcd.service
             ├─564 wpa_supplicant -B -c/etc/wpa_supplicant/wpa_supplicant.conf -iwlan0
             └─765 dhcpcd: [master] [ip4] [ip6]

Jul 28 16:22:42 rpislave1 dhcpcd[473]: eth0: probing for an IPv4LL address
Jul 28 16:22:45 rpislave1 dhcpcd[473]: forked to background, child pid 765
Jul 28 16:22:45 rpislave1 systemd[1]: Started DHCP Client Daemon.
Jul 28 16:22:45 rpislave1 dhcpcd[765]: wlan0: leased 192.168.101.15 for 259200 seconds
Jul 28 16:22:45 rpislave1 dhcpcd[765]: wlan0: adding route to 192.168.101.0/24
Jul 28 16:22:45 rpislave1 dhcpcd[765]: wlan0: adding default route via 192.168.101.1
Jul 28 16:22:47 rpislave1 dhcpcd[765]: eth0: using IPv4LL address 169.254.131.47
Jul 28 16:22:47 rpislave1 dhcpcd[765]: eth0: adding route to 169.254.0.0/16
Jul 28 16:22:50 rpislave1 dhcpcd[765]: eth0: offered 192.168.50.1 from 192.168.50.1
Jul 28 16:22:51 rpislave1 dhcpcd[765]: eth0: no IPv6 Routers available

So I had found the service that assigns itself an IPv4LL address before it even sends a DHCP request (to a DHCP server that, incidentally, lives on the same device). This IPv4LL address and its route are set up before the proper address is obtained from the DHCP server, but the route to 169.254.0.0/16 apparently never gets removed afterwards (I suspect it gets reused for rpislave1.local, but that is beside the point here).
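As a temporary workaround the stray address and route can also be deleted by hand, although they come back after a reboot; a sketch:

sudo ip addr del 169.254.131.47/16 dev eth0
sudo ip route del 169.254.0.0/16 dev eth0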

Looking further into dhcpcd I found this option in its documentation:

Local Link configuration
If dhcpcd failed to obtain a lease, it probes for a valid IPv4LL address (aka ZeroConf, aka APIPA). Once obtained it restarts the process of looking for a DHCP server to get a proper address.

When using IPv4LL, dhcpcd nearly always succeeds and returns an exit code of 0. In the rare case it fails, it normally means that there is a reverse ARP proxy installed which always defeats IPv4LL probing. To disable this behaviour, you can use the -L, --noipv4ll option.

So I edited the wait.conf file in /etc/systemd/system/dhcpcd.service.d (all of these details can be found in the status output above):

[Service]
ExecStart=
ExecStart=/usr/sbin/dhcpcd -w -q --noipv4ll
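For a drop-in override like this to take effect, reloading systemd and restarting the service is enough (a sketch; a full reboot achieves the same):

sudo systemctl daemon-reload
sudo systemctl restart dhcpcd.service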

Restarted and checked dhcpcd again:

user@rpislave1:/etc/systemd/system/dhcpcd.service.d $ sudo systemctl status  dhcpcd.service
● dhcpcd.service - DHCP Client Daemon
     Loaded: loaded (/lib/systemd/system/dhcpcd.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/dhcpcd.service.d
             └─wait.conf
     Active: active (running) since Fri 2023-07-28 16:44:15 +08; 3min 17s ago
       Docs: man:dhcpcd(8)
    Process: 472 ExecStart=/usr/sbin/dhcpcd -w -q --noipv4ll (code=exited, status=0/SUCCESS)
      Tasks: 2 (limit: 8755)
        CPU: 619ms
     CGroup: /system.slice/dhcpcd.service
             ├─568 wpa_supplicant -B -c/etc/wpa_supplicant/wpa_supplicant.conf -iwlan0
             └─775 dhcpcd: [master] [ip4] [ip6]

Jul 28 16:44:10 rpislave1 dhcpcd[472]: wlan0: adding route to 2001:f40:950:b164::/64
Jul 28 16:44:10 rpislave1 dhcpcd[472]: wlan0: requesting DHCPv6 information
Jul 28 16:44:10 rpislave1 dhcpcd[472]: wlan0: adding default route via fe80::101
Jul 28 16:44:15 rpislave1 dhcpcd[472]: wlan0: leased 192.168.101.15 for 259200 seconds
Jul 28 16:44:15 rpislave1 dhcpcd[472]: wlan0: adding route to 192.168.101.0/24
Jul 28 16:44:15 rpislave1 dhcpcd[472]: wlan0: adding default route via 192.168.101.1
Jul 28 16:44:15 rpislave1 dhcpcd[472]: forked to background, child pid 775
Jul 28 16:44:15 rpislave1 systemd[1]: Started DHCP Client Daemon.
Jul 28 16:44:19 rpislave1 dhcpcd[775]: eth0: offered 192.168.50.1 from 192.168.50.1
Jul 28 16:44:21 rpislave1 dhcpcd[775]: eth0: no IPv6 Routers available
user@rpislave1:/etc/systemd/system/dhcpcd.service.d $

And there is no longer any "eth0: using IPv4LL address 169.254.131.47" line, so no route for it gets set up at all.

So I went on to test the whole cluster:

user@rpislave1:/sharezone/ClusterProcessing/HPL $  time mpirun -hostfile ../host_file -np 16 /sharezone/xhpl/bin/xhpl |tee HPL.log
================================================================================
HPLinpack 2.3  --  High-Performance Linpack benchmark  --   December 2, 2018
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   57600
NB     :     192
PMAP   : Row-major process mapping
P      :       4
Q      :       4
PFACT  :   Right
NBMIN  :       4
NDIV   :       2
RFACT  :   Crout
BCAST  :  1ringM
DEPTH  :       1
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

--------------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
      ||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be               1.110223e-16
- Computational tests pass if scaled residuals are less than                16.0

================================================================================
T/V                N    NB     P     Q               Time                 Gflops
--------------------------------------------------------------------------------
WR11C2R4       57600   192     4     4            2558.14             4.9804e+01
HPL_pdgesv() start time Fri Jul 28 16:55:04 2023

HPL_pdgesv() end time   Fri Jul 28 17:37:42 2023

--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)=   3.39874554e-03 ...... PASSED
================================================================================

Finished      1 tests with the following results:
              1 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------

End of Tests.
================================================================================
[rpislave4:00682] PMIX ERROR: NO-PERMISSIONS in file ../../../../../../src/mca/common/dstore/dstore_base.c at line 237
[rpislave3:00688] PMIX ERROR: NO-PERMISSIONS in file ../../../../../../src/mca/common/dstore/dstore_base.c at line 237
[rpislave2:00696] PMIX ERROR: NO-PERMISSIONS in file ../../../../../../src/mca/common/dstore/dstore_base.c at line 237

real    43m33.445s
user    141m36.616s
sys     28m21.561s
user@rpislave1:/sharezone/ClusterProcessing/HPL $

And it works! Now I can run the full benchmark across all 4 units, including the head node!
