Ethernet connection drops when restarting the openibd (InfiniBand) service

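I have several servers, each with an Ethernet controller on board and an InfiniBand controller installed in a PCI slot.

The problem is that when I restart openibd.service (which should only control the InfiniBand adapter), my Ethernet network restarts as well for some reason.

If I stop openibd, my Ethernet stops too.

Ethernet and InfiniBand should be separate and independent of each other.

I need to be able to stop or restart openibd.service without losing the Ethernet connection.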

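OS: AlmaLinux 8.7

Ethernet port in use (1 Gb): eno2np1

OFED version: MLNX_OFED_LINUX-5.9-0.5.6.0

When openibd.service is restarted, the Ethernet connection drops until openibd is running again.
I suspect both cards use the same driver, but I'm not sure how to proceed.

Firmware is up to date on all cards.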

./mlxfwmanager_LeSI_23B_OFED-23.04-1_build4_fw_update_aug_2023 --query:

Querying Mellanox devices firmware ...

Device #1:
----------

  Device Type:      ConnectX4LX
  Part Number:      Lenovo_Ultron_CX4Lx_2P_25GbE_1G-BaseT_Ax
  Description:      Lenovo Ultron ConnectX-4 Lx LOM 25GbE and 1G-BaseT
  PSID:             LNV0000000028
  PCI Device Name:  0000:65:00.0
  Base MAC:         088fc3a3cb9e
  Versions:         Current        Available
     FW             14.32.1010     14.32.1010
     PXE            3.6.0502       3.6.0502
     UEFI           14.25.0017     14.25.0017

  Status:           Up to date

Device #2:
----------

  Device Type:      ConnectX6
  Part Number:      SC57A40943_Ax
  Description:      ThinkSystem Mellanox ConnectX-6 HDR100/100GbE QSFP56 1-port VPI Adapter
  PSID:             LNV0000000016
  PCI Device Name:  0000:17:00.0
  Base GUID:        946dae030049bd14
  Versions:         Current        Available
     FW             20.37.1014     20.37.1014
     PXE            3.7.0102       3.7.0102
     UEFI           14.30.0013     14.30.0013

  Status:           Up to date

ethtool eno2np1:

Settings for eno2np1:
        Supported ports: [  ]
        Supported link modes:   1000baseKX/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: Yes
        Supported FEC modes: None        RS      BASER
        Advertised link modes:  1000baseKX/Full
        Advertised pause frame use: Symmetric
        Advertised auto-negotiation: Yes
        Advertised FEC modes: None       RS      BASER
        Speed: 1000Mb/s
        Duplex: Full
        Auto-negotiation: on
        Port: None
        PHYAD: 0
        Transceiver: internal
        Supports Wake-on: g
        Wake-on: g
        Current message level: 0x00000004 (4)
                               link
        Link detected: yes

ethtool ib0:

Settings for ib0:
        Supported ports: [  ]
        Supported link modes:   Not reported
        Supported pause frame use: No
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes:  Not reported
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 100000Mb/s
        Duplex: Full
        Auto-negotiation: off
        Port: Other
        PHYAD: 0
        Transceiver: internal
        Link detected: yes

lspci -nnn:

17:00.0 Infiniband controller [0207]: Mellanox Technologies MT28908 Family [ConnectX-6] [15b3:101b]
65:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
65:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]

lshw -C network:

  *-network
       description: interface
       product: MT28908 Family [ConnectX-6]
       vendor: Mellanox Technologies
       physical id: 0
       bus info: pci@0000:17:00.0
       logical name: ib0
       version: 00
       serial: 00:00:0a:81:fe:80:00:00:00:00:00:00:94:6d:00:00:00:00:00:00
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress vpd msix pm bus_master cap_list rom physical
       configuration: autonegotiation=off broadcast=yes driver=mlx5_core[ib_ipoib] driverversion=5.9-0.5.5 duplex=full firmware=20.37.1014 (LNV0000000016) ip=192.168.0.3 latency=0 link=yes multicast=yes
       resources: iomemory:21f0-21ef irq:18 memory:21ffc000000-21ffdffffff memory:d4200000-d42fffff
  *-network:0
       description: Ethernet interface
       product: MT27710 Family [ConnectX-4 Lx]
       vendor: Mellanox Technologies
       physical id: 0
       bus info: pci@0000:65:00.0
       logical name: eno1np0
       version: 00
       serial: 08:8f:c3:a3:cb:9e
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=mlx5_core driverversion=5.9-0.5.5 firmware=14.32.1010 (LNV0000000028) latency=0 link=no multicast=yes
       resources: iomemory:24f0-24ef irq:18 memory:24ffc000000-24ffdffffff memory:e3500000-e35fffff memory:24ffe800000-24ffeffffff
  *-network:1
       description: Ethernet interface
       product: MT27710 Family [ConnectX-4 Lx]
       vendor: Mellanox Technologies
       physical id: 0.1
       bus info: pci@0000:65:00.1
       logical name: eno2np1
       version: 00
       serial: 08:8f:c3:a3:cb:9f
       size: 1Gbit/s
       width: 64 bits
       clock: 33MHz
       capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical autonegotiation
       configuration: autonegotiation=on broadcast=yes driver=mlx5_core driverversion=5.9-0.5.5 duplex=full firmware=14.32.1010 (LNV0000000028) ip=10.0.26.3 latency=0 link=yes multicast=yes speed=1Gbit/s
       resources: iomemory:24f0-24ef irq:19 memory:24ffa000000-24ffbffffff memory:e3400000-e34fffff memory:24ffe000000-24ffe7fffff

/var/log/messages:

systemd[1]: Stopping openibd - configure Mellanox devices...
root[8303]: openibd: running in manual mode
systemd[1]: /usr/lib/systemd/system/ibacm.service:22: Unknown lvalue 'ProtectHostname' in section 'Service'
systemd[1]: /usr/lib/systemd/system/ibacm.service:23: Unknown lvalue 'ProtectKernelLogs' in section 'Service'
NetworkManager[1345]: <info>  [1692350943.3204] device (ib0): state change: activated -> unmanaged (reason 'removed', sys-iface-state: 'removed')
dbus-daemon[1341]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.1' (uid=0 pid=1345 comm="/usr/sbin/NetworkManager --no-daemon ")
systemd[1]: Starting Network Manager Script Dispatcher Service...
dbus-daemon[1341]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
systemd[1]: Started Network Manager Script Dispatcher Service.
systemd[1]: Stopping RDMA Node Description Daemon...
systemd[1]: rdma-ndd.service: Succeeded.
systemd[1]: Stopped RDMA Node Description Daemon.
NetworkManager[1345]: <info>  [1692350945.4769] device (eno2np1): state change: activated -> unmanaged (reason 'removed', sys-iface-state: 'removed')
NetworkManager[1345]: <info>  [1692350945.4912] dhcp4 (eno2np1): canceled DHCP transaction
NetworkManager[1345]: <info>  [1692350945.4913] dhcp4 (eno2np1): activation: beginning transaction (timeout in 45 seconds)
NetworkManager[1345]: <info>  [1692350945.4913] dhcp4 (eno2np1): state changed no lease
NetworkManager[1345]: <info>  [1692350945.4926] manager: NetworkManager state is now DISCONNECTED 

What I have tried so far:

Installing a clean OS
Updating the servers' UEFI firmware
Updating the Mellanox firmware and OFED

Answer 1

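As part of restarting openibd.service, the script unloads and reloads the mlx5_core module, which is the PCIe device driver for Mellanox / NVIDIA InfiniBand and Ethernet cards, including both cards listed in the question. The lshw output shows this: ib0, eno1np0, and eno2np1 all report driver=mlx5_core, so unloading the module removes the Ethernet interfaces together with the InfiniBand one, which matches the "removed" state change that NetworkManager logs for eno2np1.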
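
To check which kernel driver backs each network interface, the driver symlinks under sysfs can be read directly. The following is a minimal sketch, not part of OFED: the function name list_net_drivers and its optional directory argument are assumptions introduced here for illustration.

```shell
#!/bin/sh
# Print "<interface> <driver>" for every network interface that is bound
# to a kernel device driver.
# The optional argument (default: /sys/class/net) is a hypothetical
# convenience so the function can be pointed at another directory.
list_net_drivers() {
    root="${1:-/sys/class/net}"
    for dev in "$root"/*; do
        link="$dev/device/driver"
        # Virtual interfaces such as lo have no device/driver symlink.
        [ -L "$link" ] || continue
        # The symlink target ends in the driver name, e.g. .../drivers/mlx5_core
        printf '%s %s\n' "$(basename "$dev")" "$(basename "$(readlink "$link")")"
    done
}

list_net_drivers
```

Given the lshw output in the question, one would expect ib0, eno1np0, and eno2np1 to all be listed with mlx5_core on these servers, which is why stopping or restarting openibd also takes down the Ethernet link.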