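I have several servers, each with an Ethernet controller and an InfiniBand controller installed in a PCI slot.
The problem is that when I restart openibd.service (which should only control the InfiniBand adapter), my Ethernet network restarts as well for some reason.
If I stop openibd, my Ethernet stops too.
Ethernet and InfiniBand should be separate and independent of each other.
I need to be able to stop or restart openibd.service without dropping the Ethernet connection.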
OS: AlmaLinux 8.7
Ethernet port in use (1Gb): eno2np1
OFED version: MLNX_OFED_LINUX-5.9-0.5.6.0
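When openibd.service is restarted, the Ethernet connection drops until openibd is running again.
I suspect both cards use the same driver, but I am not sure how to proceed.
The firmware on all cards has been updated.
./mlxfwmanager_LeSI_23B_OFED-23.04-1_build4_fw_update_aug_2023 --query: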
Querying Mellanox devices firmware ...
Device #1:
----------
Device Type: ConnectX4LX
Part Number: Lenovo_Ultron_CX4Lx_2P_25GbE_1G-BaseT_Ax
Description: Lenovo Ultron ConnectX-4 Lx LOM 25GbE and 1G-BaseT
PSID: LNV0000000028
PCI Device Name: 0000:65:00.0
Base MAC: 088fc3a3cb9e
Versions: Current Available
FW 14.32.1010 14.32.1010
PXE 3.6.0502 3.6.0502
UEFI 14.25.0017 14.25.0017
Status: Up to date
Device #2:
----------
Device Type: ConnectX6
Part Number: SC57A40943_Ax
Description: ThinkSystem Mellanox ConnectX-6 HDR100/100GbE QSFP56 1-port VPI Adapter
PSID: LNV0000000016
PCI Device Name: 0000:17:00.0
Base GUID: 946dae030049bd14
Versions: Current Available
FW 20.37.1014 20.37.1014
PXE 3.7.0102 3.7.0102
UEFI 14.30.0013 14.30.0013
Status: Up to date
ethtool eno2np1:
Settings for eno2np1:
Supported ports: [ ]
Supported link modes: 1000baseKX/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: None RS BASER
Advertised link modes: 1000baseKX/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: None RS BASER
Speed: 1000Mb/s
Duplex: Full
Auto-negotiation: on
Port: None
PHYAD: 0
Transceiver: internal
Supports Wake-on: g
Wake-on: g
Current message level: 0x00000004 (4)
link
Link detected: yes
ethtool ib0:
Settings for ib0:
Supported ports: [ ]
Supported link modes: Not reported
Supported pause frame use: No
Supports auto-negotiation: No
Supported FEC modes: Not reported
Advertised link modes: Not reported
Advertised pause frame use: No
Advertised auto-negotiation: No
Advertised FEC modes: Not reported
Speed: 100000Mb/s
Duplex: Full
Auto-negotiation: off
Port: Other
PHYAD: 0
Transceiver: internal
Link detected: yes
lspci -nnn:
17:00.0 Infiniband controller [0207]: Mellanox Technologies MT28908 Family [ConnectX-6] [15b3:101b]
65:00.0 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
65:00.1 Ethernet controller [0200]: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] [15b3:1015]
lshw -C network:
*-network
description: interface
product: MT28908 Family [ConnectX-6]
vendor: Mellanox Technologies
physical id: 0
bus info: pci@0000:17:00.0
logical name: ib0
version: 00
serial: 00:00:0a:81:fe:80:00:00:00:00:00:00:94:6d:00:00:00:00:00:00
width: 64 bits
clock: 33MHz
capabilities: pciexpress vpd msix pm bus_master cap_list rom physical
configuration: autonegotiation=off broadcast=yes driver=mlx5_core[ib_ipoib] driverversion=5.9-0.5.5 duplex=full firmware=20.37.1014 (LNV0000000016) ip=192.168.0.3 latency=0 link=yes multicast=yes
resources: iomemory:21f0-21ef irq:18 memory:21ffc000000-21ffdffffff memory:d4200000-d42fffff
*-network:0
description: Ethernet interface
product: MT27710 Family [ConnectX-4 Lx]
vendor: Mellanox Technologies
physical id: 0
bus info: pci@0000:65:00.0
logical name: eno1np0
version: 00
serial: 08:8f:c3:a3:cb:9e
width: 64 bits
clock: 33MHz
capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical autonegotiation
configuration: autonegotiation=on broadcast=yes driver=mlx5_core driverversion=5.9-0.5.5 firmware=14.32.1010 (LNV0000000028) latency=0 link=no multicast=yes
resources: iomemory:24f0-24ef irq:18 memory:24ffc000000-24ffdffffff memory:e3500000-e35fffff memory:24ffe800000-24ffeffffff
*-network:1
description: Ethernet interface
product: MT27710 Family [ConnectX-4 Lx]
vendor: Mellanox Technologies
physical id: 0.1
bus info: pci@0000:65:00.1
logical name: eno2np1
version: 00
serial: 08:8f:c3:a3:cb:9f
size: 1Gbit/s
width: 64 bits
clock: 33MHz
capabilities: pciexpress vpd msix pm bus_master cap_list rom ethernet physical autonegotiation
configuration: autonegotiation=on broadcast=yes driver=mlx5_core driverversion=5.9-0.5.5 duplex=full firmware=14.32.1010 (LNV0000000028) ip=10.0.26.3 latency=0 link=yes multicast=yes speed=1Gbit/s
resources: iomemory:24f0-24ef irq:19 memory:24ffa000000-24ffbffffff memory:e3400000-e34fffff memory:24ffe000000-24ffe7fffff
/var/log/messages:
systemd[1]: Stopping openibd - configure Mellanox devices...
root[8303]: openibd: running in manual mode
systemd[1]: /usr/lib/systemd/system/ibacm.service:22: Unknown lvalue 'ProtectHostname' in section 'Service'
systemd[1]: /usr/lib/systemd/system/ibacm.service:23: Unknown lvalue 'ProtectKernelLogs' in section 'Service'
NetworkManager[1345]: <info> [1692350943.3204] device (ib0): state change: activated -> unmanaged (reason 'removed', sys-iface-state: 'removed')
dbus-daemon[1341]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by ':1.1' (uid=0 pid=1345 comm="/usr/sbin/NetworkManager --no-daemon ")
systemd[1]: Starting Network Manager Script Dispatcher Service...
dbus-daemon[1341]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
systemd[1]: Started Network Manager Script Dispatcher Service.
systemd[1]: Stopping RDMA Node Description Daemon...
systemd[1]: rdma-ndd.service: Succeeded.
systemd[1]: Stopped RDMA Node Description Daemon.
NetworkManager[1345]: <info> [1692350945.4769] device (eno2np1): state change: activated -> unmanaged (reason 'removed', sys-iface-state: 'removed')
NetworkManager[1345]: <info> [1692350945.4912] dhcp4 (eno2np1): canceled DHCP transaction
NetworkManager[1345]: <info> [1692350945.4913] dhcp4 (eno2np1): activation: beginning transaction (timeout in 45 seconds)
NetworkManager[1345]: <info> [1692350945.4913] dhcp4 (eno2np1): state changed no lease
NetworkManager[1345]: <info> [1692350945.4926] manager: NetworkManager state is now DISCONNECTED
What I have tried so far:
Installing a clean OS
Updating the server's UEFI firmware
Updating the Mellanox firmware and OFED
Answer 1
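As part of the openibd.service restart process, the script unloads and reloads the mlx5_core module, which is the PCIe device driver for Mellanox / NVIDIA InfiniBand and Ethernet cards (including both cards listed in the question). That is why eno2np1 disappears along with ib0: the lshw output above shows driver=mlx5_core for both interfaces, so removing the module removes both.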
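
To confirm which interfaces share a driver on a given host, one option is to read the driver symlinks that the kernel exposes under /sys/class/net. The following is a minimal Python sketch, not part of OFED: the function name interfaces_by_driver and its sysfs_net parameter are made up for illustration, and it assumes the standard sysfs layout in which /sys/class/net/<iface>/device/driver is a symlink to the bound driver.

```python
import os
from collections import defaultdict


def interfaces_by_driver(sysfs_net="/sys/class/net"):
    """Group network interfaces by the kernel driver bound to their device.

    Hypothetical helper for illustration; it only reads sysfs symlinks.
    """
    groups = defaultdict(list)
    if not os.path.isdir(sysfs_net):
        return {}
    for iface in sorted(os.listdir(sysfs_net)):
        driver_link = os.path.join(sysfs_net, iface, "device", "driver")
        # Virtual interfaces such as lo have no backing device, so skip them.
        if not os.path.islink(driver_link):
            continue
        # The symlink target ends in the driver name, e.g. .../drivers/mlx5_core
        driver = os.path.basename(os.readlink(driver_link))
        groups[driver].append(iface)
    return dict(groups)


if __name__ == "__main__":
    for driver, ifaces in sorted(interfaces_by_driver().items()):
        print(f"{driver}: {', '.join(ifaces)}")
```

Any interfaces listed together under mlx5_core will go down together whenever that module is unloaded, whichever service triggers the unload.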