Intel NIC E810-C connection unstable: how to troubleshoot?

This is a follow-up to my earlier question, "NIC is unstable: how to troubleshoot?". The NIC is:

# networkctl -a status
...
● 4: ens6f0                                                                                             
                     Link File: /usr/lib/systemd/network/99-default.link
                  Network File: n/a
                          Type: ether
                         State: n/a (unmanaged)
             Alternative Names: enp24s0f0
                          Path: pci-0000:18:00.0
                        Driver: ice
                        Vendor: Intel Corporation
                         Model: Ethernet Controller E810-C for QSFP (Ethernet Network Adapter E810-C-Q2)
                    HW Address: 64:9d:99:ff:fe:c0 (FS COM INC)
                           MTU: 1500 (min: 68, max: 9702)
                         QDisc: mq
  IPv6 Address Generation Mode: eui64
          Queue Length (Tx/Rx): 320/320
              Auto negotiation: no
                         Speed: 100Gbps
                        Duplex: full
                          Port: fibre
                       Address: 192.168.50.7
                                fe80::669d:99ff:feff:fec0
                       Gateway: 192.168.50.1 (TP-LINK TECHNOLOGIES CO.,LTD.)
Failed to query link DHCP leases: Unit dbus-org.freedesktop.network1.service not found.

The operating system is:

# cat /etc/*release*
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

The motherboard is:

# dmidecode -t baseboard
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.

Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
        Manufacturer: Supermicro
        Product Name: X12SPL-F
        Version: 2.00
        Serial Number: ZM224S007191
        Asset Tag: Base Board Asset Tag
        Features:
                Board is a hosting board
                Board is replaceable
        Location In Chassis: Part Component
        Chassis Handle: 0x0003
        Type: Motherboard
        Contained Object Handles: 0

The NIC sits in the only x16 slot:

Handle 0x000F, DMI type 9, 17 bytes
System Slot Information
        Designation: CPU SLOT6 PCI-E 4.0 X16
        Type: x16 PCI Express 4 x16
        Current Usage: In Use
        Length: Long
        ID: 6
        Characteristics:
                3.3 V is provided
                Opening is shared
                PME signal is supported
        Bus Address: 0000:18:00.0

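The problem is that the NIC keeps dropping its link for no apparent reason, and I don't know how to dig into it any further. So far, all I have is the following from dmesg: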

# dmesg | grep 0000:18:00.0

[    0.754043] pci 0000:18:00.0: [8086:1592] type 00 class 0x020000
[    0.754056] pci 0000:18:00.0: reg 0x10: [mem 0x201ffa000000-0x201ffbffffff 64bit pref]
[    0.754070] pci 0000:18:00.0: reg 0x1c: [mem 0x201ffe010000-0x201ffe01ffff 64bit pref]
[    0.754080] pci 0000:18:00.0: reg 0x30: [mem 0xbb600000-0xbb6fffff pref]
[    0.754166] pci 0000:18:00.0: reg 0x184: [mem 0x201ffd000000-0x201ffd01ffff 64bit pref]
[    0.754168] pci 0000:18:00.0: VF(n) BAR0 space: [mem 0x201ffd000000-0x201ffdffffff 64bit pref] (contains BAR0 for 128 VFs)
[    0.754179] pci 0000:18:00.0: reg 0x190: [mem 0x201ffe220000-0x201ffe223fff 64bit pref]
[    0.754180] pci 0000:18:00.0: VF(n) BAR3 space: [mem 0x201ffe220000-0x201ffe41ffff 64bit pref] (contains BAR3 for 128 VFs)
[    0.754429] pci 0000:18:00.0: 126.016 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x16 link at 0000:17:02.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[    0.800984] pci 0000:18:00.0: CLS mismatch (64 != 32), using 64 bytes
[    1.369098] pci 0000:18:00.0: Adding to iommu group 31
[    1.819150] ice 0000:18:00.0: firmware: failed to load intel/ice/ddp/ice-e20070ffffd99fd0.pkg (-2)
[    1.819589] ice 0000:18:00.0: firmware: direct-loading firmware intel/ice/ddp/ice.pkg
[    2.140744] ice 0000:18:00.0: The DDP package was successfully loaded: ICE OS Default Package version 1.3.30.0
[    2.211858] ice 0000:18:00.0: PTP init successful
[    2.616387] ice 0000:18:00.0: DCB is enabled in the hardware, max number of TCs supported on this port are 8
[    2.616387] ice 0000:18:00.0: FW LLDP is disabled, DCBx/LLDP in SW mode.
[    2.616492] ice 0000:18:00.0: Commit DCB Configuration to the hardware
[    2.618380] ice 0000:18:00.0: 126.016 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x16 link at 0000:17:02.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[    2.621272] ice 0000:18:00.0 eth0: A parallel fault was detected.
[    2.621365] ice 0000:18:00.0 eth0: Possible Solution: Check link partner connection and configuration.
[    2.621513] ice 0000:18:00.0 eth0: Port Number: 1.
[    3.331319] ice 0000:18:00.0 ens6f0: renamed from eth0
[ 1052.057728] ice 0000:18:00.0 ens6f0: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: Off, Autoneg Negotiated: False, Flow Control: None
[2304065.370537] ice 0000:18:00.0 ens6f0: NIC Link is Down
[2304065.470757] ice 0000:18:00.0 ens6f0: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: Off, Autoneg Negotiated: False, Flow Control: None
[6567288.755539] ice 0000:18:00.0 ens6f0: Changing Rx descriptor count from 2048 to 8160
[10043828.294404] ice 0000:18:00.0 ens6f0: NIC Link is Down
[10043828.394033] ice 0000:18:00.0 ens6f0: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: Off, Autoneg Negotiated: False, Flow Control: None
[10198013.280727] ice 0000:18:00.0 ens6f0: NIC Link is Down
[10198013.381243] ice 0000:18:00.0 ens6f0: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: Off, Autoneg Negotiated: False, Flow Control: None
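
To put numbers on the Down/up pairs above, here is a rough sketch that pairs each "NIC Link is Down" line with the next "NIC Link is up" line and prints how long the link was gone. It assumes the default "[seconds.microseconds]" dmesg timestamp prefix and the ice driver wording shown above; the function name link_flaps is made up for illustration.

```shell
#!/bin/sh
# Summarize link flaps from dmesg text on stdin.
# Assumption: lines start with "[seconds.micros]" and use the ice
# driver's "NIC Link is Down" / "NIC Link is up" wording.
link_flaps() {
    awk '
        /NIC Link is Down/ {
            ts = $0
            sub(/^\[ */, "", ts)      # drop the leading "[" and padding
            sub(/\].*/, "", ts)       # drop everything after the timestamp
            down = ts; have_down = 1
            next
        }
        /NIC Link is up/ && have_down {
            ts = $0
            sub(/^\[ */, "", ts)
            sub(/\].*/, "", ts)
            printf "down at %ss, back after %.3fs\n", down, ts - down
            have_down = 0; n++
        }
        END { printf "%d flap(s)\n", n }
    '
}

# Usage: dmesg | grep 0000:18:00.0 | link_flaps
```

The initial "Link is up" at boot is skipped because no "Down" line precedes it.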

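But I don't believe this is the real problem: it doesn't seem to happen often enough to explain what I'm seeing, and the link comes back up in less than a second. The problems do seem to be related to network connectivity, for example: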

root@pluto:/home/comind# ping knox
PING knox.comind.io (192.168.50.7) 56(84) bytes of data.
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=1 ttl=64 time=0.476 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=2 ttl=64 time=0.542 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=3 ttl=64 time=0.521 ms
...
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=26 ttl=64 time=0.544 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=27 ttl=64 time=0.554 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=34 ttl=64 time=0.539 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=35 ttl=64 time=0.402 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=36 ttl=64 time=0.539 ms
...

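The interruptions (each about as long as the gap between icmp_seq=27 and icmp_seq=34) seem to last around 7 seconds and occur frequently. I see something similar in terminal sessions: keyboard input seems to stall for a few seconds and then shows up on the terminal; sometimes characters are ... Lastly, NFS shares from this server are affected by the same delays.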
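
To log these gaps instead of spotting them by eye, a small filter over live ping output can report every jump in icmp_seq. This is only a sketch: the function name ping_gaps is invented here, and it assumes the Linux iputils reply format shown above. With the default one-second interval, each missing reply corresponds to roughly one second of outage.

```shell
#!/bin/sh
# Report jumps in icmp_seq from ping output on stdin.
# Assumption: replies contain "icmp_seq=<n>" as in the iputils output above.
ping_gaps() {
    awk '
        match($0, /icmp_seq=[0-9]+/) {
            # "icmp_seq=" is 9 characters long; keep only the digits
            seq = substr($0, RSTART + 9, RLENGTH - 9) + 0
            if (seen && seq > prev + 1)
                printf "lost %d replies between icmp_seq=%d and icmp_seq=%d\n", seq - prev - 1, prev, seq
            prev = seq; seen = 1
        }
    '
}

# Usage: ping knox | ping_gaps
```

Feed it live output rather than the abbreviated excerpt above, since the "..." lines there would be reported as gaps too.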

The NFS service is provided by Ganesha V3.4, and its log contains a number of lines such as:

13/01/2023 01:09:46 : epoch 63a6c5e3 : knox : ganesha.nfsd-3365103[svc_946] rpc :TIRPC :EVENT :svc_ioq_flushv: 0x7fc37422f1b0 fd 10798 msg_iov 0x7fc2da2e0f60 sendmsg remaining 112 result -1 error Broken pipe (32)
13/01/2023 06:26:54 : epoch 63a6c5e3 : knox : ganesha.nfsd-3365103[svc_887] rpc :TIRPC :EVENT :svc_ioq_flushv: 0x7fc2190609f0 fd 10386 msg_iov 0x7fc447406f60 sendmsg remaining 112 result -1 error Broken pipe (32)
13/01/2023 08:06:33 : epoch 63a6c5e3 : knox : ganesha.nfsd-3365103[svc_967] rpc :TIRPC :EVENT :svc_ioq_flushv: 0x7fc1f42aec90 fd 10387 msg_iov 0x7fc2d8ac8f60 sendmsg remaining 112 result -1 error Broken pipe (32)
13/01/2023 08:36:01 : epoch 63a6c5e3 : knox : ganesha.nfsd-3365103[svc_967] rpc :TIRPC :EVENT :svc_ioq_flushv: 0x7fc11c5ee4c0 fd 10388 msg_iov 0x7fc2d8ac8f60 sendmsg remaining 112 result -1 error Broken pipe (32)
13/01/2023 08:38:04 : epoch 63a6c5e3 : knox : ganesha.nfsd-3365103[svc_1032] rpc :TIRPC :EVENT :svc_ioq_flushv: 0x7fc134b4f480 fd 10394 msg_iov 0x7fc38cde1f60 sendmsg remaining 112 result -1 error Broken pipe (32)
13/01/2023 10:55:53 : epoch 63a6c5e3 : knox : ganesha.nfsd-3365103[svc_1032] rpc :TIRPC :EVENT :svc_vc_wait: 0x7fc1e8074320 fd 10603 recv errno 104 (will set dead)

Again, there are not enough errors in the log to explain the frequent delays.

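To me this is clearly a network problem. The server is connected to an N5860-48SC switch from FS, but unfortunately I don't know enough about troubleshooting on the switch. I would appreciate any help, insight, or suggestions on how to resolve this.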

Answer 1

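When a link is unstable, especially over fiber, a very good indicator is whether you are seeing local faults or remote faults.

Look at the counters reported by: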

ethtool -S ens6f0

and look for something like this:

$ ethtool -S ens259f0 |grep fault
     mac_local_faults.nic: 0
     mac_remote_faults.nic: 0
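
As a sketch of how that check could be scripted, the filter below prints only the fault counters that are non-zero and returns a non-zero exit status when all of them are 0. The function name nonzero_faults is hypothetical, and it assumes the "name: value" layout shown in the output above.

```shell
#!/bin/sh
# Print fault counters with a non-zero value from `ethtool -S` text on stdin.
# Exit status: 0 if at least one fault counter is non-zero, 1 otherwise.
# Assumption: counters are printed as "name: value", as in the output above.
nonzero_faults() {
    awk -F': *' '
        /fault/ && $2 + 0 != 0 {
            gsub(/^[ \t]+/, "", $1)   # strip the leading indentation
            print $1 "=" $2
            found = 1
        }
        END { exit found ? 0 : 1 }
    '
}

# Usage: ethtool -S ens6f0 | nonzero_faults && echo "faults recorded"
```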

If nothing shows up there, grab the output of

ethtool -m ens6f0
ethtool -S ens6f0
ethtool -i ens6f0
devlink dev info

and double-check that you are running the latest available firmware/NVM image.
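
If it helps, the four commands can be captured into a single report to attach to a support case. This is a minimal sketch; collect_nic_diag is a made-up helper name, and it simply runs the commands listed above in order and keeps going if one of them fails (for example, when the module does not support `ethtool -m`).

```shell
#!/bin/sh
# Run the diagnostic commands listed above for one interface and print
# each output under a header line. A failing command is noted, not fatal.
collect_nic_diag() {
    iface=$1
    for cmd in "ethtool -m $iface" "ethtool -S $iface" "ethtool -i $iface" "devlink dev info"; do
        echo "### $cmd"
        # $cmd is left unquoted on purpose so it splits into command + arguments
        $cmd 2>&1 || echo "(command failed with status $?)"
        echo
    done
}

# Usage (as root): collect_nic_diag ens6f0 > nic-diag.txt
```

The firmware-version line in the `ethtool -i` section is the one to compare against the latest NVM release.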

The last place to look when troubleshooting is the switch log itself, to see whether it indicates local (switch-side) or remote (E810-side) faults.

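If the E810 shows local faults, troubleshooting should lead you to contact support with some of the information gathered above. A lot of things could be wrong, but following the basic steps above should help isolate some of them.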

Answer 2

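When you run ethtool -m, check whether any alarms are raised and whether the RX/TX power levels are within range; for the range, look at the alarm thresholds.

The thresholds may differ from one SFP module to another.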
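
As a rough way to scan for raised flags, the filter below lists every alarm or warning line whose value is "On". It assumes the module dump uses "label : On/Off" lines, which is an assumption about the output format: the exact labels vary with module type and ethtool version, and module_alarms is a hypothetical name. The power readings still need to be compared against the thresholds by eye.

```shell
#!/bin/sh
# List alarm/warning flags that are raised in `ethtool -m` text on stdin.
# Assumption: flags are printed as "<label> : On" or "<label> : Off";
# threshold lines carry a numeric value and are therefore ignored.
module_alarms() {
    awk -F' *: *' '
        tolower($1) ~ /alarm|warning/ && $NF == "On" {
            sub(/^[ \t]+/, "", $1)    # strip the leading indentation
            print $1
        }
    '
}

# Usage (as root): ethtool -m ens6f0 | module_alarms
```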

Answer 3

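I finally found the problem: dust in the optics. Somehow, someone (naming no names, but it is a person very close to me!) managed to pull the fiber cable and plug it back in without cleaning the optics. What an idiot. After a careful cleaning, everything works perfectly. We live and learn.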
