This is a follow-up to my earlier question, "NIC is unstable: how do I troubleshoot it?". The NIC is:
# networkctl -a status
...
● 4: ens6f0
Link File: /usr/lib/systemd/network/99-default.link
Network File: n/a
Type: ether
State: n/a (unmanaged)
Alternative Names: enp24s0f0
Path: pci-0000:18:00.0
Driver: ice
Vendor: Intel Corporation
Model: Ethernet Controller E810-C for QSFP (Ethernet Network Adapter E810-C-Q2)
HW Address: 64:9d:99:ff:fe:c0 (FS COM INC)
MTU: 1500 (min: 68, max: 9702)
QDisc: mq
IPv6 Address Generation Mode: eui64
Queue Length (Tx/Rx): 320/320
Auto negotiation: no
Speed: 100Gbps
Duplex: full
Port: fibre
Address: 192.168.50.7
fe80::669d:99ff:feff:fec0
Gateway: 192.168.50.1 (TP-LINK TECHNOLOGIES CO.,LTD.)
Failed to query link DHCP leases: Unit dbus-org.freedesktop.network1.service not found.
The operating system is:
# cat /etc/*release*
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
The motherboard is:
# dmidecode -t baseboard
# dmidecode 3.3
Getting SMBIOS data from sysfs.
SMBIOS 3.3.0 present.
Handle 0x0002, DMI type 2, 15 bytes
Base Board Information
Manufacturer: Supermicro
Product Name: X12SPL-F
Version: 2.00
Serial Number: ZM224S007191
Asset Tag: Base Board Asset Tag
Features:
Board is a hosting board
Board is replaceable
Location In Chassis: Part Component
Chassis Handle: 0x0003
Type: Motherboard
Contained Object Handles: 0
The NIC is in the only 16-lane slot:
Handle 0x000F, DMI type 9, 17 bytes
System Slot Information
Designation: CPU SLOT6 PCI-E 4.0 X16
Type: x16 PCI Express 4 x16
Current Usage: In Use
Length: Long
ID: 6
Characteristics:
3.3 V is provided
Opening is shared
PME signal is supported
Bus Address: 0000:18:00.0
The problem I'm having is that the NIC keeps dropping off the network, cause unknown, and I don't know how to dig deeper into it. All I have so far is the following from dmesg:
# dmesg | grep 0000:18:00.0
[ 0.754043] pci 0000:18:00.0: [8086:1592] type 00 class 0x020000
[ 0.754056] pci 0000:18:00.0: reg 0x10: [mem 0x201ffa000000-0x201ffbffffff 64bit pref]
[ 0.754070] pci 0000:18:00.0: reg 0x1c: [mem 0x201ffe010000-0x201ffe01ffff 64bit pref]
[ 0.754080] pci 0000:18:00.0: reg 0x30: [mem 0xbb600000-0xbb6fffff pref]
[ 0.754166] pci 0000:18:00.0: reg 0x184: [mem 0x201ffd000000-0x201ffd01ffff 64bit pref]
[ 0.754168] pci 0000:18:00.0: VF(n) BAR0 space: [mem 0x201ffd000000-0x201ffdffffff 64bit pref] (contains BAR0 for 128 VFs)
[ 0.754179] pci 0000:18:00.0: reg 0x190: [mem 0x201ffe220000-0x201ffe223fff 64bit pref]
[ 0.754180] pci 0000:18:00.0: VF(n) BAR3 space: [mem 0x201ffe220000-0x201ffe41ffff 64bit pref] (contains BAR3 for 128 VFs)
[ 0.754429] pci 0000:18:00.0: 126.016 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x16 link at 0000:17:02.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[ 0.800984] pci 0000:18:00.0: CLS mismatch (64 != 32), using 64 bytes
[ 1.369098] pci 0000:18:00.0: Adding to iommu group 31
[ 1.819150] ice 0000:18:00.0: firmware: failed to load intel/ice/ddp/ice-e20070ffffd99fd0.pkg (-2)
[ 1.819589] ice 0000:18:00.0: firmware: direct-loading firmware intel/ice/ddp/ice.pkg
[ 2.140744] ice 0000:18:00.0: The DDP package was successfully loaded: ICE OS Default Package version 1.3.30.0
[ 2.211858] ice 0000:18:00.0: PTP init successful
[ 2.616387] ice 0000:18:00.0: DCB is enabled in the hardware, max number of TCs supported on this port are 8
[ 2.616387] ice 0000:18:00.0: FW LLDP is disabled, DCBx/LLDP in SW mode.
[ 2.616492] ice 0000:18:00.0: Commit DCB Configuration to the hardware
[ 2.618380] ice 0000:18:00.0: 126.016 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x16 link at 0000:17:02.0 (capable of 252.048 Gb/s with 16.0 GT/s PCIe x16 link)
[ 2.621272] ice 0000:18:00.0 eth0: A parallel fault was detected.
[ 2.621365] ice 0000:18:00.0 eth0: Possible Solution: Check link partner connection and configuration.
[ 2.621513] ice 0000:18:00.0 eth0: Port Number: 1.
[ 3.331319] ice 0000:18:00.0 ens6f0: renamed from eth0
[ 1052.057728] ice 0000:18:00.0 ens6f0: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: Off, Autoneg Negotiated: False, Flow Control: None
[2304065.370537] ice 0000:18:00.0 ens6f0: NIC Link is Down
[2304065.470757] ice 0000:18:00.0 ens6f0: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: Off, Autoneg Negotiated: False, Flow Control: None
[6567288.755539] ice 0000:18:00.0 ens6f0: Changing Rx descriptor count from 2048 to 8160
[10043828.294404] ice 0000:18:00.0 ens6f0: NIC Link is Down
[10043828.394033] ice 0000:18:00.0 ens6f0: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: Off, Autoneg Negotiated: False, Flow Control: None
[10198013.280727] ice 0000:18:00.0 ens6f0: NIC Link is Down
[10198013.381243] ice 0000:18:00.0 ens6f0: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: Off, Autoneg Negotiated: False, Flow Control: None
But I don't believe this is the real problem: it doesn't seem to happen often enough to explain the issues I'm seeing, and the link comes back up in under a second. The problems do seem to be related to network connectivity, for example:
root@pluto:/home/comind# ping knox
PING knox.comind.io (192.168.50.7) 56(84) bytes of data.
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=1 ttl=64 time=0.476 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=2 ttl=64 time=0.542 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=3 ttl=64 time=0.521 ms
...
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=26 ttl=64 time=0.544 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=27 ttl=64 time=0.554 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=34 ttl=64 time=0.539 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=35 ttl=64 time=0.402 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=36 ttl=64 time=0.539 ms
...
The interruptions (here from icmp_seq=27 to icmp_seq=34) seem to last around 7 seconds each, and they happen frequently. I see something similar in terminal sessions: keyboard input appears to stall for a few seconds and then shows up on the terminal all at once; sometimes characters are lost. NFS shares from this server suffer the same stalls.
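To quantify these gaps I can log each reply with a timestamp and flag any pause; a minimal sketch, assuming iputils ping (its -D flag prints an epoch timestamp per reply) and an arbitrary 2-second threshold:

ping -D 192.168.50.7 | awk '
/icmp_seq/ {
    t = substr($1, 2, length($1) - 2)   # strip the [ ] around the epoch timestamp
    if (last && t - last > 2)           # anything over 2 s counts as an outage
        printf "%.0f: gap of %.1f s\n", t, t - last
    last = t
}'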
NFS service is provided by Ganesha V3.4, and the log contains many lines such as:
13/01/2023 01:09:46 : epoch 63a6c5e3 : knox : ganesha.nfsd-3365103[svc_946] rpc :TIRPC :EVENT :svc_ioq_flushv: 0x7fc37422f1b0 fd 10798 msg_iov 0x7fc2da2e0f60 sendmsg remaining 112 result -1 error Broken pipe (32)
13/01/2023 06:26:54 : epoch 63a6c5e3 : knox : ganesha.nfsd-3365103[svc_887] rpc :TIRPC :EVENT :svc_ioq_flushv: 0x7fc2190609f0 fd 10386 msg_iov 0x7fc447406f60 sendmsg remaining 112 result -1 error Broken pipe (32)
13/01/2023 08:06:33 : epoch 63a6c5e3 : knox : ganesha.nfsd-3365103[svc_967] rpc :TIRPC :EVENT :svc_ioq_flushv: 0x7fc1f42aec90 fd 10387 msg_iov 0x7fc2d8ac8f60 sendmsg remaining 112 result -1 error Broken pipe (32)
13/01/2023 08:36:01 : epoch 63a6c5e3 : knox : ganesha.nfsd-3365103[svc_967] rpc :TIRPC :EVENT :svc_ioq_flushv: 0x7fc11c5ee4c0 fd 10388 msg_iov 0x7fc2d8ac8f60 sendmsg remaining 112 result -1 error Broken pipe (32)
13/01/2023 08:38:04 : epoch 63a6c5e3 : knox : ganesha.nfsd-3365103[svc_1032] rpc :TIRPC :EVENT :svc_ioq_flushv: 0x7fc134b4f480 fd 10394 msg_iov 0x7fc38cde1f60 sendmsg remaining 112 result -1 error Broken pipe (32)
13/01/2023 10:55:53 : epoch 63a6c5e3 : knox : ganesha.nfsd-3365103[svc_1032] rpc :TIRPC :EVENT :svc_vc_wait: 0x7fc1e8074320 fd 10603 recv errno 104 (will set dead)
Again, there are not enough errors in the log to explain the frequent stalls.
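For instance, tallying these events per hour (the log path is a guess at this setup's default) confirms they are far rarer than the stalls:

awk '/Broken pipe/ { n[$1 " " substr($2, 1, 2) ":00"]++ }
     END { for (h in n) print h, n[h] }' /var/log/ganesha/ganesha.log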
It seems clear to me that this is a network problem. The server is connected to an FS N5860-48SC switch, but unfortunately I don't know enough about troubleshooting on the switch side. I would be grateful for any help, insight, or suggestions on how to track this down.
Answer 1
When a link is unstable, especially over fibre, a very good indicator is whether you are seeing local faults or remote faults.
Look at the counters from:
ethtool -S ens6f0
and check for something like this:
$ ethtool -S ens259f0 |grep fault
mac_local_faults.nic: 0
mac_remote_faults.nic: 0
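If either counter is non-zero, it also helps to know when it increments; a simple poll loop (the 10-second interval is arbitrary):

while true; do
    echo "=== $(date -Is)"            # timestamp each sample
    ethtool -S ens6f0 | grep fault    # both local and remote fault counters
    sleep 10
done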
If there is nothing there, grab the output of:
ethtool -m ens6f0
ethtool -S ens6f0
ethtool -i ens6f0
devlink dev info
and double-check that you are running the latest available firmware/NVM image.
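For example, a one-shot collection into a single file you can attach to a support case (the file name is arbitrary):

{
    ethtool -m ens6f0    # module EEPROM: vendor, alarms, optical power
    ethtool -S ens6f0    # driver statistics, including the fault counters
    ethtool -i ens6f0    # driver version and firmware/NVM version
    devlink dev info     # per-component firmware versions
} > e810-diag.txt 2>&1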
The last place to look when troubleshooting is the switch's own log, to see whether it indicates a local (switch-side) or remote (E810-side) fault.
If the E810 reports local faults, the troubleshooting should lead you to contact support with some of the information collected above. Plenty of things could be wrong, but following these basic steps should help isolate some of them.
Answer 2
When you run ethtool -m, check whether any alarms are flagged and whether the RX/TX power levels are within range; for the acceptable range, look at the alarm thresholds in the same output.
The thresholds can differ between SFP modules.
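A quick way to pull out just those lines (the exact field names vary between modules, so the pattern is only a guess):

ethtool -m ens6f0 | grep -Ei 'alarm|warning|power'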
Answer 3
I finally found the problem: dust in the optics. Somehow, someone (no names mentioned, but it was someone very close to me!) managed to unplug the fibre cable and plug it back in without cleaning the optics. What an idiot. After a careful cleaning, everything works perfectly. We live and learn.