Flaky NIC: how to troubleshoot?

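My network connection is flaky. I think the problem must be somewhere between the switch and the server, but I don't know how to troubleshoot it. Here is the setup: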

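The switch is an N5860-48SC, a 48-port Ethernet L3 data center switch with 48 x 10Gb SFP+ and 8 x 100Gb QSFP28 ports. The server is connected to one of the 100G ports over fiber. The reason I think the problem must lie between that port and the server is that every other system, each connected with copper to a 10G port, works without problems.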

The server is Intel-based with a 100G NIC. lspci and ethtool show:

# lspci
...
51:00.0 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
51:00.1 Ethernet controller: Intel Corporation Ethernet Controller E810-C for QSFP (rev 02)
...
# ethtool ens4f0
Settings for ens4f0:
        Supported ports: [ FIBRE ]
        Supported link modes:   25000baseCR/Full
                                25000baseKR/Full
                                25000baseSR/Full
                                50000baseCR2/Full
                                100000baseSR4/Full
                                100000baseCR4/Full
                                100000baseLR4_ER4/Full
                                50000baseSR2/Full
                                100000baseSR2/Full
                                100000baseCR2/Full
        Supported pause frame use: Symmetric
        Supports auto-negotiation: No
        Supported FEC modes: None
        Advertised link modes:  25000baseSR/Full
                                50000baseCR2/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: None
        Speed: 100000Mb/s
        Duplex: Full
        Auto-negotiation: off
        Port: FIBRE
        PHYAD: 0
        Transceiver: internal
        Supports Wake-on: d
        Wake-on: d
        Current message level: 0x00000007 (7)
                               drv probe link
        Link detected: yes

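Pinging the server from another system (over a 1G NIC) behaves oddly: sometimes the host is unreachable, and sometimes the round-trip time is very long: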

...
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=560 ttl=64 time=0.297 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=561 ttl=64 time=0.284 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=562 ttl=64 time=0.231 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=563 ttl=64 time=0.280 ms
From 192.168.50.29 (192.168.50.29) icmp_seq=567 Destination Host Unreachable
From 192.168.50.29 (192.168.50.29) icmp_seq=568 Destination Host Unreachable
From 192.168.50.29 (192.168.50.29) icmp_seq=569 Destination Host Unreachable
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=570 ttl=64 time=0.423 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=571 ttl=64 time=0.275 ms
...
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=591 ttl=64 time=0.298 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=592 ttl=64 time=0.267 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=598 ttl=64 time=1020 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=599 ttl=64 time=0.337 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=600 ttl=64 time=0.260 ms
...

Finally, /var/log/messages:

# cat messages
Aug 28 00:10:21 knox rsyslogd: [origin software="rsyslogd" swVersion="8.2102.0" x-pid="785" x-info="https://www.rsyslog.com"] rsyslogd was HUPed
Aug 29 11:25:09 knox kernel: [2160313.695273] ice 0000:51:00.0 ens4f0: NIC Link is Down
Aug 29 11:25:09 knox kernel: [2160313.791123] ice 0000:51:00.0 ens4f0: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: Off, Autoneg Negotiated: False, Flow Control: None
Aug 29 14:28:17 knox kernel: [2171301.703959] ice 0000:51:00.0 ens4f0: NIC Link is Down
Aug 29 14:28:17 knox kernel: [2171301.808407] ice 0000:51:00.0 ens4f0: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: Off, Autoneg Negotiated: False, Flow Control: None
Aug 31 21:51:47 knox kernel: [2370711.058542] ice 0000:51:00.0 ens4f0: NIC Link is Down
Aug 31 21:51:47 knox kernel: [2370711.155567] ice 0000:51:00.0 ens4f0: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: Off, Autoneg Negotiated: False, Flow Control: None

These link events may be unrelated to the problem, but they are certainly not something I triggered manually.
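
To judge whether these events matter, it helps to pair each "Link is Down" with the following "Link is up" and measure how long the link was away. The following is only a minimal sketch: the function name `link_flaps` and the regular expression are my own, written against the `ice` driver message format shown above, and the sample lines are shortened copies of that excerpt.

```python
import re

# Shortened copies of the /var/log/messages lines shown above.
LOG = """\
Aug 29 11:25:09 knox kernel: [2160313.695273] ice 0000:51:00.0 ens4f0: NIC Link is Down
Aug 29 11:25:09 knox kernel: [2160313.791123] ice 0000:51:00.0 ens4f0: NIC Link is up 100 Gbps Full Duplex
Aug 29 14:28:17 knox kernel: [2171301.703959] ice 0000:51:00.0 ens4f0: NIC Link is Down
Aug 29 14:28:17 knox kernel: [2171301.808407] ice 0000:51:00.0 ens4f0: NIC Link is up 100 Gbps Full Duplex
"""

# Assumption: kernel timestamp in brackets, then "ice <pci-address> <iface>: NIC Link is ...".
LINK_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+ice\s+\S+\s+(?P<iface>\S+): NIC Link is (?P<state>Down|up)"
)

def link_flaps(text):
    """Return (iface, down_timestamp, outage_seconds) for each Down followed by up."""
    flaps = []
    down_at = {}  # iface -> kernel timestamp of the pending "Down" event
    for line in text.splitlines():
        m = LINK_RE.search(line)
        if not m:
            continue
        ts = float(m.group("ts"))
        iface = m.group("iface")
        if m.group("state") == "Down":
            down_at[iface] = ts
        elif iface in down_at:
            start = down_at.pop(iface)
            flaps.append((iface, start, ts - start))
    return flaps

for iface, start, outage in link_flaps(LOG):
    print(f"{iface}: link down at {start:.3f}, back up after {outage * 1000:.0f} ms")
```

Going by the kernel timestamps in the excerpt, each flap lasts roughly a tenth of a second and the flaps are hours or days apart, so on their own they probably do not explain ping gaps that last several seconds.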

---EDIT---

Another thing I just noticed: packets seem to be dropped silently. Note how the sequence numbers jump from 7 to 16 and from 24 to 30:

# ping knox
PING knox.comind.io (192.168.50.7) 56(84) bytes of data.
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=1 ttl=64 time=0.374 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=2 ttl=64 time=0.233 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=3 ttl=64 time=0.267 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=4 ttl=64 time=0.234 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=5 ttl=64 time=0.277 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=6 ttl=64 time=0.301 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=7 ttl=64 time=0.234 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=16 ttl=64 time=0.273 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=17 ttl=64 time=0.224 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=18 ttl=64 time=0.224 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=19 ttl=64 time=0.312 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=20 ttl=64 time=0.291 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=21 ttl=64 time=0.275 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=22 ttl=64 time=0.282 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=23 ttl=64 time=0.243 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=24 ttl=64 time=0.274 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=30 ttl=64 time=0.260 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=31 ttl=64 time=0.497 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=32 ttl=64 time=0.280 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=33 ttl=64 time=0.273 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=34 ttl=64 time=0.283 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=35 ttl=64 time=0.273 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=36 ttl=64 time=0.297 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=37 ttl=64 time=0.244 ms
64 bytes from knox.comind.io (192.168.50.7): icmp_seq=38 ttl=64 time=0.273 ms

Answer 1

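Suggestions:

  1. Check your configuration with systemd-networkd (the system that translates ifupdown).

    networkctl -a status

Check whether the result is what you want (the command is part of systemd-networkd). If it is not exactly what you intended when you wrote /etc/network/interfaces, consider using systemd directly instead of ifupdown.

  2. Check for failed services:

    systemctl --type=service --state=failed

  3. If you use /etc/network/interfaces on Debian 11, you may be fighting the newer systemd, which translates your old-style configuration. In that case, move /etc/network/interfaces to /etc/network/interfaces.save and configure the network with systemd. Create:

    /etc/systemd/network/10-mynet1.network

    /etc/systemd/network/20-mynet2.network

    ...

The syntax of systemd-networkd files differs from the syntax used in interfaces.

Debian 11.x ships both the old and the new networking system. On my Debian 11.5 minimal server install, that combination worked only in the simplest cases. I spent 5 days wrestling with ifupdown and its unexplained behavior; an hour after moving to systemd, everything worked as expected.

Regards, Bogdan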

Answer 2

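First, thanks to those who kindly offered help; it motivated me to do some deeper debugging. I ran dmidecode several times and finally realized what the problem was: the card was plugged into a PCIe 3.0 slot, which can only deliver around 63 Gbps, so a NIC running at 100 Gbps was bound to run into trouble. When I moved it to the only PCIe 4.0 slot, at first no transceiver was detected; upgrading the BIOS and BMC firmware fixed that.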
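
For anyone checking their own slot, the 63 Gbps figure can be reproduced with a little arithmetic. This is a rough sketch that assumes the PCIe 3.0 slot was wired as x8 (the lane width is an assumption on my part; it is simply the width at which the numbers match). PCIe 3.0 signals at 8 GT/s per lane and PCIe 4.0 at 16 GT/s, both with 128b/130b encoding; real throughput is somewhat lower again because of packet overhead.

```python
# Per-lane signalling rate in GT/s; PCIe 3.0 and 4.0 both use 128b/130b encoding.
GT_PER_LANE = {"3.0": 8.0, "4.0": 16.0}
ENCODING = 128 / 130

def pcie_gbps(gen, lanes):
    """Raw link bandwidth in Gbit/s per direction, before protocol overhead."""
    return GT_PER_LANE[gen] * ENCODING * lanes

for gen, lanes in [("3.0", 8), ("3.0", 16), ("4.0", 8), ("4.0", 16)]:
    bw = pcie_gbps(gen, lanes)
    verdict = "enough" if bw >= 100 else "not enough"
    print(f"PCIe {gen} x{lanes}: {bw:6.1f} Gbit/s ({verdict} for one 100G port)")
```

By this estimate a PCIe 3.0 x8 link tops out at about 63 Gbit/s, while a PCIe 4.0 x8 or a PCIe 3.0 x16 link offers about 126 Gbit/s, which is why the slot matters more than the NIC itself here.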

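The last remaining issue is that only one module is active; to activate the second one, I would have to enable PCIe bifurcation for that slot in the BIOS. However, since 100 Gbps far exceeds what the RAID card supports (12 Gbps), I can already saturate the disk array over the network, so that can wait.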
