Proxmox: packet loss on bonded NICs in LACP mode

2024-6-19

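I've run into a strange problem. I have a machine running Proxmox 5.3 with a 4-port Intel NIC (gigabit, PCI-e) and a fifth gigabit Ethernet port on the motherboard.

I've configured the machine so that the onboard NIC is the management interface and the 4 gigabit NICs are bonded together with LACP (and connected to an HP ProCurve 1810G managed switch). All VMs and containers on the box get their network connectivity through the bonded NICs. The switch is managed and supports LACP, and I've configured a trunk on the switch for the 4 ports.

Everything seemed to be working fine, or so I thought.

Over the weekend I installed netdata on the Proxmox host, and now I keep getting alerts about packet loss on bond0 (the 4 bonded NICs). I'm a bit puzzled as to why.

Looking at the statistics for bond0, RX packets are being dropped at a fairly steady rate (currently about 160 RX packets dropped in the last 10 minutes); no TX packets appear to be dropped.

The interface output is below. You'll notice that the VM bridge interface shows no drops; they occur only on bond0 and its slaves. The MTU is set to 9000 (jumbo frames are enabled on the switch), but I saw the same problem with an MTU of 1500. enp12s0 is the management NIC; the other 4 NICs are the bond slaves.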

bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 9000
    ether 00:1b:21:c7:40:d8  txqueuelen 1000  (Ethernet)
    RX packets 347300  bytes 146689725 (139.8 MiB)
    RX errors 0  dropped 11218  overruns 0  frame 0
    TX packets 338459  bytes 132985798 (126.8 MiB)
    TX errors 0  dropped 2 overruns 0  carrier 0  collisions 0

enp12s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
    inet 192.168.1.3  netmask 255.255.255.0  broadcast 192.168.1.255
    inet6 fe80::7285:c2ff:fe67:19b9  prefixlen 64  scopeid 0x20<link>
    ether 70:85:c2:67:19:b9  txqueuelen 1000  (Ethernet)
    RX packets 25416597  bytes 36117733348 (33.6 GiB)
    RX errors 0  dropped 0  overruns 0  frame 0
    TX packets 16850795  bytes 21472508786 (19.9 GiB)
    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp3s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 9000
    ether 00:1b:21:c7:40:d8  txqueuelen 1000  (Ethernet)
    RX packets 225363  bytes 113059352 (107.8 MiB)
    RX errors 0  dropped 2805  overruns 0  frame 0
    TX packets 15162  bytes 2367657 (2.2 MiB)
    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp3s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 9000
    ether 00:1b:21:c7:40:d8  txqueuelen 1000  (Ethernet)
    RX packets 25499  bytes 6988254 (6.6 MiB)
    RX errors 0  dropped 2805  overruns 0  frame 0
    TX packets 263442  bytes 123302293 (117.5 MiB)
    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp4s0f0: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 9000
    ether 00:1b:21:c7:40:d8  txqueuelen 1000  (Ethernet)
    RX packets 33208  bytes 11681537 (11.1 MiB)
    RX errors 0  dropped 2804  overruns 0  frame 0
    TX packets 42729  bytes 2258949 (2.1 MiB)
    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

enp4s0f1: flags=6211<UP,BROADCAST,RUNNING,SLAVE,MULTICAST>  mtu 9000
    ether 00:1b:21:c7:40:d8  txqueuelen 1000  (Ethernet)
    RX packets 63230  bytes 14960582 (14.2 MiB)
    RX errors 0  dropped 2804  overruns 0  frame 0
    TX packets 17126  bytes 5056899 (4.8 MiB)
    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vmbr0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 9000
    inet 192.168.1.4  netmask 255.255.255.0  broadcast 192.168.1.255
    inet6 fe80::21b:21ff:fec7:40d8  prefixlen 64  scopeid 0x20<link>
    ether 00:1b:21:c7:40:d8  txqueuelen 1000  (Ethernet)
    RX packets 54616  bytes 5852177 (5.5 MiB)
    RX errors 0  dropped 0  overruns 0  frame 0
    TX packets 757  bytes 61270 (59.8 KiB)
    TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
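
A quick arithmetic check on the counters above may help narrow things down. The sketch below (Python, using only the numbers pasted in this output) compares how the RX traffic and the RX drops are distributed across the four slaves. The variable names are made up for illustration.

```python
# Counters copied from the interface output above.
bond_rx_dropped = 11218
slaves = {
    # name: (rx_packets, rx_dropped)
    "enp3s0f0": (225363, 2805),
    "enp3s0f1": (25499, 2805),
    "enp4s0f0": (33208, 2804),
    "enp4s0f1": (63230, 2804),
}

total_rx = sum(rx for rx, _ in slaves.values())
total_dropped = sum(dropped for _, dropped in slaves.values())

print(f"slave drops add up to {total_dropped} (bond0 reports {bond_rx_dropped})")
for name, (rx, dropped) in slaves.items():
    rx_share = rx / total_rx
    drop_share = dropped / total_dropped
    print(f"{name}: {rx_share:6.1%} of RX traffic, {drop_share:6.1%} of RX drops")
```

The drops are split almost exactly evenly (2804 or 2805 per slave) even though enp3s0f0 receives most of the traffic. That would fit frames arriving on every port at the same rate better than drops caused by load, although it does not prove it.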

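Initially suspecting some kind of buffer issue, I made some sysctl tweaks to make sure the buffer sizes were adequate. The sysctl tweaks can be found here (they didn't seem to make any difference):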

https://paste.linux.community/view/3b5f2b63

The network configuration is:

auto lo
iface lo inet loopback

auto enp12s0
iface enp12s0 inet static
    address  192.168.1.3
    netmask  255.255.255.0

iface enp3s0f0 inet manual

iface enp3s0f1 inet manual

iface enp4s0f0 inet manual

iface enp4s0f1 inet manual

auto bond0
iface bond0 inet manual
    bond-slaves enp3s0f0 enp3s0f1 enp4s0f0 enp4s0f1
    bond-miimon 100
    bond-mode 802.3ad
    mtu 9000

auto vmbr0
iface vmbr0 inet static
    address  192.168.1.4
    netmask  255.255.255.0
    gateway  192.168.1.1
    bridge-ports bond0
    bridge-stp off
    bridge-fd 0

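Troubleshooting steps I've taken:

a) sysctl tweaks (as linked above)
b) increased the MTU and enabled jumbo frames on the switch (no change)
c) reset the switch and recreated the LACP trunk (no change)

Any ideas what I should try next? I'm starting to think there's something about NIC teaming that I don't understand. As I said, everything seems to work fine, but I'm a bit concerned about the high packet loss.

Other machines connected to the switch don't have this problem (nor does the 5th NIC on this machine).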

Answer 1

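I've seen this before: HP switches sometimes seem to send broadcast packets to all members of an LACP trunk. The kernel then treats these packets as duplicates and drops them (except, of course, for the first one to arrive).

While this is certainly inelegant, it doesn't seem to cause problems in practice. You can verify that this is what's happening by deliberately sending a lot of broadcast packets and checking whether the increase matches the drop statistics.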
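
A minimal sketch of that check in Python, under a few assumptions that are not in the original post: the drop counter is read from /sys/class/net/bond0/statistics/rx_dropped, the broadcasts are UDP datagrams to 192.168.1.255 (the broadcast address shown in the question's output) on port 9, and the helper names are hypothetical. The sender should run on another machine on the same LAN, since the point is to see what the bond receives.

```python
import socket
from pathlib import Path


def read_rx_dropped(iface, sysfs_root="/sys/class/net"):
    """Read the kernel's rx_dropped counter for an interface (run on the Proxmox host)."""
    path = Path(sysfs_root) / iface / "statistics" / "rx_dropped"
    return int(path.read_text())


def send_broadcasts(count, address="192.168.1.255", port=9, payload=b"bond-drop-test"):
    """Send `count` UDP broadcast datagrams (run on a different machine on the LAN)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        # Without SO_BROADCAST the kernel refuses to send to a broadcast address.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        for _ in range(count):
            sock.sendto(payload, (address, port))
    return count
```

Read read_rx_dropped("bond0") on the host, call send_broadcasts(...) from the other machine, then read the counter again. The background rate in the question is roughly 160 drops per 10 minutes, so a burst of a few thousand broadcasts should stand out clearly if the explanation above applies; if the counter barely moves, the drops probably have another cause.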
