Why am I seeing network retransmissions when using iperf3?

2024-6-19

linux networking docker bridge

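I'm seeing retransmissions between two pods in a Kubernetes cluster I'm setting up. I'm using kube-router (https://github.com/cloudnativelabs/kube-router) for networking between hosts. Here's the setup:

host-a and host-b are connected to the same switch. They are on the same L2 network. Both are connected to that switch with LACP 802.3ad bonds, and those bonds are up and functioning properly.

pod-a and pod-b are on host-a and host-b respectively. I'm running iperf3 between the pods and see retransmissions.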

root@pod-b:~# iperf3 -c 10.1.1.4
Connecting to host 10.1.1.4, port 5201
[  4] local 10.1.2.5 port 55482 connected to 10.1.1.4 port 5201
[ ID] Interval           Transfer     Bandwidth       Retr  Cwnd
[  4]   0.00-1.00   sec  1.15 GBytes  9.86 Gbits/sec  977   3.03 MBytes
[  4]   1.00-2.00   sec  1.15 GBytes  9.89 Gbits/sec  189   3.03 MBytes
[  4]   2.00-3.00   sec  1.15 GBytes  9.90 Gbits/sec   37   3.03 MBytes
[  4]   3.00-4.00   sec  1.15 GBytes  9.89 Gbits/sec  181   3.03 MBytes
[  4]   4.00-5.00   sec  1.15 GBytes  9.90 Gbits/sec    0   3.03 MBytes
[  4]   5.00-6.00   sec  1.15 GBytes  9.90 Gbits/sec    0   3.03 MBytes
[  4]   6.00-7.00   sec  1.15 GBytes  9.88 Gbits/sec  305   3.03 MBytes
[  4]   7.00-8.00   sec  1.15 GBytes  9.90 Gbits/sec   15   3.03 MBytes
[  4]   8.00-9.00   sec  1.15 GBytes  9.89 Gbits/sec  126   3.03 MBytes
[  4]   9.00-10.00  sec  1.15 GBytes  9.86 Gbits/sec  518   2.88 MBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bandwidth       Retr
[  4]   0.00-10.00  sec  11.5 GBytes  9.89 Gbits/sec  2348             sender
[  4]   0.00-10.00  sec  11.5 GBytes  9.88 Gbits/sec                  receiver

iperf Done.

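What I'm trying to debug is that when I run the same iperf3 test directly between host-a and host-b (not over the bridge interface that kube-router creates), I see no retransmissions. So the network path looks like this: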

pod-a -> kube-bridge -> host-a -> L2 switch -> host-b -> kube-bridge -> pod-b

Removing kube-bridge from the equation results in zero retransmissions. I have also tested host-a to pod-b and saw the same retransmissions.
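For comparing runs over the different paths, a small helper can pull the retransmit total out of each iperf3 client run. This is only a sketch: retr_total is a hypothetical name, and it assumes the plain-text summary format shown above, where the row ending in "sender" carries the Retr total in the field just before it.

```shell
# Print the total retransmit count from iperf3's plain-text client output.
# Assumption: the summary row ends in "sender" and the field just before
# it is the Retr total, as in the output shown above.
retr_total() {
    awk '$NF == "sender" { print $(NF - 1) }'
}

# Hypothetical usage, once from a pod and once from the host:
#   iperf3 -c 10.1.1.4 | retr_total
```

With parallel streams (-P) there is one "sender" row per stream plus a sum row, so the helper would print several numbers rather than one.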

I have been running dropwatch and see the following on the receiving host (the iperf3 server, by default):

% dropwatch -l kas
Initalizing kallsyms db
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
2 drops at ip_rcv_finish+1f3 (0xffffffff87522253)
1 drops at sk_stream_kill_queues+48 (0xffffffff874ccb98)
1 drops at __brk_limit+35f81ba4 (0xffffffffc0761ba4)
16991 drops at skb_release_data+9e (0xffffffff874c6a4e)
1 drops at tcp_v4_do_rcv+87 (0xffffffff87547ef7)
1 drops at sk_stream_kill_queues+48 (0xffffffff874ccb98)
2 drops at ip_rcv_finish+1f3 (0xffffffff87522253)
1 drops at sk_stream_kill_queues+48 (0xffffffff874ccb98)
3 drops at skb_release_data+9e (0xffffffff874c6a4e)
1 drops at sk_stream_kill_queues+48 (0xffffffff874ccb98)
16091 drops at skb_release_data+9e (0xffffffff874c6a4e)
1 drops at __brk_limit+35f81ba4 (0xffffffffc0761ba4)
1 drops at tcp_v4_do_rcv+87 (0xffffffff87547ef7)
1 drops at sk_stream_kill_queues+48 (0xffffffff874ccb98)
2 drops at skb_release_data+9e (0xffffffff874c6a4e)
8463 drops at skb_release_data+9e (0xffffffff874c6a4e)
2 drops at skb_release_data+9e (0xffffffff874c6a4e)
2 drops at skb_release_data+9e (0xffffffff874c6a4e)
2 drops at tcp_v4_do_rcv+87 (0xffffffff87547ef7)
2 drops at ip_rcv_finish+1f3 (0xffffffff87522253)
2 drops at skb_release_data+9e (0xffffffff874c6a4e)
15857 drops at skb_release_data+9e (0xffffffff874c6a4e)
1 drops at sk_stream_kill_queues+48 (0xffffffff874ccb98)
1 drops at __brk_limit+35f81ba4 (0xffffffffc0761ba4)
7111 drops at skb_release_data+9e (0xffffffff874c6a4e)
9037 drops at skb_release_data+9e (0xffffffff874c6a4e)

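The sending side sees drops too, but nowhere near the amounts shown here (1-2 at most per line of output, which I hope is normal).

Also, I'm using a 9000 MTU (on the bond0 interface to the switch and on the bridge).

I'm running CoreOS Container Linux Stable 1632.3.0. Linux hostname 4.14.19-coreos #1 SMP Wed Feb 14 03:18:05 UTC 2018 x86_64 GNU/Linux

Any help or pointers would be much appreciated.

Update: I tried a 1500 MTU with the same result. Significant retransmissions.

Update 2: It seems that iperf3 -b 10G ... runs without issues both between pods and directly between hosts (2x 10Gbit NICs in an LACP bond). The issues arise with iperf3 -b 11G between pods, but not between hosts. Is iperf3 being smart about the NIC size, but unable to do that on the local bridged veth?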
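The comparison in Update 2 could be repeated with a small loop over target bitrates. This is a sketch, not the exact commands I ran: sweep_rates and the IPERF3 override variable are hypothetical names, and only the -c and -b options of iperf3 are used.

```shell
IPERF3=${IPERF3:-iperf3}   # hypothetical override; IPERF3=echo gives a dry run

# Run one client test per target bitrate against the given server.
sweep_rates() {
    server=$1
    shift
    for rate in "$@"; do
        echo "== target rate: $rate"
        "$IPERF3" -c "$server" -b "$rate"
    done
}

# Hypothetical usage from pod-b, then again from host-b:
#   sweep_rates 10.1.1.4 9G 10G 11G
```

Running it once from a pod and once from the host shows at which target rate the Retr column stops being zero on each path.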

Answer 1

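Author of kube-router here. Kube-router relies on the Bridge CNI plugin to create kube-bridge. It is standard Linux networking, with nothing tuned specifically for pod networking. kube-bridge is set to the default MTU of 1500. We have an open bug to set it to a sensible value.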

https://github.com/cloudnativelabs/kube-router/issues/165

Do you think the errors you are seeing are due to the smaller MTU?

Answer 2

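It looks like something (the NIC or the kernel?) is slowing traffic down as it is output to the bond0 interface. In the case of the Linux bridge (pod), the "NIC" is simply a veth, which peaked at around 47Gbps when I tested mine. So when iperf3 is asked to send packets out the bond0 interface, it overruns the interface and ends up with dropped packets (it is unclear why we see the drops on the receiving host).

I confirmed that if I apply a tc qdisc class to slow the pod interface down to 10Gbps, there is no loss when simply running iperf3 to the other pod.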

tc qdisc add dev eth0 root handle 1:0 htb default 10
tc class add dev eth0 parent 1:0 classid 1:10 htb rate 10Gbit

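This was enough to ensure that an iperf3 run without a bandwidth setting did not incur retransmissions from overrunning the NIC. I'll be looking for a way to use tc to slow down only the flows that go out through the NIC.

Update: Here's how to slow down all traffic except traffic to the local bridged subnet.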

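tc qdisc add dev eth0 root handle 1:0 htb default 10
# 1:5 carries traffic to the local bridged subnet, effectively unshaped
tc class add dev eth0 parent 1:0 classid 1:5 htb rate 80Gbit
# 1:10 is the default class: everything else is capped at 10Gbit
tc class add dev eth0 parent 1:0 classid 1:10 htb rate 10Gbit
tc filter add dev eth0 parent 1:0 protocol ip u32 match ip dst 10.81.18.4/24 classid 1:5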
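To check that the shaping classes are actually carrying traffic, and to undo the shaping afterwards, something like the following may help. It is a sketch: shape_show, shape_clear and the TC override variable are hypothetical names, while the tc subcommands themselves are standard.

```shell
TC=${TC:-tc}   # hypothetical override; TC=echo gives a dry run

# Show per-class counters (bytes, packets, drops, overlimits) on a device.
shape_show() {
    "$TC" -s class show dev "$1"
}

# Remove the root qdisc, which also removes its classes and filters.
shape_clear() {
    "$TC" qdisc del dev "$1" root
}

# Hypothetical usage inside the pod (needs root):
#   shape_show eth0
#   shape_clear eth0
```

If the counters on 1:10 grow during an iperf3 run to another host, the default class is the one doing the shaping.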

Answer 3

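A single TCP connection cannot exceed 10Gb, even when the bond interface is 20Gb, because each flow is hashed onto one member NIC. Now, if you ran iperf3 -P 2 it could possibly use 20Gb in total, depending on the /sys/class/net/bond0/bonding/xmit_hash_policy setting on both hosts. The kernel default is layer2, but if you set it to layer3+4 (a hash on source/destination IP and port) it should spread the load across both NICs, up to the maximum speed of the bond.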
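A minimal sketch of checking and changing that setting through sysfs follows. bond_hash_policy, set_layer34 and the SYSFS override variable are hypothetical names; the sysfs path is the one given above. Writing the file needs root, and some kernels may require the bond to be down before the policy can be changed.

```shell
# Read the transmit hash policy of a bond from sysfs.
# SYSFS is a hypothetical override for testing; it defaults to /sys.
bond_hash_policy() {
    cat "${SYSFS:-/sys}/class/net/$1/bonding/xmit_hash_policy"
}

# Switch a bond to layer3+4 hashing (hash on IP addresses and ports).
set_layer34() {
    echo layer3+4 > "${SYSFS:-/sys}/class/net/$1/bonding/xmit_hash_policy"
}

# Hypothetical usage on both hosts, followed by two parallel streams:
#   bond_hash_policy bond0
#   set_layer34 bond0
#   iperf3 -c 10.1.1.4 -P 2
```

Even with layer3+4, two streams can still hash onto the same NIC, so a few more parallel streams may be needed before both links are used.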

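I stumbled on this post while having a similar problem: I get drops when running iperf with 2+ parallel streams at less than 20Gb of total bandwidth between hosts with 2x 10Gb bonds in different subnets... Juniper has reproduced the issue but doesn't have any good answers yet :( If they can't figure it out, the next step may be to fall back on Linux QoS.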
