What causes an Ethernet interface to stop responding for about 30 seconds and then acknowledge all the packets it received?

2024-6-1

First question here. Hi!

Running on Ubuntu 16.04.

Hardware info: lspci | awk '/[Nn]et/ {print $1}' | xargs -i% lspci -ks %

00:1f.6 Ethernet controller: Intel Corporation Ethernet Connection (2) I219-V
    Subsystem: ASUSTeK Computer Inc. Ethernet Connection (2) I219-V
    Kernel driver in use: e1000e
    Kernel modules: e1000e
02:00.0 Network controller: Intel Corporation Device 093c (rev 3a)
    Subsystem: Intel Corporation Device 7001

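I am seeing some odd Ethernet stalls when running a P2P application, specifically https://github.com/prysmaticlabs/prysm. According to the application's own logs, about 30 peers are connected to my machine. Bandwidth utilization is low (peaking at 6 Mbps). I am on a Cat6 cable with a fiber uplink of about 120 Mbps, and the port is correctly forwarded, as reported by canyouseeme.org. Other P2P applications (e.g. torrents) do not show any conflicting behavior.

As mentioned, the symptoms are odd. While the application is running, it does not itself seem to lose connectivity. But when any other application needs the network (e.g. web browsing, chat, file transfer), the interface stalls for several seconds or even minutes. I noticed this because browsing frequently times out.

When a stall happens, the application keeps running normally, but every other application loses its internet connection. I monitored ICMP (ping) traffic:

From the host to the router
From another local host to the stalling host

In both cases, ping stops returning any kind of response (the terminal stops printing, with no feedback and no timeouts). After a long pause, all the packets are suddenly acknowledged at once. See this example: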

64 bytes from 192.168.1.1: icmp_seq=1122 ttl=64 time=0.304 ms
64 bytes from 192.168.1.1: icmp_seq=1123 ttl=64 time=0.303 ms
64 bytes from 192.168.1.1: icmp_seq=1124 ttl=64 time=0.313 ms
64 bytes from 192.168.1.1: icmp_seq=1125 ttl=64 time=0.263 ms
64 bytes from 192.168.1.1: icmp_seq=1126 ttl=64 time=0.266 ms
64 bytes from 192.168.1.1: icmp_seq=1127 ttl=64 time=0.273 ms
64 bytes from 192.168.1.1: icmp_seq=1128 ttl=64 time=0.289 ms
64 bytes from 192.168.1.1: icmp_seq=1129 ttl=64 time=0.276 ms
64 bytes from 192.168.1.1: icmp_seq=1130 ttl=64 time=0.280 ms
64 bytes from 192.168.1.1: icmp_seq=1131 ttl=64 time=0.635 ms
64 bytes from 192.168.1.1: icmp_seq=1132 ttl=64 time=0.292 ms
64 bytes from 192.168.1.1: icmp_seq=1133 ttl=64 time=0.537 ms
64 bytes from 192.168.1.1: icmp_seq=1134 ttl=64 time=0.299 ms
64 bytes from 192.168.1.1: icmp_seq=1135 ttl=64 time=0.272 ms
64 bytes from 192.168.1.1: icmp_seq=1136 ttl=64 time=27625 ms
64 bytes from 192.168.1.1: icmp_seq=1137 ttl=64 time=26635 ms
64 bytes from 192.168.1.1: icmp_seq=1138 ttl=64 time=25631 ms
64 bytes from 192.168.1.1: icmp_seq=1139 ttl=64 time=24640 ms
64 bytes from 192.168.1.1: icmp_seq=1140 ttl=64 time=23641 ms
64 bytes from 192.168.1.1: icmp_seq=1141 ttl=64 time=22671 ms
64 bytes from 192.168.1.1: icmp_seq=1142 ttl=64 time=21648 ms
64 bytes from 192.168.1.1: icmp_seq=1143 ttl=64 time=20652 ms
64 bytes from 192.168.1.1: icmp_seq=1144 ttl=64 time=19658 ms
64 bytes from 192.168.1.1: icmp_seq=1145 ttl=64 time=18655 ms
64 bytes from 192.168.1.1: icmp_seq=1146 ttl=64 time=17658 ms
64 bytes from 192.168.1.1: icmp_seq=1147 ttl=64 time=16659 ms
64 bytes from 192.168.1.1: icmp_seq=1148 ttl=64 time=15655 ms
64 bytes from 192.168.1.1: icmp_seq=1149 ttl=64 time=14632 ms
64 bytes from 192.168.1.1: icmp_seq=1150 ttl=64 time=13611 ms
64 bytes from 192.168.1.1: icmp_seq=1151 ttl=64 time=12588 ms
64 bytes from 192.168.1.1: icmp_seq=1152 ttl=64 time=11565 ms
64 bytes from 192.168.1.1: icmp_seq=1153 ttl=64 time=10542 ms
64 bytes from 192.168.1.1: icmp_seq=1154 ttl=64 time=9522 ms
64 bytes from 192.168.1.1: icmp_seq=1155 ttl=64 time=8501 ms
64 bytes from 192.168.1.1: icmp_seq=1156 ttl=64 time=7478 ms
64 bytes from 192.168.1.1: icmp_seq=1157 ttl=64 time=6459 ms
64 bytes from 192.168.1.1: icmp_seq=1158 ttl=64 time=5436 ms
64 bytes from 192.168.1.1: icmp_seq=1159 ttl=64 time=4415 ms
64 bytes from 192.168.1.1: icmp_seq=1160 ttl=64 time=3391 ms
64 bytes from 192.168.1.1: icmp_seq=1161 ttl=64 time=2370 ms
64 bytes from 192.168.1.1: icmp_seq=1162 ttl=64 time=1350 ms
64 bytes from 192.168.1.1: icmp_seq=1163 ttl=64 time=320 ms
64 bytes from 192.168.1.1: icmp_seq=1164 ttl=64 time=2.73 ms
64 bytes from 192.168.1.1: icmp_seq=1165 ttl=64 time=0.258 ms
64 bytes from 192.168.1.1: icmp_seq=1166 ttl=64 time=0.303 ms

Then the network goes back to normal for a while.
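One way to read the ping output above: adding each reply's send time (sequence number times the ping interval) to its reported round-trip time gives roughly the moment the reply arrived. The following is a minimal sketch, assuming the default 1-second ping interval; the helper name `ping_arrivals` is hypothetical and not part of any tool.

```shell
# ping_arrivals [INTERVAL]: read ping output on stdin and print, for each
# reply, its sequence number and estimated arrival time in seconds since
# the start of the run (seq * interval + rtt). INTERVAL defaults to 1,
# which is an assumption about how ping was invoked.
ping_arrivals() {
    awk -v interval="${1:-1}" '
        /icmp_seq=/ {
            seq = ""; rtt = ""
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^icmp_seq=/) { seq = substr($i, 10) }  # text after "icmp_seq="
                if ($i ~ /^time=/)     { rtt = substr($i, 6) }   # text after "time=", in ms
            }
            if (seq != "" && rtt != "")
                printf "%d %.3f\n", seq, seq * interval + rtt / 1000
        }'
}

# Usage (ping.log is a hypothetical file holding saved ping output):
#   ping_arrivals < ping.log
```

Applied to the sample above, replies 1136 through 1163 all come out at roughly 1163.3 to 1163.7 seconds. That suggests the packets were held somewhere and released together rather than dropped, although it does not say where they were held.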

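Things I have tried:

Raising the MTU from 1500 to 9000 (no effect)
Raising txqueuelen from 1000 to 11000 (no effect)
Limiting the number of peers that can connect (no effect)
Virtualization (no effect)
Removing the port forwarding. This seems to work, although it defeats the purpose of the application and makes it run noticeably slower.

I currently have two theories:

1) Either the gateway is misbehaving (I cannot check this). I am ruling this out, because the other devices on the network work fine, for both local and external connections.
2) Or some kind of memory buffer is getting clogged, but I do not know which one.

I would be very grateful for any insight!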

Answer 1

For that card, you can try booting with this kernel parameter. This explains how to do it:

pcie_aspm=off
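For reference, here is a minimal sketch of how such a parameter is usually added on Ubuntu and then verified after reboot. The GRUB steps in the comments are the common Ubuntu procedure, not something taken from the answer; the helper name `cmdline_has` is hypothetical.

```shell
# Typical Ubuntu steps (run manually, with root privileges):
#   1. In /etc/default/grub, append pcie_aspm=off to the value of
#      GRUB_CMDLINE_LINUX_DEFAULT
#   2. sudo update-grub
#   3. reboot

# cmdline_has CMDLINE PARAM: succeed if PARAM is one of the
# space-separated words in CMDLINE.
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}

# After rebooting, check the command line the running kernel was given.
if cmdline_has "$(cat /proc/cmdline 2>/dev/null)" "pcie_aspm=off"; then
    echo "pcie_aspm=off is active"
else
    echo "pcie_aspm=off is not on the kernel command line"
fi
```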

Another approach is to use ethtool. For example:

sudo ethtool -G eth0 rx 256 tx 256

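That comes from here.

Answer 2

After more debugging of every element in the network, I found that the other devices were also affected by the traffic jam, even though the effect on them was less noticeable. This makes me think the problem lies with the router/switch, which probably cannot keep up with the demands of the P2P application because of NAT translation. I will try to get more capable hardware to solve this.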
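To get a rough idea of how many flows the router has to track for this host, the open TCP sockets can be counted by state on the host itself. This is only a sketch: it assumes `ss -tan` output with one header line and the state in the first column, it ignores UDP flows (which a NAT table also tracks), and the helper name `count_tcp_states` is hypothetical.

```shell
# count_tcp_states: read `ss -tan` output on stdin and print the number
# of sockets in each TCP state, most common state first.
count_tcp_states() {
    awk 'NR > 1 { n[$1]++ }                     # skip the header line
         END    { for (s in n) print n[s], s }' |
        sort -rn
}

# Usage:
#   ss -tan | count_tcp_states
```

A count that climbs sharply while the P2P application is running would be consistent with the NAT theory, but it does not prove it; the router's own connection table would be the authoritative source.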
