EC2 VPC intermittent outbound connection timeouts

2024-6-1

My production web service consists of:

An Auto Scaling group
A Network Load Balancer (ELB)
2 EC2 instances as web servers

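This setup ran fine until yesterday, when one of the EC2 instances started experiencing RDS and ElastiCache timeouts. The other instance continues to run without issues.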

While investigating, I noticed that outbound connections sometimes show large delays:

[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null

real    0m7.147s -- 7 seconds
user    0m0.007s
sys     0m0.000s
[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null

real    0m3.114s
user    0m0.007s
sys     0m0.000s
[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null

real    0m0.051s
user    0m0.006s
sys     0m0.000s
[ec2-user@ip-10-0-5-9 logs]$ time curl -s www.google.com > /dev/null

real    1m6.309s -- over a minute!
user    0m0.009s
sys     0m0.000s

[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.7.196), 1 hops max, 60 byte packets
 1  * * *
[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.7.196), 1 hops max, 60 byte packets
 1  216.182.226.174  17.706 ms * *
[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.8.4), 1 hops max, 60 byte packets
 1  216.182.226.174  20.364 ms * *
[ec2-user@ip-10-0-5-9 logs]$ traceroute -n -m 1 www.google.com
traceroute to www.google.com (172.217.7.132), 1 hops max, 60 byte packets
 1  216.182.226.170  12.680 ms  12.671 ms *
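
The `time curl` runs above report only total wall-clock time, so they cannot show whether the delay sits in DNS resolution, the TCP handshake, or the server's response. Below is a minimal sketch of one way to split that out with curl's `-w` write-out timers. The helper names `phases` and `probe` and the 90-second cap are illustrative assumptions, not part of the original post.

```shell
#!/bin/sh
# phases: turn curl's cumulative timers (seconds) into per-phase durations.
# Arguments: time_namelookup time_connect time_starttransfer time_total
phases() {
  awk -v d="$1" -v c="$2" -v s="$3" -v t="$4" 'BEGIN {
    printf "dns=%.3f tcp=%.3f wait=%.3f transfer=%.3f\n", d, c - d, s - c, t - s
  }'
}

# probe: fetch a URL once, discard the body, and print the breakdown.
# --max-time 90 is an arbitrary cap chosen to outlast the ~66 s stall seen above.
probe() {
  curl -s -o /dev/null --max-time 90 \
    -w '%{time_namelookup} %{time_connect} %{time_starttransfer} %{time_total}\n' \
    "$1" |
  while read -r d c s t; do
    phases "$d" "$c" "$s" "$t"
  done
}
```

Running `probe www.google.com` repeatedly on the affected instance should show which phase absorbs the stall: a large `tcp` value points at the handshake, a large `dns` value at the resolver. The 3 s and 7 s totals above would be consistent with TCP SYN retransmission backoff, which this breakdown could confirm or rule out. If a request fails part-way, timers for phases it never reached may be reported as 0, so treat negative values as "did not complete".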

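Further analysis shows that if I manually detach the "bad" instance from the Auto Scaling group and remove it from the load balancer targets, the problem disappears immediately. As soon as I add it back, the problem returns.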

The nodes are m5.xlarge and appear to have spare capacity, so I don't believe this is a resource issue.

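Update: this appears to be related to load on the node. Last night I put the load back on and it seemed stable, but this morning, as load increased, outbound traffic (DB, etc.) started failing. I really don't understand how outbound traffic is being affected. The other, identical node has no problems, even when taking 100% of the traffic instead of 50%.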

traceroute to 54.14.xx.xx (54.14.xx.xx), 1 hops max, 60 byte packets
 1  216.182.226.174  18.691 ms 216.182.226.166  18.341 ms 216.182.226.174  18.660 ms
traceroute to 54.14.xx.xx (54.14.xx.xx), 1 hops max, 60 byte packets
 1  * * *

What is the IP 216.182.226.166? Is it related to the VPC IGW?

Node stats:

m5.xlarge
CPU: ~7.5%
Load average: 0.18, 0.29, 0.29
Network in: ~8M bytes/min

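Update: with only 1 of the 2 nodes attached to the load balancer, everything appears to run stably, with all traffic on one node. After I add the second node to the load balancer, it takes some time (hours to days) before one of the nodes starts showing the outbound connection problems described above (timeouts connecting to the database, Google, etc.). While it is in this state, the other node works fine. Replacing the "bad node" or putting it back in the load balancer lets everything run for a while. The images use Amazon Linux 2 (4.14.114-103.97.amzn2.x86_64).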

Answer 1

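You are probably using a NAT gateway or NAT instance to reach the internet. If not, you may need to provide more information about your architecture. You might be using Direct Connect and possibly routing internet traffic through your on-premises network.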

Please read the following, which cover system limits and inbound connections on ephemeral ports.

https://docs.aws.amazon.com/vpc/latest/userguide/vpc-recommended-nacl-rules.html
https://aws.amazon.com/premiumsupport/knowledge-center/resolve-connection-nat-instance/
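
The ephemeral-port and system-limit angle can also be checked directly on the affected instance. This is a rough sketch, assuming a Linux host with `ss` from iproute2 available. The function names `port_range_size` and `count_states` are hypothetical helpers introduced here for illustration.

```shell
#!/bin/sh
# port_range_size: read "LOW HIGH" on stdin and print how many local ports
# the kernel may use as source ports for outbound connections.
port_range_size() {
  awk '{ print $2 - $1 + 1 }'
}

# count_states: read `ss -tan` output on stdin and print one
# "STATE COUNT" line per TCP state, skipping the header row.
count_states() {
  awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# Example invocation on the instance (commented out so that sourcing
# this file has no side effects):
# port_range_size < /proc/sys/net/ipv4/ip_local_port_range
# ss -tan | count_states
```

A socket count approaching the port range size would suggest source-port pressure on the host itself, and many sockets in `SYN-SENT` would suggest handshakes that are not completing. Low counts would point elsewhere, such as the NAT device or the network ACL rules covered in the links above.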
