Our application servers occasionally get flooded with 502 errors. I have tried scaling the server up (t3.medium to t3.large), but I'm worried this is a deeper problem than scaling can fix.
The affected server runs NGINX as a reverse proxy in front of a .NET Core service.
We do get zero-downtime deployments by running 2 services, up-stores-a and up-stores-b: while one of them is taken down for a deployment, nginx routes requests to the other, and it switches back once the deployment completes. The "bad gateway" problem never happens during a deployment, so it is very confusing why the upstream name would matter at all, but I wanted to give the background anyway.
Here is the server setup (with some details removed):
upstream up-stores-a {
    server 127.0.0.1:51285;
    server 127.0.0.1:51284 backup;
    keepalive 32;
}

upstream up-stores-b {
    server 127.0.0.1:51284;
    server 127.0.0.1:51285 backup;
    keepalive 32;
}

server {
    server_name stores.{{url}};

    location / {
        proxy_pass http://up-stores-a;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection keep-alive;
        proxy_set_header Host $http_host;
        proxy_cache_bypass $http_upgrade;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        access_log /var/log/nginx/access-stores.log upstreamlog;
    }
}
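One thing I am not sure about in this config: the nginx documentation for the upstream keepalive directive says that, for plain HTTP proxying, proxy_http_version should be 1.1 and the Connection header should be cleared so idle upstream connections can actually be reused, whereas the block above hardcodes Connection keep-alive. I'm not claiming this is the cause of the 502s, just noting it for comparison. A minimal sketch of the documented pattern (values illustrative; if WebSocket upgrades are needed, the Connection header has to be set conditionally via a map instead):

    upstream up-stores-a {
        server 127.0.0.1:51285;
        keepalive 32;
    }

    server {
        location / {
            proxy_pass http://up-stores-a;
            proxy_http_version 1.1;
            # cleared so nginx can reuse idle keepalive connections to the upstream
            proxy_set_header Connection "";
        }
    }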
Here is a chunk of the logs:
[14/Apr/2020:18:42:21 +0000] {{ip}} - - - {{url}} {{url}} to: 127.0.0.1:51285, up-stores-a: GET /cc-small/none HTTP/1.1 502 upstream_response_time 1.004, 0.000 msec 1586889741.801 request_time 1.006
[14/Apr/2020:18:42:21 +0000] {{ip}} - - - {{url}} {{url}} to: 127.0.0.1:51285, up-stores-a: GET /cc-small/none HTTP/1.1 502 upstream_response_time 0.556, 0.000 msec 1586889741.804 request_time 0.558
[14/Apr/2020:18:42:21 +0000] {{ip}}- - - {{url}} {{url}} to: 127.0.0.1:51285, up-stores-a: GET /cc-small/none HTTP/1.1 502 upstream_response_time 0.812, 0.000 msec 1586889741.804 request_time 0.817
[14/Apr/2020:18:42:21 +0000] {{ip}} - - - {{url}} {{url}} to: 127.0.0.1:51285, up-stores-a: OPTIONS /cc-small/none HTTP/1.1 502 upstream_response_time 0.700, 0.000 msec 1586889741.805 request_time 0.707
[14/Apr/2020:18:42:21 +0000] {{ip}} - - - {{url}} {{url}} to: 127.0.0.1:51285, up-stores-a: GET /cc-small/none HTTP/1.1 502 upstream_response_time 0.500, 0.000 msec 1586889741.805 request_time 0.503
[14/Apr/2020:18:42:21 +0000] {{ip}}- - - {{url}} {{url}} to: up-stores-a: OPTIONS /cc-small/none HTTP/1.1 502 upstream_response_time 0.000 msec 1586889741.861 request_time 0.000
[14/Apr/2020:18:42:21 +0000] {{ip}} - - - {{url}} {{url}} to: up-stores-a: OPTIONS /cc-small/none HTTP/1.1 502 upstream_response_time 0.000 msec 1586889741.885 request_time 0.000
[14/Apr/2020:18:42:21 +0000] {{ip}}- - - {{url}} {{url}} to: up-stores-a: GET /cc-small-many-sku/33249675870348,33249887387788,32995463921804 HTTP/1.1 502 upstream_response_time 0.000 msec 1586889741.889 request_time 0.000
[14/Apr/2020:18:42:21 +0000] {{ip}}- - - {{url}} {{url}} to: up-stores-a: GET /cc-small/none HTTP/1.1 502 upstream_response_time 0.000 msec 1586889741.913 request_time 0.000
[14/Apr/2020:18:42:21 +0000] {{ip}}- - - {{url}} {{url}} to: up-stores-a: GET /cc-small/none HTTP/1.1 502 upstream_response_time 0.000 msec 1586889741.979 request_time 0.000
[14/Apr/2020:18:42:22 +0000] {{ip}} - - - {{url}} {{url}} to: up-stores-a: OPTIONS /cc-small/none HTTP/1.1 502 upstream_response_time 0.000 msec 1586889742.079 request_time 0.000
[14/Apr/2020:18:42:22 +0000] {{ip}}- - - {{url}} {{url}} to: up-stores-a: OPTIONS /cc-small/none HTTP/1.1 502 upstream_response_time 0.000 msec 1586889742.088 request_time 0.000
[14/Apr/2020:18:42:22 +0000] {{ip}} - - - {{url}} {{url}} to: up-stores-a: OPTIONS /cc-small-many-sku/159958564873 HTTP/1.1 502 upstream_response_time 0.000 msec 1586889742.090 request_time 0.000
[14/Apr/2020:18:42:22 +0000] {{ip}}- - - {{url}} {{url}} to: up-stores-a: OPTIONS /cc-small-many-sku/15937965064282,15937965097050 HTTP/1.1 502 upstream_response_time 0.000 msec 1586889742.274 request_time 0.000
[14/Apr/2020:18:42:22 +0000] {{ip}} - - - {{url}} {{url}} to: up-stores-a: OPTIONS /cc-small-many-sku/31700323500129 HTTP/1.1 502 upstream_response_time 0.000 msec 1586889742.288 request_time 0.000
[14/Apr/2020:18:42:22 +0000] {{ip}}- - - {{url}} {{url}} to: up-stores-a: GET /cc-small/none HTTP/1.1 502 upstream_response_time 0.000 msec 1586889742.373 request_time 0.000
[14/Apr/2020:18:42:22 +0000] {{ip}} - - - {{url}} {{url}} to: up-stores-a: OPTIONS /cc-small-many-sku/18157812801 HTTP/1.1 502 upstream_response_time 0.000 msec 1586889742.437 request_time 0.000
[14/Apr/2020:18:42:22 +0000] {{ip}} - - - {{url}} {{url}} to: up-stores-a: OPTIONS /cc-small/none HTTP/1.1 502 upstream_response_time 0.000 msec 1586889742.461 request_time 0.000
[14/Apr/2020:18:42:22 +0000] {{ip}} - - - {{url}} {{url}} to: up-stores-a: OPTIONS /cc-small/none HTTP/1.1 502 upstream_response_time 0.000 msec 1586889742.488 request_time 0.000
[14/Apr/2020:18:42:22 +0000] {{ip}} - - - {{url}} {{url}} to: up-stores-a: OPTIONS /cc-small/none HTTP/1.1 502 upstream_response_time 0.000 msec 1586889742.517 request_time 0.000
[14/Apr/2020:18:42:22 +0000] {{ip}} - - - {{url}} {{url}} to: up-stores-a: OPTIONS /cc-small/none HTTP/1.1 502 upstream_response_time 0.000 msec 1586889742.603 request_time 0.000
[14/Apr/2020:18:42:23 +0000] {{ip}} - - - {{url}} {{url}} to: up-stores-a: OPTIONS /cc-small/none HTTP/1.1 502 upstream_response_time 0.000 msec 1586889743.132 request_time 0.000
I also found that grep " up-stores-a" access-stores.log returns all of the 502s, so it seems to fail whenever the "to" part contains up-stores-a. I will scale up further and see whether that helps, but there is no CPU spike, memory spike, network spike, or any other metric pointing at a root cause for the .NET Core service not responding. Based on the access log, the number of requests in the minute the errors occur is no higher than at any other moment when there are no errors.
Also, is it odd that the request time is 0.000 and the request still fails? It's as if it never even tried.
What other data could I collect to help me find the root cause?
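For what it's worth, a 502 logged with upstream_response_time 0.000 usually means nginx never got a usable connection to the backend at all (refused or dropped immediately), and the reason is written to the error log rather than the access log, so that is probably the next place to look. A sketch, assuming the default error log location (adjust the path to wherever error_log actually points):

    grep "while connecting to upstream\|while reading response header from upstream" /var/log/nginx/error.log

Messages such as "connect() failed (111: Connection refused)", "upstream prematurely closed connection", or "no live upstreams" would each point to a different failure mode in the .NET Core service behind the upstream.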
Answer 1
As with so many questions like this one, the problem was not caused by NGINX but by the service.
It turned out that upgrading to .NET Core 3 and getting the Redis timeouts under control was the winning move here. My guess is that .NET Core 3 includes some stability fixes that prevent the 502 errors.