I am currently using Nginx as a load balancer to distribute traffic across 3 nodes running a NodeJS API.
The Nginx instance runs on node1, and every request is sent to node1. I see roughly 700k requests over a 2-hour window, and nginx is configured to rotate between node1, node2 and node3 in round-robin fashion. Here is conf.d/deva.conf:
upstream deva_api {
    server 10.8.0.30:5555 fail_timeout=5s max_fails=3;
    server 10.8.0.40:5555 fail_timeout=5s max_fails=3;
    server localhost:5555;
    keepalive 300;
}

server {
    listen 8000;

    location /log_pages {
        proxy_redirect off;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_http_version 1.1;
        proxy_set_header Connection "";

        add_header 'Access-Control-Allow-Origin' '*';
        add_header 'Access-Control-Allow-Methods' 'GET, POST, PATCH, PUT, DELETE, OPTIONS';
        add_header 'Access-Control-Allow-Headers' 'Authorization,Content-Type,Origin,X-Auth-Token';
        add_header 'Access-Control-Allow-Credentials' 'true';

        if ($request_method = OPTIONS ) {
            return 200;
        }

        proxy_pass http://deva_api;
        proxy_set_header Connection "Keep-Alive";
        proxy_set_header Proxy-Connection "Keep-Alive";

        auth_basic "Restricted";                    #For Basic Auth
        auth_basic_user_file /etc/nginx/.htpasswd;  #For Basic Auth
    }
}
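A side note on the keepalive pool above: per the nginx documentation, reusing upstream connections requires HTTP/1.1 and an empty Connection header toward the upstream. My location sets Connection "" but later also sets Connection "Keep-Alive" before proxying, so I am not certain the pool is actually being used. For comparison only, the documented minimal form looks like this (this is not my current config):

upstream deva_api {
    server 10.8.0.30:5555;
    server 10.8.0.40:5555;
    server localhost:5555;
    keepalive 300;                      # idle upstream connections kept open per worker process
}

location /log_pages {
    proxy_http_version 1.1;             # keepalive to upstreams requires HTTP/1.1
    proxy_set_header Connection "";     # clear the Connection header sent to the upstream
    proxy_pass http://deva_api;
}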
Below is the nginx.conf configuration:
user www-data;
worker_processes auto;
pid /run/nginx.pid;
include /etc/nginx/modules-enabled/*.conf;
worker_rlimit_nofile 65535;

events {
    worker_connections 65535;
    use epoll;
    multi_accept on;
}

http {
    ##
    # Basic Settings
    ##
    sendfile on;
    tcp_nopush on;
    tcp_nodelay on;
    keepalive_timeout 120;
    send_timeout 120;
    types_hash_max_size 2048;
    server_tokens off;
    client_max_body_size 100m;
    client_body_buffer_size 5m;
    client_header_buffer_size 5m;
    large_client_header_buffers 4 1m;

    open_file_cache max=200000 inactive=20s;
    open_file_cache_valid 30s;
    open_file_cache_min_uses 2;
    open_file_cache_errors on;

    reset_timedout_connection on;

    include /etc/nginx/mime.types;
    default_type application/octet-stream;

    ##
    # SSL Settings
    ##
    ssl_protocols TLSv1 TLSv1.1 TLSv1.2; # Dropping SSLv3, ref: POODLE
    ssl_prefer_server_ciphers on;

    ##
    # Logging Settings
    ##
    access_log /var/log/nginx/access.log;
    error_log /var/log/nginx/error.log;

    ##
    # Gzip Settings
    ##
    gzip on;

    include /etc/nginx/conf.d/*.conf;
    include /etc/nginx/sites-enabled/*;
}
The problem is that, with this configuration, I get hundreds of errors like the following in error.log:
upstream prematurely closed connection while reading response header from upstream
but only for node2 and node3. I have already tried the following tests:
- increasing the number of concurrent API processes on each node (I actually use PM2 as the in-node balancer)
- removing one node to make nginx's job easier
- applying weights in nginx (a rough sketch of what I mean follows this list)
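For reference, the weighted variant I tried looked roughly like this (the weights below are illustrative, not the exact values I used):

upstream deva_api {
    server 10.8.0.30:5555 weight=1 fail_timeout=5s max_fails=3;
    server 10.8.0.40:5555 weight=1 fail_timeout=5s max_fails=3;
    server localhost:5555 weight=4;     # bias more traffic toward the local node
    keepalive 300;
}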
None of these improved the results. During these tests I noticed that the errors only occurred for the 2 remote nodes (node2 and node3), so I tried taking them out of the equation. The result was that I no longer got that error, but I started getting 2 different ones:
recv() failed (104: Connection reset by peer) while reading response header from upstream
and
writev() failed (32: Broken pipe) while sending request to upstream
The issue seems to come from a lack of API capacity on node1: the API processes probably cannot answer all the inbound traffic before the client timeout (this is my guess). That said, I increased the number of concurrent API processes on node1 and the results were better than before, but I still get the latter 2 errors, and I cannot increase the concurrent API processes on node1 any further.
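For context, I have not overridden any proxy timeouts, so the nginx defaults should apply while it waits on an upstream (listed here for reference only; these lines are not in my config):

proxy_connect_timeout 60s;   # default: time allowed to establish the upstream connection
proxy_send_timeout    60s;   # default: timeout between two successive writes to the upstream
proxy_read_timeout    60s;   # default: timeout between two successive reads from the upstream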
So the question is: why can't I use nginx as a load balancer across all the nodes? Did I make a mistake in the nginx configuration? Is there some other problem I am not noticing?
EDIT: I ran some network tests between the 3 nodes. The nodes communicate with each other over OpenVPN:
ping:
node1->node3
PING 10.8.0.40 (10.8.0.40) 56(84) bytes of data.
64 bytes from 10.8.0.40: icmp_seq=1 ttl=64 time=2.85 ms
64 bytes from 10.8.0.40: icmp_seq=2 ttl=64 time=1.85 ms
64 bytes from 10.8.0.40: icmp_seq=3 ttl=64 time=3.17 ms
64 bytes from 10.8.0.40: icmp_seq=4 ttl=64 time=3.21 ms
64 bytes from 10.8.0.40: icmp_seq=5 ttl=64 time=2.68 ms
node1->node2
PING 10.8.0.30 (10.8.0.30) 56(84) bytes of data.
64 bytes from 10.8.0.30: icmp_seq=1 ttl=64 time=2.16 ms
64 bytes from 10.8.0.30: icmp_seq=2 ttl=64 time=3.08 ms
64 bytes from 10.8.0.30: icmp_seq=3 ttl=64 time=10.9 ms
64 bytes from 10.8.0.30: icmp_seq=4 ttl=64 time=3.11 ms
64 bytes from 10.8.0.30: icmp_seq=5 ttl=64 time=3.25 ms
node2->node1
PING 10.8.0.12 (10.8.0.12) 56(84) bytes of data.
64 bytes from 10.8.0.12: icmp_seq=1 ttl=64 time=2.30 ms
64 bytes from 10.8.0.12: icmp_seq=2 ttl=64 time=8.30 ms
64 bytes from 10.8.0.12: icmp_seq=3 ttl=64 time=2.37 ms
64 bytes from 10.8.0.12: icmp_seq=4 ttl=64 time=2.42 ms
64 bytes from 10.8.0.12: icmp_seq=5 ttl=64 time=3.37 ms
node2->node3
PING 10.8.0.40 (10.8.0.40) 56(84) bytes of data.
64 bytes from 10.8.0.40: icmp_seq=1 ttl=64 time=2.86 ms
64 bytes from 10.8.0.40: icmp_seq=2 ttl=64 time=4.01 ms
64 bytes from 10.8.0.40: icmp_seq=3 ttl=64 time=5.37 ms
64 bytes from 10.8.0.40: icmp_seq=4 ttl=64 time=2.78 ms
64 bytes from 10.8.0.40: icmp_seq=5 ttl=64 time=2.87 ms
node3->node1
PING 10.8.0.12 (10.8.0.12) 56(84) bytes of data.
64 bytes from 10.8.0.12: icmp_seq=1 ttl=64 time=8.24 ms
64 bytes from 10.8.0.12: icmp_seq=2 ttl=64 time=2.72 ms
64 bytes from 10.8.0.12: icmp_seq=3 ttl=64 time=2.63 ms
64 bytes from 10.8.0.12: icmp_seq=4 ttl=64 time=2.91 ms
64 bytes from 10.8.0.12: icmp_seq=5 ttl=64 time=3.14 ms
node3->node2
PING 10.8.0.30 (10.8.0.30) 56(84) bytes of data.
64 bytes from 10.8.0.30: icmp_seq=1 ttl=64 time=2.73 ms
64 bytes from 10.8.0.30: icmp_seq=2 ttl=64 time=2.38 ms
64 bytes from 10.8.0.30: icmp_seq=3 ttl=64 time=3.22 ms
64 bytes from 10.8.0.30: icmp_seq=4 ttl=64 time=2.76 ms
64 bytes from 10.8.0.30: icmp_seq=5 ttl=64 time=2.97 ms
Bandwidth check with iperf:
node1 -> node2
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 229 MBytes 192 Mbits/sec
node2->node1
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 182 MBytes 152 Mbits/sec
node3->node1
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 160 MBytes 134 Mbits/sec
node3->node2
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 260 MBytes 218 Mbits/sec
node2->node3
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 241 MBytes 202 Mbits/sec
node1->node3
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 187 MBytes 156 Mbits/sec
It looks like the OpenVPN tunnel is the bottleneck, since the same tests over eth reach about 1 Gbit/s. That said, I have followed this guide from community.openvpn.net, but the bandwidth I got was only about twice the earlier measurement.
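The kind of tunnel tuning I experimented with, following that guide, was along these lines (the values shown are examples from common tuning advice, not necessarily the exact ones I ended up with; they go on both ends of the tunnel):

tun-mtu 6000       # larger tunnel MTU
fragment 0         # disable OpenVPN-internal fragmentation
mssfix 0           # disable MSS clamping, relying on tun-mtu/fragment instead
sndbuf 393216      # larger socket send buffer
rcvbuf 393216      # larger socket receive buffer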
I would like to keep OpenVPN in place, so are there other tweaks that could increase the network bandwidth, or other changes to the nginx configuration, that would make this work properly?
Answer 1
The problem was caused by the slow OpenVPN network. By routing the requests over the internet instead, after adding authentication on each of the different servers, we reduced the errors to 1-2 per day, and those are now probably caused by some other issue.
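In practice that meant pointing the upstream at the nodes' public addresses instead of the 10.8.0.x tunnel addresses, with each backend protected by its own authentication; roughly like this (the hostnames below are placeholders):

upstream deva_api {
    server node2.example.com:5555 fail_timeout=5s max_fails=3;   # public address (placeholder)
    server node3.example.com:5555 fail_timeout=5s max_fails=3;   # public address (placeholder)
    server localhost:5555;
    keepalive 300;
}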