Redis Sentinel false failover

2024-5-30

I have run into a problem with Redis.

I have 3 servers. Each server runs 10 Redis instances on different ports, plus one Sentinel instance.

There are also 5 application servers. Each one runs HAProxy, which checks which Redis server in the pool is the master and redirects traffic to it.

As a result, the application always connects to 127.0.0.1:port, no matter which server is currently the Redis master.
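
As a rough illustration of that kind of check: the sketch below assumes the HAProxy health check decides on the `role:` field of the `INFO replication` reply (the actual HAProxy configuration is not shown in this post). `parse_role` and `pick_master` are hypothetical helper names, and the reply texts are made-up samples.

```python
def parse_role(info_text):
    """Return the value of the 'role:' field from INFO replication output, or None."""
    for line in info_text.splitlines():
        line = line.strip()
        if line.startswith("role:"):
            return line.split(":", 1)[1]
    return None


def pick_master(replies):
    """Given {address: INFO replication text}, return the addresses reporting role:master."""
    return [addr for addr, text in replies.items() if parse_role(text) == "master"]


# Hypothetical replies from the three instances listening on port 6388.
replies = {
    "ip1:6388": "# Replication\r\nrole:master\r\nconnected_slaves:2\r\n",
    "ip2:6388": "# Replication\r\nrole:slave\r\nmaster_host:ip1\r\n",
    "ip3:6388": "# Replication\r\nrole:slave\r\nmaster_host:ip1\r\n",
}
masters = pick_master(replies)
print(masters)
```

If the list ever holds more than one address (for example in the middle of a failover), a proxy that relies only on this check could briefly route to a stale master.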

I am seeing false-positive failovers. Here is my log:

[37338] 29 Apr 07:51:36.813 # Connection with slave ip1:6388 lost.
[37338] 29 Apr 07:51:37.399 # Connection with slave ip3:6388 lost.
[7244] 29 Apr 07:51:38.809 * DB saved on disk
[7244] 29 Apr 07:51:38.814 * RDB: 35 MB of memory used by copy-on-write
[37338] 29 Apr 07:51:38.918 * Background saving terminated with success
[37338] 29 Apr 07:51:47.451 * SLAVE OF 192.168.234.ip1:6388 enabled (user request)
[37338] 29 Apr 07:51:47.457 # CONFIG REWRITE executed with success.
[37338] 29 Apr 07:51:47.541 * Connecting to MASTER ip1:6388
[37338] 29 Apr 07:51:47.541 * MASTER <-> SLAVE sync started
[37338] 29 Apr 07:51:47.541 * Non blocking connect for SYNC fired the event.
[37338] 29 Apr 07:51:47.541 * Master replied to PING, replication can continue...
[37338] 29 Apr 07:51:47.541 * Partial resynchronization not possible (no cached master)
[37338] 29 Apr 07:51:47.542 * Full resync from master: 0be90102031e58ef358f0ea48e58eeae869902d1:157705847
[37338] 29 Apr 07:51:51.730 * MASTER <-> SLAVE sync: receiving 85082188 bytes from master
[37338] 29 Apr 07:51:52.588 * MASTER <-> SLAVE sync: Flushing old data
[37338] 29 Apr 07:51:53.011 * MASTER <-> SLAVE sync: Loading DB in memory
[37338] 29 Apr 07:51:54.401 * MASTER <-> SLAVE sync: Finished with success
[37338] 29 Apr 07:52:39.072 * 10000 changes in 60 seconds. Saving...
[37338] 29 Apr 07:52:39.083 * Background saving started by pid 27656
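
To read the timing out of a log like this, a small parser can help. The sketch below assumes every line follows the `[pid] day month HH:MM:SS.mmm level message` layout seen above and that all lines fall on the same day; `parse` and `seconds_between` are hypothetical helpers, and the sample lines are copied from the log.

```python
import re
from datetime import datetime

# [pid] day month HH:MM:SS.mmm level message
LINE_RE = re.compile(r"^\[(\d+)\] \d+ \w+ (\d\d:\d\d:\d\d\.\d+) (\S) (.*)$")


def parse(line):
    """Split one log line into (pid, time, level, message); None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    pid, clock, level, message = m.groups()
    return int(pid), datetime.strptime(clock, "%H:%M:%S.%f"), level, message


def seconds_between(lines, start_text, end_text):
    """Seconds from the first message containing start_text to the next one containing end_text."""
    start = end = None
    for line in lines:
        parsed = parse(line)
        if parsed is None:
            continue
        _, when, _, message = parsed
        if start is None and start_text in message:
            start = when
        elif start is not None and end is None and end_text in message:
            end = when
    if start is None or end is None:
        return None
    return (end - start).total_seconds()


log = [
    "[37338] 29 Apr 07:51:36.813 # Connection with slave ip1:6388 lost.",
    "[37338] 29 Apr 07:51:37.399 # Connection with slave ip3:6388 lost.",
    "[37338] 29 Apr 07:51:47.451 * SLAVE OF 192.168.234.ip1:6388 enabled (user request)",
    "[37338] 29 Apr 07:51:47.541 * MASTER <-> SLAVE sync started",
    "[37338] 29 Apr 07:51:54.401 * MASTER <-> SLAVE sync: Finished with success",
]

demotion_delay = seconds_between(log, "Connection with slave", "SLAVE OF")
resync_duration = seconds_between(log, "sync started", "Finished with success")
print(demotion_delay, resync_duration)
```

On these lines, the instance with pid 37338 appears to lose its slaves and is told to replicate from ip1 about 10.6 seconds later; the full resync then takes just under 7 seconds.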

But nothing happened that could have caused this.

I already had problems when the Sentinel timeout was too short (100 ms), so I changed it to 5 seconds. There had been no timeouts until today.
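
With 10 monitored masters per Sentinel, it may be worth confirming that the 5-second value really applies to every one of them. The sketch below assumes the "Sentinel timeout" is the `sentinel down-after-milliseconds <master-name> <milliseconds>` directive; the master names and the address in the sample configuration are hypothetical.

```python
def down_after_ms(sentinel_conf_text):
    """Map each monitored master name to its down-after-milliseconds value."""
    result = {}
    for line in sentinel_conf_text.splitlines():
        parts = line.split()
        if len(parts) == 4 and parts[:2] == ["sentinel", "down-after-milliseconds"]:
            result[parts[2]] = int(parts[3])
    return result


def too_short(timeouts, minimum_ms):
    """Return the master names whose timeout is below minimum_ms, sorted by name."""
    return sorted(name for name, ms in timeouts.items() if ms < minimum_ms)


# Hypothetical sentinel.conf fragment: one master was left at the old 100 ms value.
sample_conf = """
sentinel monitor redis-6388 10.0.0.1 6388 2
sentinel down-after-milliseconds redis-6388 5000
sentinel monitor redis-6389 10.0.0.1 6389 2
sentinel down-after-milliseconds redis-6389 100
"""

timeouts = down_after_ms(sample_conf)
suspects = too_short(timeouts, minimum_ms=5000)
print(timeouts, suspects)
```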

Also, the 5 servers are redundant. When I shut down half of them, the application starts to slow down. I see no CPU, memory, or disk problems.

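Yesterday I had a problem where traffic dropped when half of the application servers were down. The servers would sit idle for a few seconds and then work normally again. Nginx (the front-end server) uses proxy_pass to balance traffic across the application servers. I suspect the port range (netstat counts more than 90k connections). In addition, there are more than 30k connections in TIME_WAIT from 127.0.0.1:someport to 127.0.0.1:redisport. These are the application -> HAProxy (Redis) connections.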

net.ipv4.ip_local_port_range = 10000    65535
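
A back-of-the-envelope check of this range, as a sketch under stated assumptions: every request opens a fresh connection to the same 127.0.0.1:redisport, closed sockets stay in TIME_WAIT for 60 seconds (the usual Linux value), and no TIME_WAIT reuse is in effect. The function names are hypothetical.

```python
def ephemeral_ports(low, high):
    """Number of local ports available in an inclusive ip_local_port_range."""
    return high - low + 1


def max_new_connections_per_second(low, high, time_wait_seconds=60):
    """Rough ceiling on sustained new connections to ONE destination ip:port.

    Assumes each closed connection pins its local port for time_wait_seconds
    and that ports are not reused while in TIME_WAIT.
    """
    return ephemeral_ports(low, high) / time_wait_seconds


ports = ephemeral_ports(10000, 65535)
rate = max_new_connections_per_second(10000, 65535)
print(ports, rate)
```

Under these assumptions the range gives 55536 ports, so roughly 925 new connections per second toward a single local Redis port before the client side runs out. More than 30k sockets in TIME_WAIT toward one port is already over half of that budget.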

I have also enabled:

net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_tw_reuse = 1
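
One way to see whether these settings actually reduce the TIME_WAIT pile-up toward the Redis ports is to count such sockets per destination port before and after. The sketch below assumes Linux and parses text in the `/proc/net/tcp` format, where state code `06` means TIME_WAIT and ports are hexadecimal; the sample rows are made up and `time_wait_by_remote_port` is a hypothetical helper.

```python
from collections import Counter

TIME_WAIT = "06"  # state code for TIME_WAIT in /proc/net/tcp


def time_wait_by_remote_port(proc_net_tcp_text):
    """Count TIME_WAIT sockets per remote port from /proc/net/tcp-style text."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4 or fields[3] != TIME_WAIT:
            continue
        # rem_address looks like 0100007F:18F4 (hex address : hex port)
        remote_port = int(fields[2].rsplit(":", 1)[1], 16)
        counts[remote_port] += 1
    return counts


# Made-up sample: two TIME_WAIT sockets to port 6388 (0x18F4),
# one ESTABLISHED (01) connection and one LISTEN (0A) socket.
sample = """\
  sl  local_address rem_address   st tx_queue rx_queue
   0: 0100007F:9C40 0100007F:18F4 06 00000000:00000000
   1: 0100007F:9C41 0100007F:18F4 06 00000000:00000000
   2: 0100007F:9C42 0100007F:18F4 01 00000000:00000000
   3: 0100007F:18F4 00000000:0000 0A 00000000:00000000
"""

counts = time_wait_by_remote_port(sample)
print(dict(counts))
```

On a live server the same function could be fed `open("/proc/net/tcp").read()`.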

After that the traffic loss stopped, but the application is still too slow, with no obvious cause (CPU, memory, disk ... all look fine).

What else can I check?
