Redis cluster failing over too frequently

2024-6-19

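Out of 12 masters, 4 failed at the same time for no apparent reason. The slowlog does not show much. We have only 1 slave configured, and whenever a failover happens, the master link status stays down for more than 1 minute. I suspect the application is running some blocking commands and that this is what causes it. I need help figuring out what is going on.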

Slave log:

484:C 11 Apr 04:53:03.068 * DB saved on disk
484:C 11 Apr 04:53:03.188 * RDB: 792 MB of memory used by copy-on-write
1234:S 11 Apr 04:53:03.927 * Background saving terminated with success
1234:S 11 Apr 04:53:44.041 * FAIL message received from 5f069f8a114b8443dfe58ab6c09088d1fad27862 about ae636a7d3eb31cde02ed27a0c29d2c06c8e7f1e6
1234:S 11 Apr 04:53:44.042 # Cluster state changed: fail
1234:S 11 Apr 04:53:44.099 # Start of election delayed for 918 milliseconds (rank #0, offset 7442811704468).
1234:S 11 Apr 04:53:45.117 # Starting a failover election for epoch 786.
1234:S 11 Apr 04:53:45.125 # Failover election won: I'm the new master.
1234:S 11 Apr 04:53:45.125 # configEpoch set to 786 after successful failover
1234:M 11 Apr 04:53:45.125 # Setting secondary replication ID to fa3ca72957d6efb1791607546cebaeb715647af4, valid up to offset: 7442811704469. New replication ID is b303d25e8fcc1f69b46201e423decd2ff3b7e928
1234:M 11 Apr 04:53:45.126 # Connection with master lost.
1234:M 11 Apr 04:53:45.126 * Caching the disconnected master state.
1234:M 11 Apr 04:53:45.128 * Discarding previously cached master state.
1234:M 11 Apr 04:53:45.128 # Cluster state changed: ok

Master log:

21719:M 11 Apr 04:52:38.960 * Background saving terminated with success
21719:M 11 Apr 04:53:39.075 * 10000 changes in 60 seconds. Saving...
21719:M 11 Apr 04:53:39.171 * Background saving started by pid 5247
21719:M 11 Apr 04:53:44.042 * Marking node ae636a7d3eb31cde02ed27a0c29d2c06c8e7f1e6 as failing (quorum reached).
21719:M 11 Apr 04:53:44.042 # Cluster state changed: fail
21719:M 11 Apr 04:53:45.124 # Failover auth granted to 4780ee3be12c243751617b84308aa73270fda065 for epoch 786
21719:M 11 Apr 04:53:45.162 # Cluster state changed: ok
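
To put numbers on the master log, here is a minimal sketch that parses the lines above and measures two gaps: from the end of the previous background save to the next save trigger, and from the start of that save to the moment the other node was marked as failing. The regular expression and the helpers `parse` and `gap_ms` are assumptions written for this log format only. The result is a timeline, not a diagnosis.

```python
import re
from datetime import datetime, timedelta

# Lines copied verbatim from the master log above.
LOG = """\
21719:M 11 Apr 04:52:38.960 * Background saving terminated with success
21719:M 11 Apr 04:53:39.075 * 10000 changes in 60 seconds. Saving...
21719:M 11 Apr 04:53:39.171 * Background saving started by pid 5247
21719:M 11 Apr 04:53:44.042 * Marking node ae636a7d3eb31cde02ed27a0c29d2c06c8e7f1e6 as failing (quorum reached).
"""

# pid:role, day, month, time, level marker, message
LINE = re.compile(r"^\d+:[A-Z] \d+ \w+ (\d{2}:\d{2}:\d{2}\.\d{3}) [*#] (.*)$")


def parse(log: str) -> list[tuple[datetime, str]]:
    """Return (time, message) pairs for every line that matches LINE."""
    events = []
    for line in log.splitlines():
        m = LINE.match(line)
        if m:
            when = datetime.strptime(m.group(1), "%H:%M:%S.%f")
            events.append((when, m.group(2)))
    return events


def gap_ms(events, first_prefix: str, second_prefix: str) -> int:
    """Milliseconds between the first messages starting with each prefix."""
    first = next(t for t, msg in events if msg.startswith(first_prefix))
    second = next(t for t, msg in events if msg.startswith(second_prefix))
    return (second - first) // timedelta(milliseconds=1)


events = parse(LOG)
print(gap_ms(events, "Background saving terminated", "10000 changes"))  # 60115
print(gap_ms(events, "Background saving started", "Marking node"))      # 4871
```

The first gap shows a new snapshot being triggered about 60 seconds after the previous one finished, which matches the `save 60 10000` rule. The second gap, about 4.9 seconds, is close to the configured `cluster-node-timeout` of 5000 ms. That may be a coincidence, and these lines alone do not establish a cause.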

Redis config:

# Snapshotting
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb

# Replication
slave-serve-stale-data yes
slave-read-only yes
repl-disable-tcp-nodelay no
slave-priority 100
min-slaves-max-lag 10

# Security

# Limits
maxclients 60000
maxmemory 12602mb
maxmemory-policy allkeys-lru

# Append Only Mode
appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

# Lua
lua-time-limit 5000

# Slow Log
slowlog-log-slower-than 10000
slowlog-max-len 128

# Event Notification
notify-keyspace-events ""

# Advanced
hash-max-ziplist-entries 512
hash-max-ziplist-value 64
list-max-ziplist-entries 512
list-max-ziplist-value 64
set-max-intset-entries 512
zset-max-ziplist-entries 128
zset-max-ziplist-value 64
activerehashing yes
client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 0 0 0
client-output-buffer-limit pubsub 32mb 8mb 60
hz 10
aof-rewrite-incremental-fsync yes

# Cluster
cluster-enabled yes
cluster-config-file nodes.conf
cluster-node-timeout 5000
cluster-slave-validity-factor 1
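
For reference, the two cluster settings above combine into a limit on how stale a slave may be and still attempt a failover. The stock redis.conf comments describe that limit as (node-timeout * slave-validity-factor) + repl-ping-slave-period, with a factor of 0 meaning the slave always tries. The sketch below just evaluates that formula. The function name is made up, and `repl-ping-slave-period` is assumed to be at its default of 10 seconds because it does not appear in the config shown.

```python
from typing import Optional


def max_disconnect_for_failover_ms(
    node_timeout_ms: int,
    validity_factor: int,
    repl_ping_period_s: int = 10,  # assumed default; not in the posted config
) -> Optional[int]:
    """Longest master disconnection (ms) after which a slave still tries to fail over.

    Returns None when validity_factor is 0, meaning no limit is applied.
    """
    if validity_factor == 0:
        return None
    return node_timeout_ms * validity_factor + repl_ping_period_s * 1000


# Values from the config above: cluster-node-timeout 5000, cluster-slave-validity-factor 1
print(max_disconnect_for_failover_ms(5000, 1))  # 15000
```

With these values, a slave that has been out of touch with its master for more than about 15 seconds would not attempt a failover at all. That is a fairly narrow window, given that there is only one slave.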

No out-of-memory issues so far.
