MariaDB cluster experiencing timeouts between nodes

2024-5-31

I'm having trouble diagnosing a problem with our MariaDB cluster and would appreciate some advice.

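We run a 3-node MariaDB cluster. Each node is on its own dedicated ESXi server, and the nodes are connected over a local network in the same data center. Recently they have started to hit occasional timeout errors. We have looked into many things but have not been able to reach a conclusion.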

Here is the detailed log of one timeout error:

181009 18:35:14 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') connection to peer 0a520ba7 with addr tcp://192.168.[censor]:4567 timed out, no messages seen in PT3S
181009 18:35:14 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.[censor]:4567
181009 18:35:15 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') reconnecting to 0a520ba7 (tcp://192.168.[censor]:4567), attempt 0
181009 18:35:17 [Note] WSREP: evs::proto(b1d2aacb, GATHER, view_id(REG,0a520ba7,147)) suspecting node: 0a520ba7
181009 18:35:17 [Note] WSREP: evs::proto(b1d2aacb, GATHER, view_id(REG,0a520ba7,147)) suspected node without join message, declaring inactive
181009 18:35:18 [Note] WSREP: declaring e0d6a63b at tcp://192.168.[censor]:4567 stable
181009 18:35:18 [Note] WSREP: Node b1d2aacb state prim
181009 18:35:18 [Note] WSREP: view(view_id(PRIM,b1d2aacb,148) memb {
    b1d2aacb,0
    e0d6a63b,0
} joined {
} left {
} partitioned {
    0a520ba7,0
})
181009 18:35:18 [Note] WSREP: save pc into disk
181009 18:35:18 [Note] WSREP: forgetting 0a520ba7 (tcp://192.168.[censor]:4567)
181009 18:35:18 [Note] WSREP: deleting entry tcp://192.168.[censor]:4567
181009 18:35:18 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') turning message relay requesting off
181009 18:35:18 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
181009 18:35:18 [Note] WSREP: STATE_EXCHANGE: sent state UUID: a226ba90-cba6-11e8-af4f-3751d36b7f83
181009 18:35:18 [Note] WSREP: STATE EXCHANGE: sent state msg: a226ba90-cba6-11e8-af4f-3751d36b7f83
181009 18:35:18 [Note] WSREP: STATE EXCHANGE: got state msg: a226ba90-cba6-11e8-af4f-3751d36b7f83 from 0 (CLUSTER002)
181009 18:35:18 [Note] WSREP: STATE EXCHANGE: got state msg: a226ba90-cba6-11e8-af4f-3751d36b7f83 from 1 (CLUSTER003)
181009 18:35:18 [Note] WSREP: Quorum results:
    version    = 4,
    component  = PRIMARY,
    conf_id    = 126,
    members    = 2/2 (joined/total),
    act_id     = 781947656,
    last_appl. = 781947602,
    protocols  = 0/7/3 (gcs/repl/appl),
    group UUID = efec8dfa-4c2b-11e7-8f56-a7bf24f4c9a9
181009 18:35:18 [Note] WSREP: Flow-control interval: [23, 23]
181009 18:35:18 [Note] WSREP: New cluster view: global state: efec8dfa-4c2b-11e7-8f56-a7bf24f4c9a9:781947656, view# 127: Primary, number of nodes: 2, my index: 0, protocol version 3
181009 18:35:18 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
181009 18:35:18 [Note] WSREP: REPL Protocols: 7 (3, 2)
181009 18:35:18 [Note] WSREP: Assign initial position for certification: 781947656, protocol version: 3
181009 18:35:18 [Note] WSREP: Service thread queue flushed.
181009 18:35:20 [Note] WSREP:  cleaning up 0a520ba7 (tcp://192.168.[censor]:4567)
181009 18:35:22 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') connection established to 0a520ba7 tcp://192.168.[censor]:4567
181009 18:35:22 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
181009 18:35:25 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') turning message relay requesting off
181009 18:35:27 [Note] WSREP: declaring 0a520ba7 at tcp://192.168.[censor]:4567 stable
181009 18:35:27 [Note] WSREP: declaring e0d6a63b at tcp://192.168.[censor]:4567 stable
181009 18:35:27 [Note] WSREP: Node b1d2aacb state prim
181009 18:35:27 [Note] WSREP: view(view_id(PRIM,0a520ba7,149) memb {
    0a520ba7,0
    b1d2aacb,0
    e0d6a63b,0
} joined {
} left {
} partitioned {
})
181009 18:35:27 [Note] WSREP: save pc into disk
181009 18:35:27 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
181009 18:35:27 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
181009 18:35:27 [Note] WSREP: STATE EXCHANGE: sent state msg: a7a0c1ba-cba6-11e8-b2c4-7bc932a69143
181009 18:35:27 [Note] WSREP: STATE EXCHANGE: got state msg: a7a0c1ba-cba6-11e8-b2c4-7bc932a69143 from 0 (CLUSTER004)
181009 18:35:27 [Note] WSREP: STATE EXCHANGE: got state msg: a7a0c1ba-cba6-11e8-b2c4-7bc932a69143 from 2 (CLUSTER003)
181009 18:35:27 [Note] WSREP: STATE EXCHANGE: got state msg: a7a0c1ba-cba6-11e8-b2c4-7bc932a69143 from 1 (CLUSTER002)
181009 18:35:27 [Note] WSREP: Quorum results:
    version    = 4,
    component  = PRIMARY,
    conf_id    = 127,
    members    = 2/3 (joined/total),
    act_id     = 781948414,
    last_appl. = 781948375,
    protocols  = 0/7/3 (gcs/repl/appl),
    group UUID = efec8dfa-4c2b-11e7-8f56-a7bf24f4c9a9
181009 18:35:27 [Note] WSREP: Flow-control interval: [28, 28]
181009 18:35:27 [Note] WSREP: New cluster view: global state: efec8dfa-4c2b-11e7-8f56-a7bf24f4c9a9:781948414, view# 128: Primary, number of nodes: 3, my index: 1, protocol version 3
181009 18:35:27 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
181009 18:35:27 [Note] WSREP: REPL Protocols: 7 (3, 2)
181009 18:35:27 [Note] WSREP: Assign initial position for certification: 781948414, protocol version: 3
181009 18:35:27 [Note] WSREP: Service thread queue flushed.
181009 18:35:29 [Note] WSREP: Member 0.0 (CLUSTER004) requested state transfer from '*any*'. Selected 2.0 (CLUSTER003)(SYNCED) as donor.
181009 18:35:29 [Note] WSREP: 2.0 (CLUSTER003): State transfer to 0.0 (CLUSTER004) complete.
181009 18:35:29 [Note] WSREP: Member 2.0 (CLUSTER003) synced with group.
181009 18:35:29 [Note] WSREP: 0.0 (CLUSTER004): State transfer from 2.0 (CLUSTER003) complete.
181009 18:35:29 [Note] WSREP: Member 0.0 (CLUSTER004) synced with group.
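
To make the timeline in this excerpt easier to read, here is a minimal Python sketch that pulls the peer timeout and the "New cluster view" lines out of the log and reports how long the cluster ran with fewer than 3 nodes. The function names (`extract_events`, `degraded_windows`) and the regular expressions are my own and are based only on the message shapes visible above; other Galera versions may format these lines differently.

```python
import re
from datetime import datetime

# Patterns cover only the two message shapes seen in the log excerpt above.
TS = r"(?P<ts>\d{6} \d{2}:\d{2}:\d{2})"
TIMEOUT_RE = re.compile(
    TS + r" \[Note\] WSREP: .*connection to peer (?P<peer>\w+) .*timed out"
)
VIEW_RE = re.compile(
    TS + r" \[Note\] WSREP: New cluster view: .*view# (?P<view>\d+): "
    r"(?P<status>\w+), number of nodes: (?P<nodes>\d+)"
)


def parse_ts(raw):
    # Log timestamps look like "181009 18:35:14" (yymmdd hh:mm:ss).
    return datetime.strptime(raw, "%y%m%d %H:%M:%S")


def extract_events(lines):
    """Return (timestamp, kind, info) tuples for timeouts and view changes."""
    events = []
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            events.append((parse_ts(m["ts"]), "peer_timeout", {"peer": m["peer"]}))
            continue
        m = VIEW_RE.search(line)
        if m:
            info = {
                "view": int(m["view"]),
                "status": m["status"],
                "nodes": int(m["nodes"]),
            }
            events.append((parse_ts(m["ts"]), "new_view", info))
    return events


def degraded_windows(events, expected_nodes=3):
    """Return (start, end, seconds) for spans where the view had fewer nodes."""
    windows, start = [], None
    for ts, kind, info in events:
        if kind != "new_view":
            continue
        if info["nodes"] < expected_nodes and start is None:
            start = ts
        elif info["nodes"] >= expected_nodes and start is not None:
            windows.append((start, ts, (ts - start).total_seconds()))
            start = None
    return windows


# Three lines copied from the excerpt above.
sample = [
    "181009 18:35:14 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') connection to peer 0a520ba7 with addr tcp://192.168.[censor]:4567 timed out, no messages seen in PT3S",
    "181009 18:35:18 [Note] WSREP: New cluster view: global state: efec8dfa-4c2b-11e7-8f56-a7bf24f4c9a9:781947656, view# 127: Primary, number of nodes: 2, my index: 0, protocol version 3",
    "181009 18:35:27 [Note] WSREP: New cluster view: global state: efec8dfa-4c2b-11e7-8f56-a7bf24f4c9a9:781948414, view# 128: Primary, number of nodes: 3, my index: 1, protocol version 3",
]

events = extract_events(sample)
for start, end, seconds in degraded_windows(events):
    print(f"{start:%H:%M:%S} -> {end:%H:%M:%S}: {seconds:.0f}s below 3 nodes")
```

For this excerpt it reports a single window from 18:35:18 to 18:35:27, i.e. about 9 seconds with only two nodes in the primary component. Running it over the full log of each node should show how often these windows occur.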

What we have done so far:

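We enhanced our database monitoring and compared the data with our other working environments. "innodb_checkpoint_age.uncheckpointed_bytes" is around 4MB ~ 10MB.
We set up some traceroute monitoring and found that at times, especially when WSREP performs its connection checks, ping suddenly jumps to >8000ms, and once even >16000ms, whereas normally it should be 0.2ms.
We tried raising the network adapter's rx and tx values from 256/256 to 512/512 to see whether that would help. It did not.
Someone on the internet said that changing the MTU from 9000 to 1500 helps, but that was not the case for us: the cluster refused to start at MTU 1500.
Before the major incident, some queries were slow, then all cluster nodes went down and we had to restart them manually. However, we have neither the evidence nor the experience to confirm that the outage is related to the slow queries.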
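
To line the latency spikes up with the WSREP timeouts, something like the following Python sketch could scan saved ping output for round-trip times above a threshold. It assumes Linux-style ping lines containing `icmp_seq=` and `time=... ms`. The function name `find_spikes`, the 3000 ms default (chosen to mirror the "no messages seen in PT3S" message in the log), and the sample lines are all illustrative assumptions, not our real measurements.

```python
import re

# Assumes Linux-style ping output such as:
#   64 bytes from 192.0.2.10: icmp_seq=3 ttl=64 time=8123 ms
RTT_RE = re.compile(r"icmp_seq=(?P<seq>\d+).*time=(?P<rtt>[\d.]+) ms")


def find_spikes(ping_lines, threshold_ms=3000.0):
    """Return (icmp_seq, rtt_ms) for every reply slower than threshold_ms."""
    spikes = []
    for line in ping_lines:
        m = RTT_RE.search(line)
        if m and float(m["rtt"]) > threshold_ms:
            spikes.append((int(m["seq"]), float(m["rtt"])))
    return spikes


# Made-up sample lines for illustration only (192.0.2.10 is a documentation address).
sample_ping = [
    "64 bytes from 192.0.2.10: icmp_seq=1 ttl=64 time=0.212 ms",
    "64 bytes from 192.0.2.10: icmp_seq=2 ttl=64 time=0.198 ms",
    "64 bytes from 192.0.2.10: icmp_seq=3 ttl=64 time=8123 ms",
    "64 bytes from 192.0.2.10: icmp_seq=4 ttl=64 time=0.205 ms",
]

spikes = find_spikes(sample_ping)
for seq, rtt in spikes:
    print(f"icmp_seq={seq} took {rtt:.0f} ms")
```

Running ping with timestamps enabled and comparing the flagged replies against the WSREP "timed out" lines would show whether the spikes come before the timeouts or only appear alongside them.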

I'm not a database expert, so this may be the best I can do. If we have missed anything, please reply to this post. Thank you very much.
