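I'm having trouble diagnosing a problem with our MariaDB cluster and would appreciate some advice.
We run a 3-node MariaDB cluster. Each node is on a dedicated ESXi server, and the nodes are connected over a local network within the same data center. Recently we have been seeing occasional timeout errors. We have looked into many things but have not reached a conclusion.
Here is the detailed log for one of the timeout errors: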
181009 18:35:14 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') connection to peer 0a520ba7 with addr tcp://192.168.[censor]:4567 timed out, no messages seen in PT3S
181009 18:35:14 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.[censor]:4567
181009 18:35:15 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') reconnecting to 0a520ba7 (tcp://192.168.[censor]:4567), attempt 0
181009 18:35:17 [Note] WSREP: evs::proto(b1d2aacb, GATHER, view_id(REG,0a520ba7,147)) suspecting node: 0a520ba7
181009 18:35:17 [Note] WSREP: evs::proto(b1d2aacb, GATHER, view_id(REG,0a520ba7,147)) suspected node without join message, declaring inactive
181009 18:35:18 [Note] WSREP: declaring e0d6a63b at tcp://192.168.[censor]:4567 stable
181009 18:35:18 [Note] WSREP: Node b1d2aacb state prim
181009 18:35:18 [Note] WSREP: view(view_id(PRIM,b1d2aacb,148) memb {
b1d2aacb,0
e0d6a63b,0
} joined {
} left {
} partitioned {
0a520ba7,0
})
181009 18:35:18 [Note] WSREP: save pc into disk
181009 18:35:18 [Note] WSREP: forgetting 0a520ba7 (tcp://192.168.[censor]:4567)
181009 18:35:18 [Note] WSREP: deleting entry tcp://192.168.[censor]:4567
181009 18:35:18 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') turning message relay requesting off
181009 18:35:18 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
181009 18:35:18 [Note] WSREP: STATE_EXCHANGE: sent state UUID: a226ba90-cba6-11e8-af4f-3751d36b7f83
181009 18:35:18 [Note] WSREP: STATE EXCHANGE: sent state msg: a226ba90-cba6-11e8-af4f-3751d36b7f83
181009 18:35:18 [Note] WSREP: STATE EXCHANGE: got state msg: a226ba90-cba6-11e8-af4f-3751d36b7f83 from 0 (CLUSTER002)
181009 18:35:18 [Note] WSREP: STATE EXCHANGE: got state msg: a226ba90-cba6-11e8-af4f-3751d36b7f83 from 1 (CLUSTER003)
181009 18:35:18 [Note] WSREP: Quorum results:
version = 4,
component = PRIMARY,
conf_id = 126,
members = 2/2 (joined/total),
act_id = 781947656,
last_appl. = 781947602,
protocols = 0/7/3 (gcs/repl/appl),
group UUID = efec8dfa-4c2b-11e7-8f56-a7bf24f4c9a9
181009 18:35:18 [Note] WSREP: Flow-control interval: [23, 23]
181009 18:35:18 [Note] WSREP: New cluster view: global state: efec8dfa-4c2b-11e7-8f56-a7bf24f4c9a9:781947656, view# 127: Primary, number of nodes: 2, my index: 0, protocol version 3
181009 18:35:18 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
181009 18:35:18 [Note] WSREP: REPL Protocols: 7 (3, 2)
181009 18:35:18 [Note] WSREP: Assign initial position for certification: 781947656, protocol version: 3
181009 18:35:18 [Note] WSREP: Service thread queue flushed.
181009 18:35:20 [Note] WSREP: cleaning up 0a520ba7 (tcp://192.168.[censor]:4567)
181009 18:35:22 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') connection established to 0a520ba7 tcp://192.168.[censor]:4567
181009 18:35:22 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers:
181009 18:35:25 [Note] WSREP: (b1d2aacb, 'tcp://0.0.0.0:4567') turning message relay requesting off
181009 18:35:27 [Note] WSREP: declaring 0a520ba7 at tcp://192.168.[censor]:4567 stable
181009 18:35:27 [Note] WSREP: declaring e0d6a63b at tcp://192.168.[censor]:4567 stable
181009 18:35:27 [Note] WSREP: Node b1d2aacb state prim
181009 18:35:27 [Note] WSREP: view(view_id(PRIM,0a520ba7,149) memb {
0a520ba7,0
b1d2aacb,0
e0d6a63b,0
} joined {
} left {
} partitioned {
})
181009 18:35:27 [Note] WSREP: save pc into disk
181009 18:35:27 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 3
181009 18:35:27 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
181009 18:35:27 [Note] WSREP: STATE EXCHANGE: sent state msg: a7a0c1ba-cba6-11e8-b2c4-7bc932a69143
181009 18:35:27 [Note] WSREP: STATE EXCHANGE: got state msg: a7a0c1ba-cba6-11e8-b2c4-7bc932a69143 from 0 (CLUSTER004)
181009 18:35:27 [Note] WSREP: STATE EXCHANGE: got state msg: a7a0c1ba-cba6-11e8-b2c4-7bc932a69143 from 2 (CLUSTER003)
181009 18:35:27 [Note] WSREP: STATE EXCHANGE: got state msg: a7a0c1ba-cba6-11e8-b2c4-7bc932a69143 from 1 (CLUSTER002)
181009 18:35:27 [Note] WSREP: Quorum results:
version = 4,
component = PRIMARY,
conf_id = 127,
members = 2/3 (joined/total),
act_id = 781948414,
last_appl. = 781948375,
protocols = 0/7/3 (gcs/repl/appl),
group UUID = efec8dfa-4c2b-11e7-8f56-a7bf24f4c9a9
181009 18:35:27 [Note] WSREP: Flow-control interval: [28, 28]
181009 18:35:27 [Note] WSREP: New cluster view: global state: efec8dfa-4c2b-11e7-8f56-a7bf24f4c9a9:781948414, view# 128: Primary, number of nodes: 3, my index: 1, protocol version 3
181009 18:35:27 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
181009 18:35:27 [Note] WSREP: REPL Protocols: 7 (3, 2)
181009 18:35:27 [Note] WSREP: Assign initial position for certification: 781948414, protocol version: 3
181009 18:35:27 [Note] WSREP: Service thread queue flushed.
181009 18:35:29 [Note] WSREP: Member 0.0 (CLUSTER004) requested state transfer from '*any*'. Selected 2.0 (CLUSTER003)(SYNCED) as donor.
181009 18:35:29 [Note] WSREP: 2.0 (CLUSTER003): State transfer to 0.0 (CLUSTER004) complete.
181009 18:35:29 [Note] WSREP: Member 2.0 (CLUSTER003) synced with group.
181009 18:35:29 [Note] WSREP: 0.0 (CLUSTER004): State transfer from 2.0 (CLUSTER003) complete.
181009 18:35:29 [Note] WSREP: Member 0.0 (CLUSTER004) synced with group.
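To line incidents like this one up against other monitoring data, it may help to reduce the log to a short timeline. The following is only a minimal sketch, assuming the error log keeps the `YYMMDD HH:MM:SS [Level] WSREP: ...` prefix seen in the excerpt above. The event labels and the function names `extract_events` and `outage_seconds` are hypothetical names of my own, not anything provided by MariaDB or Galera.

```python
import re
from datetime import datetime

# Timestamp prefix as it appears in the excerpt, e.g.
# "181009 18:35:14 [Note] WSREP: ..."
LINE_RE = re.compile(r"^(\d{6} \d{2}:\d{2}:\d{2}) \[(\w+)\] WSREP: (.*)$")

# Substrings copied from the log excerpt; the labels are hypothetical.
MARKERS = {
    "timed out": "peer_timeout",
    "suspecting node": "suspect",
    "New cluster view": "view_change",
    "requested state transfer": "state_transfer",
}


def extract_events(lines):
    """Return (timestamp, label, message) tuples for the marked log lines."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            # Continuation lines such as "version = 4," carry no timestamp.
            continue
        ts = datetime.strptime(match.group(1), "%y%m%d %H:%M:%S")
        message = match.group(3)
        for needle, label in MARKERS.items():
            if needle in message:
                events.append((ts, label, message))
                break
    return events


def outage_seconds(events):
    """Seconds from the first peer timeout to the last cluster view change."""
    timeouts = [ts for ts, label, _ in events if label == "peer_timeout"]
    views = [ts for ts, label, _ in events if label == "view_change"]
    if not timeouts or not views:
        return None
    return (max(views) - min(timeouts)).total_seconds()
```

Feeding it the lines of the error log (for example `extract_events(open(path))`, where `path` is wherever the server writes its log) gives a list that can be compared against slow-query or network graphs for the same seconds.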
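What we have done so far:
We enhanced our database monitoring and compared the data with our other working environments. We found that "innodb_checkpoint_age.uncheckpointed_bytes" is roughly 4MB ~ 10MB.
We set up some traceroute monitoring and found that at times, especially when WSREP is doing its connection checks, ping latency suddenly spikes to >8000ms, and once to >16000ms, whereas it is normally about 0.2ms.
We tried raising the network adapter's rx and tx values from 256/256 to 512/512 to see whether that would help. It did not.
Some people online say that changing the MTU from 9000 to 1500 helps, but it did not work for us: the cluster refused to start with MTU 1500.
Before the major incident, some queries were slow, then all of the cluster nodes went down and we had to restart them manually. We have neither the evidence nor the experience to confirm that this was related to the slow queries, though.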
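In case it helps anyone reproduce the latency measurements, here is a rough sketch of a probe that could run alongside the ping/traceroute monitoring. It times a TCP connect to the Galera port (4567, as shown in the log) instead of using ICMP, so the numbers are not directly comparable with ping. The function names and the 3000 ms default threshold are my own assumptions; the threshold simply mirrors the "no messages seen in PT3S" line in the log.

```python
import socket
import time


def tcp_connect_ms(host, port=4567, timeout=20.0):
    """Return the TCP connect time in milliseconds, or None if it fails."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            pass
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return None
    return (time.monotonic() - start) * 1000.0


def find_spikes(samples, threshold_ms=3000.0):
    """Keep samples that failed outright or exceeded the threshold.

    samples is a list of (label, latency_ms_or_None) pairs, where the
    label can be a timestamp string or any other identifier.
    """
    return [
        (label, ms)
        for label, ms in samples
        if ms is None or ms > threshold_ms
    ]
```

A loop that calls `tcp_connect_ms("<peer address>")` once per second from each node and stores `(time.strftime("%H:%M:%S"), result)` pairs would give `find_spikes` something to work with. The peer address is a placeholder here because the real addresses are censored above.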
I'm not a database expert, so this is probably the best I can do. If we have missed anything, please reply to this post. Thank you very much.