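I have three MariaDB servers set up in a Galera cluster. I use one server at a time as the "primary" master (i.e. Galera is only there for failover; the application does not actively write to multiple masters).
About every two weeks, the primary server crashes. The other two servers in the cluster stay healthy, and I can restart the crashed server and it rejoins normally.
I have rotated the "primary" role through all three servers, and whichever one I pick is the one that crashes. So it seems unlikely to be hardware related.
The question is: why is this happening? How can I track it down? Should I file it as a bug with MariaDB?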
2015-04-09 02:02:38 7f788745a700 InnoDB: Assertion failure in thread 140155642291968 in file rem0rec.cc line 580
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
150409 2:02:38 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
To report this bug, see http://kb.askmonty.org/en/reporting-bugs
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Server version: 10.0.16-MariaDB-1~trusty-wsrep-log
key_buffer_size=52428800
read_buffer_size=131072
max_used_connections=128
max_threads=402
thread_count=11
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 934441 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x0x7f75176b3008
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7f7887459df0 thread_stack 0x30000
150409 2:02:44 [Warning] WSREP: last inactive check more than PT1.5S ago (PT5.98149S), skipping check
150409 2:02:44 [Note] WSREP: (c86d2afe-da1f-11e4-befa-264d853d1e46, 'tcp://0.0.0.0:4567') address 'tcp://192.168.178.10:4567' pointing to uuid c86d2afe-da1f-11e4-befa-264d853d1e46 is blacklisted, skipping
150409 2:02:44 [Note] WSREP: (c86d2afe-da1f-11e4-befa-264d853d1e46, 'tcp://0.0.0.0:4567') address 'tcp://192.168.178.10:4567' pointing to uuid c86d2afe-da1f-11e4-befa-264d853d1e46 is blacklisted, skipping
150409 2:02:44 [Note] WSREP: (c86d2afe-da1f-11e4-befa-264d853d1e46, 'tcp://0.0.0.0:4567') address 'tcp://192.168.178.10:4567' pointing to uuid c86d2afe-da1f-11e4-befa-264d853d1e46 is blacklisted, skipping
150409 2:02:44 [Note] WSREP: (c86d2afe-da1f-11e4-befa-264d853d1e46, 'tcp://0.0.0.0:4567') address 'tcp://192.168.178.10:4567' pointing to uuid c86d2afe-da1f-11e4-befa-264d853d1e46 is blacklisted, skipping
150409 2:02:44 [Note] WSREP: view(view_id(NON_PRIM,70802785-d454-11e4-9152-2b6d076ff37a,26) memb {
c86d2afe-da1f-11e4-befa-264d853d1e46,0
} joined {
} left {
} partitioned {
70802785-d454-11e4-9152-2b6d076ff37a,0
e18a3f1a-c314-11e4-a25a-c6a751e32d91,0
})
150409 2:02:44 [Note] WSREP: view(view_id(NON_PRIM,c86d2afe-da1f-11e4-befa-264d853d1e46,27) memb {
c86d2afe-da1f-11e4-befa-264d853d1e46,0
} joined {
} left {
} partitioned {
70802785-d454-11e4-9152-2b6d076ff37a,0
e18a3f1a-c314-11e4-a25a-c6a751e32d91,0
})
150409 2:02:44 [Note] WSREP: (c86d2afe-da1f-11e4-befa-264d853d1e46, 'tcp://0.0.0.0:4567') address 'tcp://192.168.178.10:4567' pointing to uuid c86d2afe-da1f-11e4-befa-264d853d1e46 is blacklisted, skipping
150409 2:02:44 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
150409 2:02:44 [Note] WSREP: Flow-control interval: [16, 16]
150409 2:02:44 [Note] WSREP: Received NON-PRIMARY.
150409 2:02:44 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 497086935)
150409 2:02:44 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
150409 2:02:44 [Note] WSREP: Flow-control interval: [16, 16]
150409 2:02:44 [Note] WSREP: Received NON-PRIMARY.
150409 2:02:44 [Note] WSREP: New cluster view: global state: ec05ddd0-c265-11e4-b715-e69a238eb511:497086935, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
150409 2:02:44 [Warning] WSREP: Send action {(nil), 250, TORDERED} returned -107 (Transport endpoint is not connected)
150409 2:02:44 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
150409 2:02:44 [Note] WSREP: New cluster view: global state: ec05ddd0-c265-11e4-b715-e69a238eb511:497086935, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
150409 2:02:44 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
150409 2:02:44 [Note] WSREP: (c86d2afe-da1f-11e4-befa-264d853d1e46, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.177.11:4567 tcp://192.168.179.12:4567
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x7f7898d74c7e]
/usr/sbin/mysqld(handle_fatal_signal+0x457)[0x7f78988ac8a7]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x10340)[0x7f7897059340]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39)[0x7f78966b0cc9]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f78966b40d8]
/usr/sbin/mysqld(+0x8832eb)[0x7f7898b9f2eb]
/usr/sbin/mysqld(+0x8858ff)[0x7f7898ba18ff]
/usr/sbin/mysqld(+0x802c9e)[0x7f7898b1ec9e]
/usr/sbin/mysqld(+0x892af5)[0x7f7898baeaf5]
/usr/sbin/mysqld(+0x895133)[0x7f7898bb1133]
/usr/sbin/mysqld(+0x8bece8)[0x7f7898bdace8]
/usr/sbin/mysqld(+0x8c3361)[0x7f7898bdf361]
/usr/sbin/mysqld(+0x8c3c27)[0x7f7898bdfc27]
/usr/sbin/mysqld(+0x8a4689)[0x7f7898bc0689]
/usr/sbin/mysqld(+0x804fb7)[0x7f7898b20fb7]
/usr/sbin/mysqld(_ZN7handler13ha_delete_rowEPKh+0x3f7)[0x7f78988b7b27]
/usr/sbin/mysqld(_Z12mysql_deleteP3THDP10TABLE_LISTP4ItemP10SQL_I_ListI8st_orderEyyP13select_result+0xf3e)[0x7f78989f047e]
/usr/sbin/mysqld(_Z21mysql_execute_commandP3THD+0x23cb)[0x7f7898723fcb]
/usr/sbin/mysqld(+0x40f7b7)[0x7f789872b7b7]
/usr/sbin/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x1ebb)[0x7f789872dd1b]
/usr/sbin/mysqld(_Z10do_commandP3THD+0x20f)[0x7f789872e9bf]
/usr/sbin/mysqld(_Z24do_handle_one_connectionP3THD+0x1fb)[0x7f78987fcbcb]
/usr/sbin/mysqld(handle_one_connection+0x40)[0x7f78987fcdb0]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x8182)[0x7f7897051182]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f789677447d]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x7f750940f020): is an invalid pointer
Connection ID (thread ID): 25689442
Status: NOT_KILLED
Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
150409 02:02:46 mysqld_safe Number of processes running now: 0
150409 02:02:46 mysqld_safe WSREP: not restarting wsrep node automatically
150409 02:02:46 mysqld_safe mysqld from pid file /var/run/mysqld/mysqld.pid ended
Answer 1
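Yes. Always file stack traces as bugs with MariaDB.
I have not seen any similar reports. I would definitely update to the latest stable 10.0 release first.
Try running with log-slave-updates and binary logging enabled. That should help identify the SQL statement that causes the crash.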
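When filing the report, it helps to pull the key facts out of the error log first. Below is a minimal sketch in Python, not a MariaDB tool: the function name `summarize_crash`, the dictionary keys, and the `SAMPLE` excerpt are my own choices for illustration, and the regular expressions assume the log format shown in the question. It extracts the InnoDB assertion location, the signal number, the server version, and the stack frames that still carry a symbol name.

```python
import re

# Patterns assume the error-log format shown in the question.
ASSERT_RE = re.compile(r"Assertion failure in thread \d+ in file (\S+) line (\d+)")
SIGNAL_RE = re.compile(r"mysqld got signal (\d+)")
VERSION_RE = re.compile(r"^Server version: (.+)$", re.MULTILINE)
# Matches frames such as /usr/sbin/mysqld(symbol+0x3f7)[0x7f78...];
# frames without a symbol, e.g. mysqld(+0x8832eb), are skipped.
FRAME_RE = re.compile(
    r"^/[^\s(]+\(([A-Za-z_]\w*)\+0x[0-9a-f]+\)\[0x[0-9a-f]+\]",
    re.MULTILINE,
)


def summarize_crash(log_text):
    """Return the fields of a mysqld crash report that a bug report needs."""
    assertion = ASSERT_RE.search(log_text)
    signal = SIGNAL_RE.search(log_text)
    version = VERSION_RE.search(log_text)
    return {
        "assert_file": assertion.group(1) if assertion else None,
        "assert_line": int(assertion.group(2)) if assertion else None,
        "signal": int(signal.group(1)) if signal else None,
        "server_version": version.group(1).strip() if version else None,
        "named_frames": FRAME_RE.findall(log_text),
    }


# A few lines copied from the log in the question, used as sample input.
SAMPLE = """\
2015-04-09 02:02:38 7f788745a700 InnoDB: Assertion failure in thread 140155642291968 in file rem0rec.cc line 580
150409 2:02:38 [ERROR] mysqld got signal 6 ;
Server version: 10.0.16-MariaDB-1~trusty-wsrep-log
/usr/sbin/mysqld(+0x8832eb)[0x7f7898b9f2eb]
/usr/sbin/mysqld(_ZN7handler13ha_delete_rowEPKh+0x3f7)[0x7f78988b7b27]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f78966b40d8]
"""

if __name__ == "__main__":
    for key, value in summarize_crash(SAMPLE).items():
        print(key, value)
```

In the trace you posted, the named frames include mysql_delete and handler::ha_delete_row (the mangled _ZN7handler13ha_delete_rowEPKh), which suggests the crashing statement was a DELETE. The query pointer itself was invalid, so the binary log is still the way to find the exact statement.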