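I have an HP DL360 G7 running CentOS Linux release 7.4.1708 (Core) that suffers kernel panics at random intervals and, as far as I can tell, for random reasons. I dug into the crash dumps using these as guides: https://www.slideshare.net/PaulVNovarese/linux-crash-dump-capture-and-analysis and https://www.dedoimedo.com/computers/crash-analyze.html
But as far as I can see, the crash dumps point to a different cause each time; that is, I do not understand crash dumps well enough to determine whether they are connected in any way.
These are all the crash dumps I have managed to capture. The pastes contain the output of the crash commands sys, bt and log for each dump.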
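One way to line the dumps up against each other is to pull the same few fields out of every saved crash output. The following is a minimal Python sketch, assuming the sys and bt output of each dump has been saved to a text file. The function name summarize and the choice of fields are illustrative only and are not part of the crash utility.

```python
import re
import sys

# Patterns for fields printed by the crash commands "sys" and "bt".
RELEASE_RE = re.compile(r'^\s*RELEASE:\s*(\S+)', re.MULTILINE)
PANIC_RE = re.compile(r'^\s*PANIC:\s*"(.*)"\s*$', re.MULTILINE)
COMMAND_RE = re.compile(r'COMMAND:\s*"([^"]*)"')
# Matches "[exception RIP: symbol+offset]" as well as "[exception RIP: symbol]".
RIP_RE = re.compile(r'\[exception RIP:\s*([^\]\s+]+)(?:\+\d+)?\]')


def summarize(text):
    """Return the kernel release, panic string, task and faulting symbol."""
    def first(pattern):
        match = pattern.search(text)
        return match.group(1).strip() if match else None

    return {
        "release": first(RELEASE_RE),
        "panic": first(PANIC_RE),
        "command": first(COMMAND_RE),
        "fault_symbol": first(RIP_RE),
    }


if __name__ == "__main__":
    # Usage (hypothetical file names): python summarize.py dump1.txt dump2.txt
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8", errors="replace") as fh:
            print(path, summarize(fh.read()))
```

Printing one line per dump makes it easier to see whether the faulting symbol, the kernel release or the running task repeats across panics.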
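EDIT
Following the suggested edit, I have pasted the output of the crash commands sys and bt below. I am keeping the Pastebin links because the output of the crash command log is too large to include here.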
I should also add that the server uses ECC memory and the logs show no sign of an MCE, so I do not believe this is a memory error.
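For reference, where the kernel's EDAC driver is loaded, the corrected and uncorrected ECC error counters can also be read from sysfs. This is a minimal Python sketch, assuming the usual /sys/devices/system/edac/mc layout with ce_count and ue_count files per memory controller. The function name edac_counts is made up for illustration.

```python
from pathlib import Path


def edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce_count": int, "ue_count": int}} from sysfs.

    A counter is None when its file is missing or unreadable.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for name in ("ce_count", "ue_count"):
            try:
                entry[name] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[name] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    for controller, entry in edac_counts().items():
        print(controller, entry)
```

An empty result only means that no EDAC memory controller is exposed on that system. It does not show that the memory is healthy.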
Any help or guidance on how to troubleshoot this would be greatly appreciated.
KERNEL: /usr/lib/debug/lib/modules/3.10.0-693.11.1.el7.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2018-01-03-03:08:48/vmcore [PARTIAL DUMP]
CPUS: 24
DATE: Wed Jan 3 03:07:40 2018
UPTIME: 28 days, 00:57:45
LOAD AVERAGE: 3.45, 2.43, 2.66
TASKS: 714
NODENAME: server
RELEASE: 3.10.0-693.11.1.el7.x86_64
VERSION: #1 SMP Mon Dec 4 23:52:40 UTC 2017
MACHINE: x86_64 (2666 Mhz)
MEMORY: 72 GB
PANIC: "general protection fault: 0000 [#1] SMP "
bt output:
PID: 24892 TASK: ffff8808f9111fa0 CPU: 0 COMMAND: "python"
#0 [ffff8808fba03910] machine_kexec at ffffffff8105c52b
#1 [ffff8808fba03970] __crash_kexec at ffffffff81104a42
#2 [ffff8808fba03a40] crash_kexec at ffffffff81104b30
#3 [ffff8808fba03a58] oops_end at ffffffff816ad338
#4 [ffff8808fba03a80] die at ffffffff8102e97b
#5 [ffff8808fba03ab0] do_general_protection at ffffffff816accbe
#6 [ffff8808fba03ae0] general_protection at ffffffff816ac568
[exception RIP: inet6_csk_search_req+261]
RIP: ffffffff81673385 RSP: ffff8808fba03b98 RFLAGS: 00010202
RAX: 0000000000001c9e RBX: 3932383931333431 RCX: ffff8807ed7918c8
RDX: 00000000ffffffff RSI: 0000000000000000 RDI: 00000000b1fa562e
RBP: ffff8808fba03bb0 R8: ffff8807ed7918d8 R9: 0000000000000001
R10: ffff8801f3ffc000 R11: 00000000b80dcf9c R12: ffff880783d27178
R13: ffff8808fba03bd0 R14: ffff8808ec21a9a8 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ffff8808fba03bb8] tcp_v6_do_rcv at ffffffff8166d73a
#8 [ffff8808fba03c10] tcp_v6_rcv at ffffffff8166e1b2
#9 [ffff8808fba03cc0] ip6_input_finish at ffffffff81643712
#10 [ffff8808fba03d00] ip6_input at ffffffff81643fe3
#11 [ffff8808fba03d58] ip6_rcv_finish at ffffffff81643518
#12 [ffff8808fba03d70] ipv6_rcv at ffffffff81643d99
#13 [ffff8808fba03df0] __netif_receive_skb_core at ffffffff81586f22
#14 [ffff8808fba03e60] __netif_receive_skb at ffffffff81587188
#15 [ffff8808fba03e80] process_backlog at ffffffff8158841e
#16 [ffff8808fba03ec0] net_rx_action at ffffffff8158799d
#17 [ffff8808fba03f40] __do_softirq at ffffffff81090b4f
#18 [ffff8808fba03fb0] call_softirq at ffffffff816b6b1c
--- <IRQ stack> ---
#19 [ffff880c1fcb3980] local_bh_enable at ffffffff81090017
#20 [ffff880c1fcb3990] __dev_queue_xmit at ffffffff815895a5
#21 [ffff880c1fcb39e8] local_bh_enable at ffffffff81090017
#22 [ffff880c1fcb39f8] ip6_finish_output2 at ffffffff816407b1
#23 [ffff880c1fcb3a78] ip6_finish_output at ffffffff81642cbc
#24 [ffff880c1fcb3aa0] ip6_output at ffffffff81642d77
#25 [ffff880c1fcb3b00] ip6_xmit at ffffffff81640039
#26 [ffff880c1fcb3ba8] inet6_csk_xmit at ffffffff81673059
#27 [ffff880c1fcb3c48] tcp_transmit_skb at ffffffff815e7c9f
#28 [ffff880c1fcb3cb8] tcp_connect at ffffffff815e97ed
#29 [ffff880c1fcb3d38] tcp_v6_connect at ffffffff8166c106
#30 [ffff880c1fcb3e08] __inet_stream_connect at ffffffff81605725
#31 [ffff880c1fcb3e80] inet_stream_connect at ffffffff816059d8
#32 [ffff880c1fcb3eb0] SYSC_connect at ffffffff8156a497
#33 [ffff880c1fcb3f70] sys_connect at ffffffff8156b29e
#34 [ffff880c1fcb3f80] system_call_fastpath at ffffffff816b5089
RIP: 00007f9e58f319d0 RSP: 00007ffece4029e0 RFLAGS: 00010246
RAX: 000000000000002a RBX: ffffffff816b5089 RCX: 00007f9e59144c28
RDX: 000000000000001c RSI: 00007ffece4030d0 RDI: 0000000000000003
RBP: 0000000000000000 R8: 0000000000000000 R9: 00000000032dfb60
R10: 0000000000000006 R11: 0000000000000246 R12: ffffffff8156b29e
R13: ffff880c1fcb3f78 R14: 0000000002b56510 R15: 00000000025950a0
ORIG_RAX: 000000000000002a CS: 0033 SS: 002b
KERNEL: /usr/lib/debug/lib/modules/3.10.0-693.11.1.el7.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2017-12-06-02:05:41/vmcore [PARTIAL DUMP]
CPUS: 24
DATE: Wed Dec 6 02:04:35 2017
UPTIME: 15 days, 00:23:09
LOAD AVERAGE: 6.23, 4.95, 3.77
TASKS: 726
NODENAME: server
RELEASE: 3.10.0-693.5.2.el7.x86_64
VERSION: #1 SMP Fri Oct 20 20:32:50 UTC 2017
MACHINE: x86_64 (2666 Mhz)
MEMORY: 72 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at 00000000000008d0"
bt output:
PID: 31570 TASK: ffff8811f9f9dee0 CPU: 11 COMMAND: "mysqld"
#0 [ffff8800988ff340] machine_kexec at ffffffff8105c4cb
#1 [ffff8800988ff3a0] __crash_kexec at ffffffff81104a42
#2 [ffff8800988ff470] crash_kexec at ffffffff81104b30
#3 [ffff8800988ff488] oops_end at ffffffff816ad338
#4 [ffff8800988ff4b0] no_context at ffffffff8169d35a
#5 [ffff8800988ff500] __bad_area_nosemaphore at ffffffff8169d3f0
#6 [ffff8800988ff548] bad_area at ffffffff8169d714
#7 [ffff8800988ff570] __do_page_fault at ffffffff816b02fc
#8 [ffff8800988ff5d0] do_page_fault at ffffffff816b03a5
#9 [ffff8800988ff600] page_fault at ffffffff816ac5c8
[exception RIP: xfs_fs_destroy_inode+78]
RIP: ffffffffc03d6f3e RSP: ffff8800988ff6b0 RFLAGS: 00010202
RAX: 0000000000000000 RBX: ffff8809781d00f4 RCX: 0000000000020100
RDX: 000000000000000b RSI: ffff8809781d01d8 RDI: ffff8809781d0000
RBP: ffff8800988ff6c8 R8: 0000000000000000 R9: 09781d01f00c0000
R10: f669e55c17347c03 R11: 0000000000000000 R12: ffff8809781d0000
R13: ffff8809781d0150 R14: ffff8811f97c2108 R15: ffff880cd7cbfe08
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#10 [ffff8800988ff6d0] evict at ffffffff8121d658
#11 [ffff8800988ff710] iget_locked at ffffffff8121d83e
#12 [ffff8800988ff738] prune_icache_sb at ffffffff8121e834
#13 [ffff8800988ff7f0] vmpressure at ffffffff811f7451
#14 [ffff8800988ff878] do_try_to_free_pages at ffffffff811985a2
#15 [ffff8800988ff8f0] try_to_free_pages at ffffffff811987bc
#16 [ffff8800988ff988] __alloc_pages_slowpath at ffffffff8169fbcb
#17 [ffff8800988ffa78] __alloc_pages_nodemask at ffffffff8118cdb5
#18 [ffff8800988ffb28] alloc_pages_current at ffffffff811d1078
#19 [ffff8800988ffb70] alloc_skb_with_frags at ffffffff81573cbd
#20 [ffff8800988ffbc0] sock_alloc_send_pskb at ffffffff8156c9b9
#21 [ffff8800988ffc48] unix_stream_sendmsg at ffffffff8163aef0
#22 [ffff8800988ffcf8] sock_sendmsg at ffffffff8156a580
#23 [ffff8800988ffe58] SYSC_sendto at ffffffff8156a731
#24 [ffff8800988fff70] sys_sendto at ffffffff8156b2ce
#25 [ffff8800988fff80] system_call_fastpath at ffffffff816b5089
RIP: 00007fca94f07c0b RSP: 00007fca7fed7308 RFLAGS: 00000202
RAX: 000000000000002c RBX: ffffffff816b5089 RCX: 00000000014c8a7c
RDX: 0000000000004000 RSI: 00007fca6ebdb008 RDI: 0000000000000163
RBP: 0000000000000000 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000040 R11: 0000000000000246 R12: ffffffff8156b2ce
R13: ffff8800988fff78 R14: 00007fca6ebef238 R15: 0000000000004000
ORIG_RAX: 000000000000002c CS: 0033 SS: 002b
KERNEL: /usr/lib/debug/lib/modules/3.10.0-693.11.1.el7.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2017-11-14-04:21:30/vmcore [PARTIAL DUMP]
CPUS: 24
DATE: Tue Nov 14 04:20:23 2017
UPTIME: 09:25:26
LOAD AVERAGE: 2.21, 1.47, 1.71
TASKS: 788
NODENAME: server
RELEASE: 3.10.0-693.2.2.el7.x86_64
VERSION: #1 SMP Tue Sep 12 22:26:13 UTC 2017
MACHINE: x86_64 (2666 Mhz)
MEMORY: 72 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000008"
bt output:
PID: 29737 TASK: ffff8811f6346eb0 CPU: 3 COMMAND: "kworker/3:0"
#0 [ffff8811ac617948] machine_kexec at ffffffff8105c4cb
#1 [ffff8811ac6179a8] __crash_kexec at ffffffff81104a32
#2 [ffff8811ac617a78] crash_kexec at ffffffff81104b20
#3 [ffff8811ac617ab8] no_context at ffffffff8169d2ba
#4 [ffff8811ac617b08] no_context at ffffffff8169d350
#5 [ffff8811ac617b50] __bad_area_nosemaphore at ffffffff8169d4ba
#6 [ffff8811ac617b60] __do_page_fault at ffffffff816b017e
#7 [ffff8811ac617bc0] __do_page_fault at ffffffff816b0325
#8 [ffff8811ac617bf0] general_protection at ffffffff816ac548
[exception RIP: xlog_write+772]
RIP: ffffffffc03f8644 RSP: ffff8811ac617ca0 RFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff8811af615870 RCX: 0000000000000000
RDX: 00000000000261a0 RSI: 0000000000000000 RDI: 0000000000005e6c
RBP: ffff8811ac617d38 R8: 0000000000000000 R9: ffffc9000d576078
R10: ffffc9000d57606c R11: ffff8808f53ad800 R12: 0000000000007e00
R13: 00000000000000d0 R14: 0000000000000000 R15: ffff880c9c7bb000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff8811ac617d40] xlog_cil_push at ffffffffc03fa1f8 [xfs]
#10 [ffff8811ac617e10] xlog_cil_push_work at ffffffffc03fa395 [xfs]
#11 [ffff8811ac617e20] process_one_work at ffffffff810a881a
#12 [ffff8811ac617e68] worker_thread at ffffffff810a94e6
#13 [ffff8811ac617ec8] kthread at ffffffff810b098f
#14 [ffff8811ac617f50] save_rest at ffffffff816b4f58
KERNEL: /usr/lib/debug/lib/modules/3.10.0-693.11.1.el7.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2017-11-13-18:50:12/vmcore [PARTIAL DUMP]
CPUS: 24
DATE: Mon Nov 13 18:49:05 2017
UPTIME: 10 days, 17:30:03
LOAD AVERAGE: 2.10, 2.43, 2.33
TASKS: 785
NODENAME: server
RELEASE: 3.10.0-693.2.2.el7.x86_64
VERSION: #1 SMP Tue Sep 12 22:26:13 UTC 2017
MACHINE: x86_64 (2666 Mhz)
MEMORY: 72 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at 0000000000000020"
bt output:
PID: 27951 TASK: ffff8808f1b25ee0 CPU: 16 COMMAND: "php-cgi"
#0 [ffff880113643a90] machine_kexec at ffffffff8105c4cb
#1 [ffff880113643af0] __crash_kexec at ffffffff81104a32
#2 [ffff880113643bc0] crash_kexec at ffffffff81104b20
#3 [ffff880113643c00] no_context at ffffffff8169d2ba
#4 [ffff880113643c50] no_context at ffffffff8169d350
#5 [ffff880113643c98] mm_fault_error at ffffffff8169d674
#6 [ffff880113643cc0] __do_page_fault at ffffffff816b027c
#7 [ffff880113643d20] __do_page_fault at ffffffff816b0325
#8 [ffff880113643d50] general_protection at ffffffff816ac548
[exception RIP: xfs_free_eofblocks+91]
RIP: ffffffffc03cf8db RSP: ffff880113643e00 RFLAGS: 00010212
RAX: 0000000000000000 RBX: ffff8811f3bdcec0 RCX: 000000000000000c
RDX: 0000000000000fff RSI: 0000000000000001 RDI: ffff8811f3bdcec0
RBP: ffff880113643e58 R8: 0000000000000000 R9: 0000000000000000
R10: ffff8811f3bdd010 R11: ffff880017c7b710 R12: 0000000000015a0b
R13: ffff8808f7346000 R14: ffff8808f4ba00c0 R15: ffff880170ad2d20
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff880113643e60] xfs_release at ffffffffc03e7105 [xfs]
#10 [ffff880113643e88] xfs_file_release at ffffffffc03d65a5 [xfs]
#11 [ffff880113643e98] delayed_fput at ffffffff81202fc9
#12 [ffff880113643ee0] alloc_file at ffffffff8120322e
#13 [ffff880113643ef0] task_work_run at ffffffff810ad247
#14 [ffff880113643f30] do_notify_resume at ffffffff8102ab62
#15 [ffff880113643f50] int_with_check at ffffffff816b52bd
RIP: 00007fd2f72abe90 RSP: 00007ffe5d25d848 RFLAGS: 00000246
RAX: 0000000000000000 RBX: 000055bac5cda3c0 RCX: ffffffffffffffff
RDX: 0000000000000008 RSI: 0000000000000001 RDI: 0000000000000003
RBP: 0000000000000000 R8: 000055bac5cda4a0 R9: 00007fd2fa6bb840
R10: 000000000000000d R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000001 R14: 00007fd2e8771f58 R15: 0000000000000000
ORIG_RAX: 0000000000000003 CS: 0033 SS: 002b
KERNEL: /usr/lib/debug/lib/modules/3.10.0-693.11.1.el7.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2017-11-03-01:14:25/vmcore [PARTIAL DUMP]
CPUS: 24
DATE: Fri Nov 3 01:13:19 2017
UPTIME: 3 days, 18:53:59
LOAD AVERAGE: 2.66, 2.58, 2.16
TASKS: 744
NODENAME: server
RELEASE: 3.10.0-693.2.2.el7.x86_64
VERSION: #1 SMP Tue Sep 12 22:26:13 UTC 2017
MACHINE: x86_64 (2666 Mhz)
MEMORY: 72 GB
PANIC: "BUG: unable to handle kernel NULL pointer dereference at (null)"
bt output:
PID: 144 TASK: ffff8808faf52f70 CPU: 7 COMMAND: "kswapd1"
#0 [ffff8808f9e0f7e8] machine_kexec at ffffffff8105c4cb
#1 [ffff8808f9e0f848] __crash_kexec at ffffffff81104a32
#2 [ffff8808f9e0f918] crash_kexec at ffffffff81104b20
#3 [ffff8808f9e0f958] no_context at ffffffff8169d2ba
#4 [ffff8808f9e0f9a8] no_context at ffffffff8169d350
#5 [ffff8808f9e0f9f0] __bad_area_nosemaphore at ffffffff8169d4ba
#6 [ffff8808f9e0fa00] __do_page_fault at ffffffff816b017e
#7 [ffff8808f9e0fa08] __radix_tree_create at ffffffff81328a9e
#8 [ffff8808f9e0fa60] __do_page_fault at ffffffff816b0325
#9 [ffff8808f9e0fa90] general_protection at ffffffff816ac548
[exception RIP: crc32_generic_combine+89]
RIP: ffffffff8133db39 RSP: ffff8808f9e0fb40 RFLAGS: 00010207
RAX: 0000000000000000 RBX: ffff88010b138008 RCX: dead000000000200
RDX: 0000000000000000 RSI: ffff88010b8bc570 RDI: ffff88010b138008
RBP: ffff8808f9e0fb40 R8: e018000000000000 R9: 010b8bc5700c0000
R10: fed6747d72e55c03 R11: 0000000000000000 R12: ffff88010b138000
R13: ffff8808f47dbcb8 R14: ffff8811f973f108 R15: ffff880143a76b48
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
#10 [ffff8808f9e0fb48] file_has_perm at ffffffff812b752d
#11 [ffff8808f9e0fb88] evict at ffffffff8121d5e2
#12 [ffff8808f9e0fbb8] iget_locked at ffffffff8121d81a
#13 [ffff8808f9e0fbe0] iget_locked at ffffffff8121d8ce
#14 [ffff8808f9e0fc08] new_inode_pseudo at ffffffff8121e8c4
#15 [ffff8808f9e0fc70] ns_set_super at ffffffff81203878
#16 [ffff8808f9e0fca8] shrink_slab at ffffffff81195413
#17 [ffff8808f9e0fcc0] vmpressure_register_event at ffffffff811f7547
#18 [ffff8808f9e0fd48] balance_pgdat at ffffffff81199081
#19 [ffff8808f9e0fe20] kswapd at ffffffff81199323
#20 [ffff8808f9e0fe78] wake_up_atomic_t at ffffffff810b1910
#21 [ffff8808f9e0fea8] balance_pgdat at ffffffff811991b0
#22 [ffff8808f9e0fec8] kthread at ffffffff810b098f
#23 [ffff8808f9e0ff50] save_rest at ffffffff816b4f58
https://pastebin.ca/3955189
KERNEL: /usr/lib/debug/lib/modules/3.10.0-693.11.1.el7.x86_64/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2017-10-30-07:14:38/vmcore [PARTIAL DUMP]
CPUS: 24
DATE: Mon Oct 30 06:13:31 2017
UPTIME: 3 days, 06:21:18
LOAD AVERAGE: 1.03, 1.19, 1.36
TASKS: 707
NODENAME: server
RELEASE: 3.10.0-693.2.2.el7.x86_64
VERSION: #1 SMP Tue Sep 12 22:26:13 UTC 2017
MACHINE: x86_64 (2666 Mhz)
MEMORY: 72 GB
PANIC: "general protection fault: 0000 [#1] SMP "
bt output:
PID: 32005 TASK: ffff8808f8bfeeb0 CPU: 4 COMMAND: "vtund"
#0 [ffff8804623fb950] machine_kexec at ffffffff8105c4cb
#1 [ffff8804623fb9b0] __crash_kexec at ffffffff81104a32
#2 [ffff8804623fba80] crash_kexec at ffffffff81104b20
#3 [ffff8804623fbac0] die at ffffffff8102e97b
#4 [ffff8804623fbaf0] do_general_protection at ffffffff816acc3e
#5 [ffff8804623fbb20] xen_int3 at ffffffff816ac4e8
[exception RIP: memcmp]
RIP: ffffffff8132b980 RSP: ffff8804623fbbd0 RFLAGS: 00010286
RAX: ffff8801f3cd7048 RBX: 000000000000000a RCX: ffff8806917dc048
RDX: ffff8808fb333800 RSI: ffff8808f9f482e8 RDI: 0236303633043a00
RBP: ffff8804623fbc40 R8: 0000000000000036 R9: 0000000000000000
R10: ffff88017fc03400 R11: 0000000000000000 R12: ffffffff8197ea30
R13: 0236303633043a00 R14: ffff8808fb333868 R15: ffff8808a4f6a300
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0000
#6 [ffff8804623fbbd0] insert_header at ffffffff8127cb99
#7 [ffff8804623fbc48] register_leaf_sysctl_tables at ffffffff8127d28b
#8 [ffff8804623fbc60] bstr_printf at ffffffff8132fb59
#9 [ffff8804623fbd40] ipv6_add_dev at ffffffff8164784a
#10 [ffff8804623fbd70] addrconf_notify at ffffffff8164c9c9
#11 [ffff8804623fbd78] sysfs_slab_alias at ffffffff811df8d6
#12 [ffff8804623fbd88] dropmon_net_event at ffffffff815ab686
#13 [ffff8804623fbde0] trace_do_page_fault at ffffffff816b051c
#14 [ffff8804623fbe18] raw_notifier_call_chain at ffffffff810b68d6
#15 [ffff8804623fbe28] call_netdevice_notifiers_info at ffffffff8158306d
#16 [ffff8804623fbe50] register_netdevice at ffffffff8158c576
#17 [ffff8804623fbe88] __tun_chr_ioctl at ffffffffc07fca8e [tun]
#18 [ffff8804623fbf20] tun_chr_compat_ioctl at ffffffffc07fd10b [tun]
#19 [ffff8804623fbf30] compat_sys_ioctl at ffffffff8125d73b
#20 [ffff8804623fbf80] cstar_tracesys at ffffffff816b746c
RIP: 00000000080d458c RSP: 00000000fff1980c RFLAGS: 00000246
RAX: ffffffffffffffda RBX: ffffffff816b746c RCX: 00000000400454ca
RDX: 00000000fff19840 RSI: 00000000fff19840 RDI: 00000000fff19860
RBP: 00000000fff19878 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
ORIG_RAX: 0000000000000036 CS: 0023 SS: 002b
Answer 1
Run a memory test. Your problem is memory corruption. Memory faults can cause random, hard-to-diagnose errors.
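One hint in the posted backtraces is that some registers at the faulting instruction do not look like kernel pointers at all. RBX in the inet6_csk_search_req fault is 3932383931333431, and RDI and R13 in the memcmp fault are 0236303633043a00. A minimal Python sketch that reads such a value as eight little-endian bytes and shows the printable ASCII characters follows. The function name register_as_text is hypothetical and this is not a crash feature.

```python
def register_as_text(value):
    """Render a 64-bit register value as 8 little-endian bytes.

    Printable ASCII bytes are shown as characters, all others as '.'.
    """
    raw = value.to_bytes(8, "little")
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


if __name__ == "__main__":
    # Register values copied from the bt output in the question.
    for name, value in (
        ("RBX (inet6_csk_search_req)", 0x3932383931333431),
        ("RDI (memcmp)", 0x0236303633043A00),
        ("RSP (inet6_csk_search_req)", 0xFFFF8808FBA03B98),
    ):
        print(f"{name}: {value:016x} -> {register_as_text(value)!r}")
```

The first value decodes to the digit string 14319829 and the second contains the digits 3606, which suggests that text data ended up where a pointer was expected. This is consistent with memory corruption but does not by itself distinguish faulty hardware from a software bug that overwrites memory, so the memory test is still worth running.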