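For the past month or so I have been getting what look like random kernel errors. I have started to notice a pattern: in every case, the call trace involves the mmap functions.
Whenever one of these happens, the process it occurred in (Chromium in the trace below) hangs, and trying to kill it with SIGKILL only makes the kill command hang as well. To get the system stable again, I have to power the box off completely and restart it.
Until a recent kernel update, the computer would also shut down completely at random, with no warning and nothing in the logs. Thankfully, that seems to have stopped.
Question: does this point to a hardware problem? A failing mmap suggests a RAM problem (although I ran memcheck for more than 12 hours without a single error). Or is this really just a bug in the kernel? If so, what should I do about it?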
$ uname -a
Linux [name] 3.11.0-15-generic #23-Ubuntu SMP Mon Dec 9 18:17:04 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
Trace from dmesg:
[252563.113569] BUG: unable to handle kernel paging request at 0000020000000018
[252563.113589] IP: [<ffffffff811619e0>] vma_interval_tree_insert+0x30/0x90
[252563.113607] PGD 0
[252563.113612] Oops: 0000 [#1] SMP
[252563.113620] Modules linked in: serpent_avx_x86_64 serpent_sse2_x86_64 serpent_generic twofish_generic twofish_avx_x86_64 twofish_x86_64_3way twofish_x86_64 twofish_common xts hidp pci_stub vboxpci(OF) vboxnetadp(OF) vboxnetflt(OF) vboxdrv(OF) vmw_vsock_vmci_transport vsock vmw_vmci parport_pc ppdev rfcomm bnep binfmt_misc usblp x86_pkg_temp_thermal kvm_intel kvm eeepc_wmi asus_wmi sparse_keymap snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel joydev snd_hda_codec btusb bluetooth cdc_acm snd_hwdep snd_pcm microcode snd_page_alloc snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq snd_seq_device snd_timer psmouse snd serio_raw mei_me mei lpc_ich soundcore mac_hid coretemp lp parport dm_crypt raid10 raid456 async_memcpy async_raid6_recov async_pq async_xor async_tx xor hid_generic raid6_pq raid0 multipath linear hid_logitech_dj usbhid hid raid1 mxm_wmi radeon crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd i2c_algo_bit ttm ahci libahci drm_kms_helper e1000e drm video ptp pps_core wmi
[252563.113870] CPU: 3 PID: 13428 Comm: Chrome_IOThread Tainted: GF O 3.11.0-15-generic #23-Ubuntu
[252563.113890] Hardware name: ASUS All Series/MAXIMUS VI HERO, BIOS 0224 04/25/2013
[252563.113906] task: ffff88079bc9aee0 ti: ffff880768020000 task.ti: ffff880768020000
[252563.113922] RIP: 0010:[<ffffffff811619e0>] [<ffffffff811619e0>] vma_interval_tree_insert+0x30/0x90
[252563.113943] RSP: 0018:ffff880768021d90 EFLAGS: 00010206
[252563.113954] RAX: 0000020000000000 RBX: ffff8806d7f4c980 RCX: 0000000000000000
[252563.113969] RDX: ffff88079bb7bd70 RSI: ffff88079bb7bd70 RDI: ffff88038fa57c38
[252563.113984] RBP: ffff880768021d98 R08: 000000000000007f R09: 0000000000000000
[252563.114000] R10: ffff88038fa57c38 R11: 00007f3f14132000 R12: ffff88038fa57c38
[252563.114015] R13: ffff880100babae8 R14: ffff880100babaf0 R15: ffff88079bb7bd88
[252563.114030] FS: 00007f3f4fffe700(0000) GS:ffff88081ecc0000(0000) knlGS:0000000000000000
[252563.114047] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[252563.114059] CR2: 0000020000000018 CR3: 00000007ed0b8000 CR4: 00000000001407e0
[252563.114074] Stack:
[252563.114079] ffffffff8116b698 ffff880768021dd8 ffffffff8116c275 ffff880100babac8
[252563.114097] ffff880100babaf0 00007f3f140b2000 ffff880100babae8 ffff8806daf9fd00
[252563.114114] ffff880100babac8 ffff880768021e60 ffffffff8116e77c ffff8806daf9fd00
[252563.114132] Call Trace:
[252563.114139] [<ffffffff8116b698>] ? __vma_link_file+0x48/0x80
[252563.114153] [<ffffffff8116c275>] vma_link+0x75/0xc0
[252563.114164] [<ffffffff8116e77c>] mmap_region+0x48c/0x610
[252563.114177] [<ffffffff8116ec05>] do_mmap_pgoff+0x305/0x3c0
[252563.114190] [<ffffffff8115a3fd>] vm_mmap_pgoff+0x8d/0xc0
[252563.114202] [<ffffffff8116d253>] SyS_mmap_pgoff+0x1d3/0x270
[252563.114215] [<ffffffff81017402>] SyS_mmap+0x22/0x30
[252563.114227] [<ffffffff816f721d>] system_call_fastpath+0x1a/0x1f
[252563.114240] Code: 48 8b 47 08 48 2b 07 49 89 fa 4c 8b 8f 98 00 00 00 48 89 f2 31 c9 48 c1 e8 0c 4d 8d 44 01 ff eb 27 66 2e 0f 1f 84 00 00 00 00 00 <4c> 39 40 18 73 04 4c 89 40 18 4c 3b 48 40 48 8d 48 08 48 8d 50
[252563.114312] RIP [<ffffffff811619e0>] vma_interval_tree_insert+0x30/0x90
[252563.114327] RSP <ffff880768021d90>
[252563.114335] CR2: 0000020000000018
[252563.117845] ---[ end trace eb82b12e51fc5733 ]---
Answer 1
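Since you have already run a memory test for long enough, the most obvious hardware suspect has been ruled out. I assume you have noted whether the line
BUG: unable to handle kernel paging request at 0000020000000018
carries the same address every time or a different one, right?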
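If you have several of these oopses in your logs, comparing the addresses by hand gets tedious. Here is a minimal sketch of how the comparison could be automated. It assumes the log text contains the same "BUG: unable to handle kernel paging request at" header shown in your trace; the names PAGING_RE, faulting_addresses and summarize are made up for this example and are not part of any existing tool.

```python
import re
from collections import Counter

# Header line the kernel prints for this kind of oops (as seen in the trace above).
PAGING_RE = re.compile(
    r"BUG: unable to handle kernel paging request at ([0-9a-fA-F]+)"
)


def faulting_addresses(log_text):
    """Return every faulting address found in dmesg-style text, as integers."""
    return [int(match.group(1), 16) for match in PAGING_RE.finditer(log_text)]


def summarize(log_text):
    """Count how many times each faulting address appears in the text."""
    return Counter(faulting_addresses(log_text))
```

You could feed summarize() the saved output of dmesg or the contents of your kernel log and see at a glance whether one address keeps coming back or whether it changes from crash to crash.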
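I can't help you interpret this report, but may I suggest using Apport to collect information about your crashes? Apport is Ubuntu's official crash and bug data collection package, and you will find a good introduction to it online.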
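You need to enable it (edit the file /etc/apport/crashdb.conf with sudo, find the line
'problem_types': ['Bug', 'Package'],
and put a hash sign # at the start of it), and it will then produce a full trace of the calls leading up to the crash. There is no need to worry about ulimit on newer versions of Ubuntu, since Apport is able to work around its setting even when it is set to 0.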
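If you would rather not make that edit by hand, here is a minimal sketch of it. It assumes the line in your crashdb.conf starts with 'problem_types' exactly as quoted above; the function name comment_out_problem_types is invented for this example. It only works on text you pass in, so it never touches the real file by itself.

```python
def comment_out_problem_types(conf_text):
    """Return conf_text with any active 'problem_types' line commented out."""
    result = []
    for line in conf_text.splitlines(keepends=True):
        stripped = line.lstrip()
        # Lines that are already commented start with '#', so they are left alone.
        if stripped.startswith("'problem_types'"):
            indent = line[: len(line) - len(stripped)]
            line = indent + "#" + stripped
        result.append(line)
    return "".join(result)
```

To apply it you would read /etc/apport/crashdb.conf, pass its contents through the function, and write the result back as root, ideally after making a backup copy. Running it a second time changes nothing, because the line no longer starts with 'problem_types'.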
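In general, the best course is to upload the crash report to Launchpad, which Apport does automatically. Some of the information, though, can be useful even to inexperienced users. The introduction mentioned above states: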
Some fields warrant further details:
SegvAnalysis: when examining a Segmentation Fault (signal 11), Apport attempts to review the exact machine instruction that caused the fault, and checks the program counter, source, and destination addresses, looking for any virtual memory address (VMA) that is outside an allocated range (as reported in the ProcMaps attachment).
SegvReason: a VMA can be read from, written to, or executed. On a SegFault, one of these 3 CPU actions has taken place at a given VMA that is either not allocated, or lacks permissions to perform the action. For example:
SegvReason: reading NULL VMA would mean that a NULL pointer was most likely dereferenced while reading a value.
SegvReason: writing unknown VMA would mean that something was attempting to write to the destination of a pointer aimed outside of allocated memory. (This is sometimes a security issue.)
SegvReason: executing writable VMA [stack] would mean that something was causing code on the stack to be executed, but the stack (correctly) lacked execute permissions. (This is almost always a security issue.)
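In the past, this allowed me to pinpoint exactly which program was causing the crashes (VirtualBox). After I purged it completely and reinstalled it, the problem went away. I can only hope you have the same luck.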