Linux freezes despite no errors from memtest86, stress, or smartctl

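I have been trying to diagnose freezes on Arch since Linux 5.6. I believe I was hitting the Ryzen-related idle issue, which I resolved with zenstates.py, but now my machine frequently locks up and/or crashes with paging problems. My new forum post is here, and logs from reproducing the problem are here and here. The hardware is cataloged here. An example of one of the traces is below: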

May 28 22:13:31 leo kernel: general protection fault, probably for non-canonical address 0xcff8d364b0b8000: 0000 [#1] PREEMPT SMP NOPTI
May 28 22:13:31 leo kernel: CPU: 0 PID: 350336 Comm: rtorrent main Tainted: G S                5.12.2-arch1-1 #1
May 28 22:13:31 leo kernel: Hardware name: Micro-Star International Co., Ltd. MS-7A40/B450I GAMING PLUS AC (MS-7A40), BIOS A.F3 02/03/2021
May 28 22:13:31 leo kernel: RIP: 0010:lock_page_memcg+0x2e/0xa0
May 28 22:13:31 leo kernel: Code: 00 41 54 55 53 48 8b 47 08 48 89 fb 48 8d 50 ff a8 01 48 0f 45 da e8 81 36 e4 ff 0f 1f 44 00 00 4c 8b 63 38 49 83 e4 fc 74 3e <41> 8b 84 24 80 09 00 00 85 c0 7e 35 49 8d ac 24 40 04 00 00 48 89
May 28 22:13:31 leo kernel: RSP: 0018:ffffa0c28305fb00 EFLAGS: 00010206
May 28 22:13:31 leo kernel: RAX: ffff8d378f618000 RBX: ffffe17e875f8d80 RCX: 0000000000000000
May 28 22:13:31 leo kernel: RDX: c7ffe17e875f4e47 RSI: 0000000000000000 RDI: ffffe17e875f8d80
May 28 22:13:31 leo kernel: RBP: ffffe17e875f8d80 R08: ffff8d379c47d320 R09: 0000000000000140
May 28 22:13:31 leo kernel: R10: ffff8d3a7f33c680 R11: 000000000004baa6 R12: 0cff8d364b0b8000
May 28 22:13:31 leo kernel: R13: 00005576bb233000 R14: ffffa0c28305fc98 R15: 00005576bb234000
May 28 22:13:31 leo kernel: FS:  00007f389e8f7780(0000) GS:ffff8d3a6ee00000(0000) knlGS:0000000000000000
May 28 22:13:31 leo kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 28 22:13:31 leo kernel: CR2: 00007f389f4335b2 CR3: 000000008203a000 CR4: 00000000003506f0
May 28 22:13:31 leo kernel: Call Trace:
May 28 22:13:31 leo kernel:  page_remove_rmap+0x13/0x300
May 28 22:13:31 leo kernel:  unmap_page_range+0x72f/0xe20
May 28 22:13:31 leo kernel:  unmap_vmas+0x83/0x100
May 28 22:13:31 leo kernel:  exit_mmap+0xb5/0x1f0
May 28 22:13:31 leo kernel:  mmput+0x52/0x120
May 28 22:13:31 leo kernel:  begin_new_exec+0x4af/0xa60
May 28 22:13:31 leo kernel:  load_elf_binary+0x734/0x1750
May 28 22:13:31 leo kernel:  ? __kernel_read+0x1e1/0x310
May 28 22:13:31 leo kernel:  bprm_execve+0x273/0x670
May 28 22:13:31 leo kernel:  do_execveat_common+0x192/0x1c0
May 28 22:13:31 leo kernel:  __x64_sys_execve+0x39/0x50
May 28 22:13:31 leo kernel:  do_syscall_64+0x33/0x40
May 28 22:13:31 leo kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
May 28 22:13:31 leo kernel: RIP: 0033:0x7f389f37364b
May 28 22:13:31 leo kernel: Code: Unable to access opcode bytes at RIP 0x7f389f373621.
May 28 22:13:31 leo kernel: RSP: 002b:00007ffea64abf08 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
May 28 22:13:31 leo kernel: RAX: ffffffffffffffda RBX: 00005576c3b51278 RCX: 00007f389f37364b
May 28 22:13:31 leo kernel: RDX: 00007ffea64aeeb8 RSI: 00007ffea64ad060 RDI: 00007ffea64abf10
May 28 22:13:31 leo kernel: RBP: 00007ffea64abfa0 R08: 0000000000000fff R09: 00007ffea64b0f55
May 28 22:13:31 leo kernel: R10: 00007ffea64abf10 R11: 0000000000000202 R12: 00007ffea64ad060
May 28 22:13:31 leo kernel: R13: 00007ffea64aeeb8 R14: 00007ffea64abf10 R15: 00007ffea64b0f4d
May 28 22:13:31 leo kernel: Modules linked in: ccm 8021q garp mrp stp llc nct6775 hwmon_vid iwlmvm intel_rapl_msr intel_rapl_common amdgpu mac80211 vfat edac_mce_amd fat libarc4 iwlwifi snd_hda_codec_hdmi btusb btrtl btbcm btintel snd_hda_intel kvm snd_intel_dspcfg snd_intel_sdw_acpi bluetooth snd_hda_codec gpu_sched i2c_algo_bit cfg80211 snd_hda_core irqbypass snd_hwdep ecdh_generic drm_ttm_helper crct10dif_pclmul r8169 crc32_pclmul ttm ecc ghash_clmulni_intel wmi_bmof snd_pcm aesni_intel drm_kms_helper cec crypto_simd syscopyarea ccp cryptd snd_timer realtek snd sysfillrect mdio_devres rapl pcspkr rfkill sysimgblt libphy fb_sys_fops soundcore rng_core sp5100_tco i2c_piix4 k10temp wmi video mac_hid gpio_amdpt pinctrl_amd gpio_generic acpi_cpufreq drm sg fuse agpgart bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 crc32c_intel xhci_pci xhci_pci_renesas
May 28 22:13:31 leo kernel: ---[ end trace 1ba6b211d1800231 ]---
May 28 22:13:31 leo kernel: RIP: 0010:lock_page_memcg+0x2e/0xa0
May 28 22:13:31 leo kernel: Code: 00 41 54 55 53 48 8b 47 08 48 89 fb 48 8d 50 ff a8 01 48 0f 45 da e8 81 36 e4 ff 0f 1f 44 00 00 4c 8b 63 38 49 83 e4 fc 74 3e <41> 8b 84 24 80 09 00 00 85 c0 7e 35 49 8d ac 24 40 04 00 00 48 89
May 28 22:13:31 leo kernel: RSP: 0018:ffffa0c28305fb00 EFLAGS: 00010206
May 28 22:13:31 leo kernel: RAX: ffff8d378f618000 RBX: ffffe17e875f8d80 RCX: 0000000000000000
May 28 22:13:31 leo kernel: RDX: c7ffe17e875f4e47 RSI: 0000000000000000 RDI: ffffe17e875f8d80
May 28 22:13:31 leo kernel: RBP: ffffe17e875f8d80 R08: ffff8d379c47d320 R09: 0000000000000140
May 28 22:13:31 leo kernel: R10: ffff8d3a7f33c680 R11: 000000000004baa6 R12: 0cff8d364b0b8000
May 28 22:13:31 leo kernel: R13: 00005576bb233000 R14: ffffa0c28305fc98 R15: 00005576bb234000
May 28 22:13:31 leo kernel: FS:  00007f389e8f7780(0000) GS:ffff8d3a6ee00000(0000) knlGS:0000000000000000
May 28 22:13:31 leo kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 28 22:13:31 leo kernel: CR2: 00007f389f373621 CR3: 000000008203a000 CR4: 00000000003506f0
May 28 22:13:31 leo kernel: note: rtorrent main[350336] exited with preempt_count 1
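
For readers unfamiliar with the "non-canonical address" wording in the first line of the trace, here is a minimal sketch of what that check means. It assumes 48-bit virtual addresses (4-level paging), which I believe applies to this CPU but have not verified on the machine itself. The helper names `is_canonical` and `sign_extend` are made up for illustration; they are not kernel functions. The second helper only shows which canonical address shares the same low 48 bits; it does not prove what the pointer was supposed to be.

```python
VA_BITS = 48  # assumption: 4-level paging, 48-bit virtual addresses


def is_canonical(addr: int, va_bits: int = VA_BITS) -> bool:
    """True if bits 63..(va_bits-1) of addr are all zeros or all ones."""
    top = addr >> (va_bits - 1)                    # the upper 17 bits for va_bits=48
    all_ones = (1 << (64 - va_bits + 1)) - 1       # 0x1ffff for va_bits=48
    return top == 0 or top == all_ones


def sign_extend(addr: int, va_bits: int = VA_BITS) -> int:
    """Canonical address that shares the low va_bits bits with addr."""
    low = addr & ((1 << va_bits) - 1)
    if low >> (va_bits - 1):                       # bit 47 set: fill the upper bits with ones
        low |= ((1 << (64 - va_bits)) - 1) << va_bits
    return low


faulting = 0x0CFF8D364B0B8000   # R12 in the trace (the address the kernel reports)
other = 0xFFFF8D378F618000      # RAX in the same trace, for comparison

print(hex(faulting), is_canonical(faulting))
print(hex(other), is_canonical(other))
print(hex(sign_extend(faulting)))
```

The value in R12 fails the check because its upper bits are neither all zeros nor all ones, which is why the CPU raises a general protection fault instead of a page fault.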

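I have tested the RAM with memtest86, the CPU with stress, and the disks with smartctl, and found no errors. I have tried linux-lts and a mainline kernel without any Arch patches (not that there are many), but the same problem occurs. I don't know what else to do to find out what is wrong with my machine, since others report success running Linux on the 3200G. What else can I do to track down the problem?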
