What causes suspend to fail intermittently?

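Suspending or hibernating my PC running Ubuntu 22.10 sometimes hangs. The display and input devices turn off, but the PC stays powered on and needs a hard shutdown. When I check the logs, I see no errors after the system enters sleep.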

I tried adding "no_console_suspend initcall_debug" to the boot parameters to get more information, but still no errors are reported after the system goes to sleep.
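Before drawing conclusions from the logs, it can help to confirm that the running kernel was actually booted with those parameters. The following is a minimal sketch, assuming a Linux system where /proc/cmdline is readable; the helper name `missing_params` is made up for illustration.

```python
from pathlib import Path

# Parameters the question adds to the kernel command line.
WANTED = ("no_console_suspend", "initcall_debug")


def missing_params(cmdline: str, wanted=WANTED) -> list[str]:
    """Return the wanted parameters that are absent from a kernel command line."""
    # Compare only the name part, so "name=value" tokens are recognized too.
    present = {token.split("=", 1)[0] for token in cmdline.split()}
    return [name for name in wanted if name not in present]


if __name__ == "__main__":
    # /proc/cmdline holds the command line of the running kernel.
    cmdline_file = Path("/proc/cmdline")
    if cmdline_file.exists():
        absent = missing_params(cmdline_file.read_text())
        print("missing:", ", ".join(absent) if absent else "none")
    else:
        print("/proc/cmdline not found (not a Linux system?)")
```

If a parameter is reported missing, the bootloader configuration was probably not regenerated or the wrong boot entry was selected.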

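*Note that although I am currently using the Liquorix kernel, this problem also occurs with the stock 22.10 kernel and on 22.04. I did not have this problem on 20.04 with the same hardware. The hangs started after I installed an NVMe drive and did a fresh install of 22.04, which I eventually upgraded to 22.10.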

Excerpt from dmesg for a suspend that hung:

Nov 17 15:43:21.726542 MBLPC kernel: ------------[ cut here ]------------
Nov 17 15:43:21.726690 MBLPC kernel: WARNING: CPU: 12 PID: 6060 at kernel/sched/alt_core.c:1539 migrate_enable+0xa9/0xb0
Nov 17 15:43:21.726704 MBLPC kernel: Modules linked in: rfcomm snd_hrtimer xt_MASQUERADE xt_CHECKSUM nft_chain_nat nf_nat vboxnetadp(O) vboxnetflt(O) vboxdrv(O) nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache netfs nvme_fabrics af_packet bridge stp llc cmac algif_hash algif_skcipher af_alg bnep ip6t_REJECT nf_reject_ipv6 xt_hl ip6_tables ip6t_rt ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_comment xt_multiport nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack sunrpc nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables nfnetlink nls_utf8 nls_cp437 vfat xfs fat amdgpu snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio iwlmvm snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi intel_rapl_msr intel_rapl_common snd_hda_codec edac_mce_amd snd_oxygen radeon kvm_amd snd_hda_core snd_oxygen_lib mac80211 snd_mpu401_uart snd_hwdep libarc4 kvm gpu_sched snd_pcm drm_buddy drm_ttm_helper ttm snd_seq_dummy iwlwifi btusb snd_seq_oss drm_display_helper crct10dif_pclmul btrtl
Nov 17 15:43:21.726753 MBLPC kernel:  polyval_clmulni polyval_generic ghash_clmulni_intel btbcm snd_seq_midi snd_seq_midi_event aesni_intel btintel crypto_simd btmtk snd_rawmidi cryptd cec mousedev joydev mxm_wmi xpad wmi_bmof snd_seq cfg80211 bluetooth ff_memless rc_core snd_seq_device k10temp snd_timer drm_kms_helper ccp rng_core snd syscopyarea sysfillrect sysimgblt ecdh_generic fb_sys_fops soundcore agpgart rfkill acpi_cpufreq sg squashfs loop vfio_pci vfio_pci_core vfio_virqfd irqbypass vfio_iommu_type1 vfio msr parport_pc drm ppdev lp parport fuse ramoops reed_solomon efi_pstore ip_tables x_tables ext4 crc16 mbcache jbd2 uas usb_storage btrfs blake2b_generic xor raid6_pq usbhid dm_cache_smq dm_cache dm_persistent_data dm_bio_prison dm_bufio libcrc32c crc32c_generic dm_mod crc32_pclmul crc32c_intel i2c_piix4 igb i2c_algo_bit dca xhci_pci xhci_pci_renesas gpio_amdpt wmi gpio_generic
Nov 17 15:43:21.726780 MBLPC kernel: CPU: 12 PID: 6060 Comm: firefox:cs0 Tainted: G           O       6.0.0-9.1-liquorix-amd64 #1  liquorix 6.0-5ubuntu1~kinetic
Nov 17 15:43:21.726797 MBLPC kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X370 Taichi, BIOS P7.00 01/15/2022
Nov 17 15:43:21.726808 MBLPC kernel: RIP: 0010:migrate_enable+0xa9/0xb0
Nov 17 15:43:21.726819 MBLPC kernel: Code: e9 cc 71 d2 00 83 ea 01 66 89 90 00 01 00 00 31 c0 31 d2 31 c9 e9 b7 71 d2 00 e8 ec 8e f2 ff 31 c0 31 d2 31 c9 e9 a7 71 d2 00 <0f> 0b eb 94 0f 1f 00 0f 1f 44 00 00 8b 05 15 06 98 01 83 f8 ff 74
Nov 17 15:43:21.726830 MBLPC kernel: RSP: 0018:ffffc90016d2be00 EFLAGS: 00010282
Nov 17 15:43:21.726841 MBLPC kernel: RAX: ffff88824b48b500 RBX: 000000007fff0000 RCX: ffff88824b48b5e8
Nov 17 15:43:21.726851 MBLPC kernel: RDX: 000000000000000c RSI: 00000000c000003e RDI: ffffc90016d2be90
Nov 17 15:43:21.726863 MBLPC kernel: RBP: ffff8881d7571b00 R08: 00000000c0186444 R09: 000000000000004b
Nov 17 15:43:21.726874 MBLPC kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffffc90016d2be90
Nov 17 15:43:21.726883 MBLPC kernel: R13: 000000007fff0000 R14: 0000000000000000 R15: ffffc9000ff81000
Nov 17 15:43:21.726895 MBLPC kernel: FS:  00007ff42aba1700(0000) GS:ffff888ffeb00000(0000) knlGS:0000000000000000
Nov 17 15:43:21.726907 MBLPC kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 17 15:43:21.726918 MBLPC kernel: CR2: 00007ff43b928000 CR3: 000000024ad04000 CR4: 0000000000350ee0
Nov 17 15:43:21.726927 MBLPC kernel: Call Trace:
Nov 17 15:43:21.726938 MBLPC kernel:  <TASK>
Nov 17 15:43:21.726948 MBLPC kernel:  __seccomp_filter+0xde/0x870
Nov 17 15:43:21.726957 MBLPC kernel:  ? futex_wake+0x7c/0x180
Nov 17 15:43:21.726970 MBLPC kernel:  syscall_trace_enter.constprop.0+0xa3/0x1b0
Nov 17 15:43:21.726982 MBLPC kernel:  do_syscall_64+0x15/0xc0
Nov 17 15:43:21.726992 MBLPC kernel:  entry_SYSCALL_64_after_hwframe+0x63/0xcd
Nov 17 15:43:21.727003 MBLPC kernel: RIP: 0033:0x7ff44e2c23ab
Nov 17 15:43:21.727012 MBLPC kernel: Code: 0f 1e fa 48 8b 05 e5 7a 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b5 7a 0d 00 f7 d8 64 89 01 48
Nov 17 15:43:21.727022 MBLPC kernel: RSP: 002b:00007ff42aba09e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
Nov 17 15:43:21.727033 MBLPC kernel: RAX: ffffffffffffffda RBX: 00007ff42aba0a50 RCX: 00007ff44e2c23ab
Nov 17 15:43:21.727051 MBLPC kernel: RDX: 00007ff42aba0a50 RSI: 00000000c0186444 RDI: 000000000000004b
Nov 17 15:43:21.727060 MBLPC kernel: RBP: 00000000c0186444 R08: 00007ff42aba0bb0 R09: 0000000000000020
Nov 17 15:43:21.727071 MBLPC kernel: R10: 00007ff42aba0bb0 R11: 0000000000000246 R12: 00007ff4398dcb00
Nov 17 15:43:21.727081 MBLPC kernel: R13: 000000000000004b R14: 0000000000000000 R15: 00007ff35aa45090
Nov 17 15:43:21.727090 MBLPC kernel:  </TASK>
Nov 17 15:43:21.727102 MBLPC kernel: ---[ end trace 0000000000000000 ]---
Nov 17 16:32:22.215527 MBLPC kernel: usb 1-6.1.2: USB disconnect, device number 19
Nov 17 19:38:14.312541 MBLPC kernel: usb 1-2.3: USB disconnect, device number 20
Nov 17 19:38:14.517527 MBLPC kernel: usb 1-2.3: new high-speed USB device number 21 using xhci_hcd
Nov 17 19:38:14.634528 MBLPC kernel: usb 1-2.3: New USB device found, idVendor=3842, idProduct=2608, bcdDevice=a1.18
Nov 17 19:38:14.634754 MBLPC kernel: usb 1-2.3: New USB device strings: Mfr=1, Product=2, SerialNumber=0
Nov 17 19:38:14.634831 MBLPC kernel: usb 1-2.3: Product: EVGA Z15 RGB Gaming Keyboard
Nov 17 19:38:14.634910 MBLPC kernel: usb 1-2.3: Manufacturer: EVGA Corporation
Nov 17 19:38:14.650524 MBLPC kernel: input: EVGA Corporation EVGA Z15 RGB Gaming Keyboard as /devices/pci0000:00/0000:00:01.3/0000:02:00.0/usb1/1-2/1-2.3/1-2.3:1.0/0003:3842:2608.000D/input/input34
Nov 17 19:38:14.702526 MBLPC kernel: hid-generic 0003:3842:2608.000D: input,hidraw3: USB HID v1.11 Keyboard [EVGA Corporation EVGA Z15 RGB Gaming Keyboard] on usb-0000:02:00.0-2.3/input0
Nov 17 19:38:14.710524 MBLPC kernel: input: EVGA Corporation EVGA Z15 RGB Gaming Keyboard Mouse as /devices/pci0000:00/0000:00:01.3/0000:02:00.0/usb1/1-2/1-2.3/1-2.3:1.1/0003:3842:2608.000E/input/input35
Nov 17 19:38:14.710552 MBLPC kernel: input: EVGA Corporation EVGA Z15 RGB Gaming Keyboard Consumer Control as /devices/pci0000:00/0000:00:01.3/0000:02:00.0/usb1/1-2/1-2.3/1-2.3:1.1/0003:3842:2608.000E/input/input36
Nov 17 19:38:14.762526 MBLPC kernel: input: EVGA Corporation EVGA Z15 RGB Gaming Keyboard System Control as /devices/pci0000:00/0000:00:01.3/0000:02:00.0/usb1/1-2/1-2.3/1-2.3:1.1/0003:3842:2608.000E/input/input37
Nov 17 19:38:14.762575 MBLPC kernel: hid-generic 0003:3842:2608.000E: input,hiddev97,hidraw4: USB HID v1.11 Mouse [EVGA Corporation EVGA Z15 RGB Gaming Keyboard] on usb-0000:02:00.0-2.3/input1
Nov 17 19:38:14.768524 MBLPC kernel: input: EVGA Corporation EVGA Z15 RGB Gaming Keyboard as /devices/pci0000:00/0000:00:01.3/0000:02:00.0/usb1/1-2/1-2.3/1-2.3:1.2/0003:3842:2608.000F/input/input39
Nov 17 19:38:14.820529 MBLPC kernel: hid-generic 0003:3842:2608.000F: input,hiddev98,hidraw5: USB HID v1.11 Keyboard [EVGA Corporation EVGA Z15 RGB Gaming Keyboard] on usb-0000:02:00.0-2.3/input2
Nov 17 21:45:37.987528 MBLPC kernel: PM: suspend entry (deep)
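The excerpt above ends with a "PM: suspend entry" line and nothing after it, which is the signature of a suspend that never completed. A minimal sketch for finding such entries in a saved kernel log follows. It assumes that a successful cycle logs a matching "PM: suspend exit" line and it does not cover hibernation messages; the function name `unfinished_suspends` is made up for illustration.

```python
import re

# Messages the kernel logs around a suspend cycle (assumed format).
ENTRY = re.compile(r"PM: suspend entry \((?P<state>[^)]+)\)")
EXIT = re.compile(r"PM: suspend exit")


def unfinished_suspends(lines):
    """Return the 'suspend entry' lines that were never followed by an exit."""
    pending = None
    unfinished = []
    for line in lines:
        if ENTRY.search(line):
            # A new entry while one is still pending means the earlier one hung.
            if pending is not None:
                unfinished.append(pending)
            pending = line.rstrip("\n")
        elif EXIT.search(line):
            pending = None
    if pending is not None:
        unfinished.append(pending)
    return unfinished
```

Feeding it the lines of a kernel log exported to a text file gives the timestamps of the hung attempts, which narrows down where to look for the last messages before the hang.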

Update

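I came across a post describing a similar problem where the cause was an NVMe drive. Thinking back, the only hardware change since the problem appeared is the NVMe drive I installed.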

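I noticed that "ignore_loglevel" not only writes messages to the log but also prints them to the screen while suspending and hibernating, which is the most useful behavior for this particular problem. I decided to watch for errors this way and to look closely for anything NVMe-related when a suspend or hibernate fails.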

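The catch is that since I set "ignore_loglevel", the PC has not failed to suspend or hibernate once. I will keep monitoring it, but given the number of suspend and hibernate cycles I have run, it should have failed by now.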

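Another thing I noticed is that after resuming from suspend I now get an authentication prompt asking to update the SMART data for a particular drive. So far it has been a different drive each time.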

Answer 1

Your log already seems to hint at the problem.

WARNING: CPU: 12 PID: 6060 at kernel/sched/alt_core.c:1539 migrate_enable+0xa9/0xb0

Try to troubleshoot it as described in "Best practice to debug Linux* suspend/hibernate issues". Summary:

  1. initcall_debug
  2. no_console_suspend
  3. ignore_loglevel

...

  1. pm_test
  2. ACPI wakeup
  3. acpidump
  4. wakeup
  5. analyze_suspend

I especially recommend the section

4 Debugging suspend/hibernate issues

to identify and debug your specific problem.
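Of the tools listed above, pm_test lets a suspend be attempted only up to a chosen stage, which helps locate the stage that hangs. As a minimal sketch, assuming the kernel exposes /sys/power/pm_test (it depends on power-management debugging support being compiled in) and that the active level is shown in square brackets, the available levels can be read like this; `parse_pm_test` is an illustrative name.

```python
from pathlib import Path

PM_TEST = Path("/sys/power/pm_test")


def parse_pm_test(text: str):
    """Split the contents of pm_test into (active_level, all_levels)."""
    active = None
    levels = []
    for token in text.split():
        # The currently selected level is wrapped in square brackets.
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        levels.append(token)
    return active, levels


if __name__ == "__main__":
    if PM_TEST.exists():
        active, levels = parse_pm_test(PM_TEST.read_text())
        print("active:", active, "available:", levels)
    else:
        print("pm_test is not available on this kernel")
```

Selecting a level means writing its name to the same file as root before triggering a suspend, and writing none afterwards to restore normal behavior.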

Answer 2

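So, two things seem to work as a "solution". With ignore_loglevel as a boot parameter, the hang never occurred again. I don't understand why, but I don't mind keeping it, because it prints output to the screen while hibernating. When a hibernate or resume takes longer than usual, I can now see which step the process has reached instead of staring at a blank screen.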

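As mentioned in the last update, the only hardware difference since the hangs appeared is the NVMe drive I installed. After seeing a forum post about a similar problem, I tried disabling wakeup for the NVMe device via /proc/acpi/wakeup and ran several suspend and hibernate tests. All of them succeeded. I should note that I removed the ignore_loglevel parameter before testing this. I know a script is needed to make this change permanent (https://unix.stackexchange.com/questions/417956/make-changes-to-proc-acpi-wakeup-permanent).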
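The wakeup change described above could be scripted roughly as follows. This is a sketch under assumptions, not the script from the linked question: it assumes the usual column layout of /proc/acpi/wakeup (device, S-state, status, sysfs node) and that writing a device name to the file toggles its state, so it only writes when the device is currently enabled. The device name `GPP0` is a placeholder; the real name of the NVMe device's entry has to be looked up on the machine.

```python
from pathlib import Path

WAKEUP = Path("/proc/acpi/wakeup")


def parse_wakeup(text: str) -> dict[str, bool]:
    """Map each device name to True if wakeup is enabled for it."""
    states = {}
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 3:
            continue  # continuation rows carry no device name
        name, status = fields[0], fields[2]
        states[name] = status.lstrip("*") == "enabled"
    return states


def disable_wakeup(device: str, path: Path = WAKEUP) -> bool:
    """Disable wakeup for one device; return True if a change was made."""
    if parse_wakeup(path.read_text()).get(device):
        # Writing the name toggles the state, so write only when enabled.
        path.write_text(device)
        return True
    return False


if __name__ == "__main__":
    # "GPP0" is a hypothetical name; run as root with the real one.
    if WAKEUP.exists():
        print(parse_wakeup(WAKEUP.read_text()))
```

Because the setting is lost on reboot, a script like this would have to run at every boot, for example from a systemd unit, to make the change permanent.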
