GPU seems to crash Arch on wake-up

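Symptoms:
After closing the lid, the laptop goes to sleep successfully (as evidenced by the fans stopping). On wake-up, fairly often (about one time in ten), everything freezes. This happens when X is running (ratpoison), but also when switching between two virtual terminals (no X running; switching with Ctrl+Alt+F2).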

How should this be investigated? Other data:

OS:

[miro@katana ~]$ uname -a
Linux katana 5.19.2-arch1-1 #1 SMP PREEMPT_DYNAMIC Wed, 17 Aug 2022 13:48:51 +0000 x86_64 GNU/Linux

Relevant part of dmesg:

[    6.340112] nvidia-nvlink: Nvlink Core is being initialized, major device number 510

[    6.340135] traps: Missing ENDBR: _nv011437rm+0x0/0x10 [nvidia]
[    6.340371] ------------[ cut here ]------------
[    6.340371] kernel BUG at arch/x86/kernel/traps.c:253!
[    6.340375] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[    6.340379] CPU: 13 PID: 328 Comm: systemd-modules Tainted: P           OE     5.19.2-arch1-1 #1 1368c994e25e19983709ee8b14ef7d9de0c6a97a
[    6.340383] Hardware name: Micro-Star International Co., Ltd. Katana GF76 11UC/MS-17L2, BIOS E17L2IMS.30F 12/02/2021
[    6.340384] RIP: 0010:exc_control_protection+0xc2/0xd0
[    6.340390] Code: 8b 93 80 00 00 00 be fa 00 00 00 48 c7 c7 56 4f c8 aa e8 b1 1e 4c ff e9 72 ff ff ff 48 c7 c7 3d 4f c8 aa e8 c4 15 fb ff 0f 0b <0f> 0b 66 66 2e 0f 1f 84 00 00 00 00 00 90 66 0f 1f 00 55 53 48 89
[    6.340393] RSP: 0018:ffffc2370055bb88 EFLAGS: 00010002
[    6.340395] RAX: 0000000000000033 RBX: ffffc2370055bba8 RCX: 0000000000000027
[    6.340397] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffa0612fb61660
[    6.340398] RBP: 0000000000000003 R08: 0000000000000000 R09: ffffc2370055ba20
[    6.340400] R10: 0000000000000003 R11: ffffffffab4cb428 R12: 0000000000000000
[    6.340401] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    6.340403] FS:  00007f9a7ace64c0(0000) GS:ffffa0612fb40000(0000) knlGS:0000000000000000
[    6.340405] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.340406] CR2: 00007f9a78f81000 CR3: 0000000100b36002 CR4: 0000000000f70ee0
[    6.340408] PKRU: 55555554
[    6.340409] Call Trace:
[    6.340410]  <TASK>
[    6.340412]  asm_exc_control_protection+0x26/0x30
[    6.340415] RIP: 0010:_nv011437rm+0x0/0x10 [nvidia]
[    6.340646] Code: 66 2e 0f 1f 84 00 00 00 00 00 48 83 ec 08 e8 c7 12 1e 00 48 83 c4 08 48 89 c7 e9 bb ff ff ff 66 2e 0f 1f 84 00 00 00 00 00 90 <48> 89 f7 e9 18 08 00 00 0f 1f 84 00 00 00 00 00 48 89 f7 e9 18 08
[    6.340649] RSP: 0018:ffffc2370055bc58 EFLAGS: 00010202
[    6.340650] RAX: ffffffffc186e8c0 RBX: ffffffffc3a853b0 RCX: 0000000000000000
[    6.340652] RDX: 00000000000d592e RSI: 0000000000000010 RDI: ffffffffc3a853b0
[    6.340654] RBP: ffffa05de7fc5fe0 R08: 0000000000000020 R09: ffffffffc3a853f0
[    6.340655] R10: ffffffffc3a3c0f0 R11: 0000000000000000 R12: 0000000000000010
[    6.340657] R13: ffffa05de7fc3000 R14: 00007f9a7b1f9343 R15: ffffc2370055bdd8
[    6.340659]  ? _nv034928rm+0x20/0x20 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.340888]  _nv011435rm+0x24/0xe0 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.341114]  _nv034929rm+0xe/0xa0 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.341341]  _nv034932rm+0x1d/0x30 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.341565]  _nv034934rm+0x2f/0x40 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.341790]  _nv015577rm+0x15/0x70 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.341913]  _nv000643rm+0x9/0x20 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.342034]  ? cdev_add+0x50/0x70
[    6.342037]  rm_init_rm+0x17/0x60 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.342227]  nvidia_init_module+0x242/0x616 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.342368]  ? nvidia_init_module+0x616/0x616 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.342503]  nvidia_frontend_init_module+0x50/0x94 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.342641]  ? nvidia_init_module+0x616/0x616 [nvidia 180c1458287a5b0a18d0491f7bc4adc4fd70ea8b]
[    6.342775]  do_one_initcall+0x5a/0x220
[    6.342780]  do_init_module+0x4a/0x1e0
[    6.342783]  __do_sys_init_module+0x138/0x1b0
[    6.342785]  do_syscall_64+0x5c/0x90
[    6.342789]  ? handle_mm_fault+0xb2/0x280
[    6.342791]  ? do_user_addr_fault+0x1db/0x690
[    6.342795]  ? exc_page_fault+0x74/0x170
[    6.342796]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[    6.342800] RIP: 0033:0x7f9a7b0e6ace
[    6.342803] Code: 48 8b 0d d5 f2 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d a2 f2 0c 00 f7 d8 64 89 01 48
[    6.342805] RSP: 002b:00007ffcb8c96d08 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[    6.342808] RAX: ffffffffffffffda RBX: 0000558a439e89d0 RCX: 00007f9a7b0e6ace
[    6.342809] RDX: 00007f9a7b1f9343 RSI: 0000000003ce1720 RDI: 00007f9a752a0010
[    6.342811] RBP: 00007f9a7b1f9343 R08: 0000558a439e88d0 R09: 0000000000000000
[    6.342812] R10: 0000000000000005 R11: 0000000000000246 R12: 0000000000020000
[    6.342814] R13: 0000558a439e8aa0 R14: 0000558a439e89d0 R15: 0000558a439e8c00
[    6.342816]  </TASK>
[    6.342817] Modules linked in: intel_rapl_msr(+) wmi_bmof(+) sparse_keymap(+) pcc_cpufreq(-) fjes(-) acpi_cpufreq(-) gpio_keys pmt_class snd_pcm_dmaengine mac80211(+) kvm(+) snd_hda_intel irqbypass libarc4 snd_intel_dspcfg crct10dif_pclmul nvidia(POE+) btusb crc32_pclmul snd_intel_sdw_acpi ghash_clmulni_intel iwlwifi btrtl aesni_intel snd_hda_codec uvcvideo btbcm crypto_simd snd_hda_core iwlmei btintel videobuf2_vmalloc cryptd intel_cstate btmtk snd_hwdep processor_thermal_device_pci_legacy videobuf2_memops i915(+) intel_uncore videobuf2_v4l2 cfg80211 processor_thermal_device snd_pcm psmouse spi_intel_pci r8169 bluetooth pcspkr videobuf2_common drm_buddy processor_thermal_rfim spi_intel snd_timer ttm realtek processor_thermal_mbox snd videodev i2c_i801 mei_me mdio_devres drm_display_helper vfat tpm_crb processor_thermal_rapl ecdh_generic intel_lpss_pci soundcore i2c_smbus fat libphy cec intel_rapl_common rfkill mc tpm_tis intel_lpss mei int340x_thermal_zone idma64 intel_gtt i2c_hid_acpi
[    6.342846]  tpm_tis_core intel_vsec wmi intel_soc_dts_iosf i2c_hid tpm soc_button_array rng_core mac_hid int3400_thermal acpi_pad acpi_tad acpi_thermal_rel video crypto_user fuse bpf_preload ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 serio_raw atkbd libps2 vivaldi_fmap nvme xhci_pci crc32c_intel nvme_core i8042 xhci_pci_renesas serio
[    6.342893] R10: 0000000000000003 R11: ffffffffab4cb428 R12: 0000000000000000
[    6.342895] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[    6.342896] FS:  00007f9a7ace64c0(0000) GS:ffffa0612fb40000(0000) knlGS:0000000000000000
[    6.342898] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    6.342900] CR2: 00007f9a78f81000 CR3: 0000000100b36002 CR4: 0000000000f70ee0
[    6.342901] PKRU: 55555554
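
When the log is long, it can help to cut it down to the interesting block before posting. A minimal sketch, assuming the oops is bracketed by the usual "cut here" marker and the closing </TASK> line as in the excerpt above; `extract_oops` is a hypothetical helper name, and it reads the log on stdin:

```shell
# Print the "Missing ENDBR" trap line plus everything from the
# "cut here" marker through the end of the call trace.
# Reads a kernel log on stdin; other lines are dropped.
extract_oops() {
    sed -n -e '/Missing ENDBR/p' -e '/\[ cut here ]/,/<\/TASK>/p'
}

# Possible live usage (dmesg may need root, depending on the system):
#   dmesg | extract_oops
```

The register dumps that follow the call trace are not captured by this range; widening the end pattern would be needed to include them.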

Driver:

[miro@katana ~]$ pacman -Ss nvidia | grep installed
extra/egl-wayland 2:1.1.10-1 [installed]
extra/ffnvcodec-headers 11.1.5.1-2 [installed]
extra/libvdpau 1.5-1 [installed]
extra/nvidia 515.65.01-8 [installed]
extra/nvidia-utils 515.65.01-2 [installed]
community/nvtop 2.0.2-1 [installed]

Hardware: MSI Katana GF76 11UC

After closing the lid, the fans kept running for 10 minutes; I reopened it and:

[root@katana miro]# shutdown now
Failed to power off system via logind: There's already a shutdown or sleep operation in progress

Sounds like some udev voodoo to me, but I know nothing about that.


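Recently, some error messages (which I unfortunately discarded) mentioned nvidia-sleep.sh hanging and thereby blocking the system's power state transitions. On inspection, the VT switching seems to be the culprit. Here is the file: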

#!/bin/bash

if [ ! -f /proc/driver/nvidia/suspend ]; then
    exit 0
fi

RUN_DIR="/var/run/nvidia-sleep"
XORG_VT_FILE="${RUN_DIR}"/Xorg.vt_number

PATH="/bin:/usr/bin"

case "$1" in
    suspend|hibernate)
        mkdir -p "${RUN_DIR}"
        fgconsole > "${XORG_VT_FILE}"
        chvt 63
        if [[ $? -ne 0 ]]; then
            exit $?
        fi
        echo "$1" > /proc/driver/nvidia/suspend
        exit $?
        ;;
    resume)
        echo "$1" > /proc/driver/nvidia/suspend 
        #
        # Check if Xorg was determined to be running at the time
        # of suspend, and whether its VT was recorded.  If so,
        # attempt to switch back to this VT.
        #
        if [[ -f "${XORG_VT_FILE}" ]]; then
            XORG_PID=$(cat "${XORG_VT_FILE}")
            rm "${XORG_VT_FILE}"
            chvt "${XORG_PID}"
        fi
        exit 0
        ;;
    *)
        exit 1
esac

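Answer 1

Multiple instances of nvidia-sleep.sh resume were leaving the system unable to do power management (sudo shutdown now fails). Since this feels beyond my depth, I implemented the following brutal "temporary" workaround: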

resume)
++    exit 0

I have only tested it over about 30 sleep cycles (with no reboot in between): for those, it worked flawlessly.
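
To check whether resume instances are still piling up, the process list can be counted. A minimal sketch, assuming a procps-style ps; `count_resume_instances` is a hypothetical helper name, and it reads the listing on stdin so it can be tried on canned input first:

```shell
# Count lines of a process listing that mention "nvidia-sleep.sh resume".
# grep -c prints 0 and returns non-zero when nothing matches, so the
# exit status is normalised with "|| true".
count_resume_instances() {
    grep -c 'nvidia-sleep\.sh resume' || true
}

# Possible live usage:
#   ps -eo pid=,args= | count_resume_instances
```

A count above zero long after the machine has woken up would suggest that a resume call is still blocked.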

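In any case, what is the point? Why switch back to Xorg's VT? What if I have Xorg instances on multiple VTs? What if I am working on a console VT and want to wake up where I left off? It makes no sense to me.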
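
A less drastic variant than returning early might be to keep the VT restore but bound it in time. This is only a sketch under assumptions: `restore_vt`, `VT_RESTORE_TIMEOUT` and the `CHVT` override are hypothetical names that are not part of NVIDIA's script, and it is untested whether a blocked chvt actually reacts to the signal that timeout sends.

```shell
# Hypothetical helper: switch back to the VT recorded at suspend time,
# but give up after a time limit instead of blocking the resume.
# Plain POSIX sh has no "local", so vt_file and vt are global here.
restore_vt() {
    vt_file=$1
    [ -f "$vt_file" ] || return 0        # nothing recorded, nothing to do
    vt=$(cat "$vt_file")
    rm -f "$vt_file"
    case $vt in
        ''|*[!0-9]*) return 0 ;;         # ignore an empty or garbled record
    esac
    # timeout(1) from coreutils exits with status 124 when the limit is hit.
    # CHVT exists only so the function can be dry-run with another command.
    timeout "${VT_RESTORE_TIMEOUT:-5}" "${CHVT:-chvt}" "$vt"
}

# Possible use inside the resume branch:
#   restore_vt "${XORG_VT_FILE}"
```

A dry run such as `CHVT=echo; restore_vt /path/to/file` prints the recorded VT number instead of switching to it.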
