openSUSE Tumbleweed system hard-locks due to NVIDIA driver


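After a recent update, my laptop began crashing at random within a few hours of booting. When it crashes, the last image stays on my display, but the machine is completely unresponsive (the Num Lock light does not update). I am running openSUSE Tumbleweed with kernel 5.12.2-1 and NVIDIA driver 460.73.01 on a ThinkPad P71 with a Quadro M620 mobile GPU and an i7-7700HQ.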

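In a possibly unrelated issue, the network interface keeps going up and down every few seconds. On boots that end in a crash, there are no journalctl entries for many minutes before the crash other than those related to the network interface flapping. I use the internal Ethernet card through the official docking station, managed by NetworkManager. Here is a sample of journalctl output from before a crash; note that the same sequence repeats in the log for hours: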

May 12 07:10:31 thiccboii nscd[1153]: 1153 checking for monitored file `/etc/services': No such file or directory
May 12 07:10:32 thiccboii NetworkManager[1332]: <info>  [1620828632.1084] device (enp0s31f6): carrier: link connected
May 12 07:10:32 thiccboii NetworkManager[1332]: <info>  [1620828632.1086] device (enp0s31f6): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
May 12 07:10:32 thiccboii NetworkManager[1332]: <info>  [1620828632.1094] policy: auto-activating connection 'Home Ethernet' (e01921b8-0157-3627-bf0c-bbda6a033ae9)
May 12 07:10:32 thiccboii NetworkManager[1332]: <info>  [1620828632.1099] device (enp0s31f6): Activation: starting connection 'Home Ethernet' (e01921b8-0157-3627-bf0c-bbda6a033ae9)
May 12 07:10:32 thiccboii NetworkManager[1332]: <info>  [1620828632.1101] device (enp0s31f6): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
May 12 07:10:32 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
May 12 07:10:32 thiccboii NetworkManager[1332]: <info>  [1620828632.2154] device (enp0s31f6): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
May 12 07:10:32 thiccboii NetworkManager[1332]: <info>  [1620828632.2221] device (enp0s31f6): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
May 12 07:10:32 thiccboii NetworkManager[1332]: <info>  [1620828632.2225] dhcp4 (enp0s31f6): activation: beginning transaction (timeout in 45 seconds)
May 12 07:10:32 thiccboii NetworkManager[1332]: <info>  [1620828632.2238] dhcp4 (enp0s31f6): dhclient started with pid 27289
May 12 07:10:38 thiccboii NetworkManager[1332]: <info>  [1620828638.2172] device (enp0s31f6): state change: ip-config -> unavailable (reason 'carrier-changed', sys-iface-state: 'managed')
May 12 07:10:38 thiccboii NetworkManager[1332]: <info>  [1620828638.2497] dhcp4 (enp0s31f6): canceled DHCP transaction, DHCP client pid 27289
May 12 07:10:38 thiccboii NetworkManager[1332]: <info>  [1620828638.2497] dhcp4 (enp0s31f6): state changed unknown -> terminated
May 12 07:10:39 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Half Duplex, Flow Control: None
May 12 07:10:39 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down
May 12 07:10:43 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
May 12 07:10:43 thiccboii NetworkManager[1332]: <info>  [1620828643.8755] device (enp0s31f6): carrier: link connected
May 12 07:10:43 thiccboii NetworkManager[1332]: <info>  [1620828643.8758] device (enp0s31f6): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
May 12 07:10:43 thiccboii NetworkManager[1332]: <info>  [1620828643.8765] policy: auto-activating connection 'Home Ethernet' (e01921b8-0157-3627-bf0c-bbda6a033ae9)
May 12 07:10:43 thiccboii NetworkManager[1332]: <info>  [1620828643.8770] device (enp0s31f6): Activation: starting connection 'Home Ethernet' (e01921b8-0157-3627-bf0c-bbda6a033ae9)
May 12 07:10:43 thiccboii NetworkManager[1332]: <info>  [1620828643.8771] device (enp0s31f6): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
May 12 07:10:43 thiccboii NetworkManager[1332]: <info>  [1620828643.9856] device (enp0s31f6): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
May 12 07:10:43 thiccboii NetworkManager[1332]: <info>  [1620828643.9943] device (enp0s31f6): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
May 12 07:10:43 thiccboii NetworkManager[1332]: <info>  [1620828643.9946] dhcp4 (enp0s31f6): activation: beginning transaction (timeout in 45 seconds)
May 12 07:10:43 thiccboii NetworkManager[1332]: <info>  [1620828643.9959] dhcp4 (enp0s31f6): dhclient started with pid 27301
May 12 07:10:49 thiccboii NetworkManager[1332]: <info>  [1620828649.9870] device (enp0s31f6): state change: ip-config -> unavailable (reason 'carrier-changed', sys-iface-state: 'managed')
May 12 07:10:50 thiccboii NetworkManager[1332]: <info>  [1620828650.0196] dhcp4 (enp0s31f6): canceled DHCP transaction, DHCP client pid 27301
May 12 07:10:50 thiccboii NetworkManager[1332]: <info>  [1620828650.0196] dhcp4 (enp0s31f6): state changed unknown -> terminated
May 12 07:10:50 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Half Duplex, Flow Control: None
May 12 07:10:50 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down
May 12 07:10:52 thiccboii nscd[1153]: 1153 checking for monitored file `/etc/services': No such file or directory
May 12 07:10:55 thiccboii NetworkManager[1332]: <info>  [1620828655.5960] device (enp0s31f6): carrier: link connected
May 12 07:10:55 thiccboii NetworkManager[1332]: <info>  [1620828655.5964] device (enp0s31f6): state change: unavailable -> disconnected (reason 'carrier-changed', sys-iface-state: 'managed')
May 12 07:10:55 thiccboii NetworkManager[1332]: <info>  [1620828655.5971] policy: auto-activating connection 'Home Ethernet' (e01921b8-0157-3627-bf0c-bbda6a033ae9)
May 12 07:10:55 thiccboii NetworkManager[1332]: <info>  [1620828655.5976] device (enp0s31f6): Activation: starting connection 'Home Ethernet' (e01921b8-0157-3627-bf0c-bbda6a033ae9)
May 12 07:10:55 thiccboii NetworkManager[1332]: <info>  [1620828655.5978] device (enp0s31f6): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'managed')
May 12 07:10:55 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
May 12 07:10:55 thiccboii NetworkManager[1332]: <info>  [1620828655.7096] device (enp0s31f6): state change: prepare -> config (reason 'none', sys-iface-state: 'managed')
May 12 07:10:55 thiccboii NetworkManager[1332]: <info>  [1620828655.7164] device (enp0s31f6): state change: config -> ip-config (reason 'none', sys-iface-state: 'managed')
May 12 07:10:55 thiccboii NetworkManager[1332]: <info>  [1620828655.7168] dhcp4 (enp0s31f6): activation: beginning transaction (timeout in 45 seconds)
May 12 07:10:55 thiccboii NetworkManager[1332]: <info>  [1620828655.7181] dhcp4 (enp0s31f6): dhclient started with pid 27309
May 12 07:11:01 thiccboii NetworkManager[1332]: <info>  [1620828661.7110] device (enp0s31f6): state change: ip-config -> unavailable (reason 'carrier-changed', sys-iface-state: 'managed')
May 12 07:11:01 thiccboii NetworkManager[1332]: <info>  [1620828661.7434] dhcp4 (enp0s31f6): canceled DHCP transaction, DHCP client pid 27309
May 12 07:11:01 thiccboii NetworkManager[1332]: <info>  [1620828661.7435] dhcp4 (enp0s31f6): state changed unknown -> terminated
May 12 07:11:03 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Half Duplex, Flow Control: None
May 12 07:11:03 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down
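
Since the same link up/down sequence repeats for hours, it can help to reduce the log to a few numbers before comparing boots. The following is a minimal sketch, not a definitive tool: it assumes the kernel messages keep the exact `e1000e ... NIC Link is Up/Down` wording shown in the excerpt above, and the function name `summarize_link_events` is made up for illustration.

```python
import re
from collections import Counter

# Matches e1000e kernel messages in the form shown in the excerpt above
# (assumption: the driver keeps this exact wording on other kernels).
LINK_RE = re.compile(
    r"e1000e \S+ (?P<iface>\S+): NIC Link is (?P<state>Up|Down)"
    r"(?: (?P<speed>\d+) Mbps (?P<duplex>Full|Half) Duplex)?"
)


def summarize_link_events(lines):
    """Count Up/Down messages and half-duplex negotiations per interface."""
    counts = Counter()
    for line in lines:
        match = LINK_RE.search(line)
        if match is None:
            continue  # not an e1000e link message
        iface = match.group("iface")
        counts[(iface, match.group("state"))] += 1
        # A gigabit link reported as half duplex is unusual enough to tally separately.
        if match.group("duplex") == "Half":
            counts[(iface, "HalfDuplex")] += 1
    return counts
```

One way to use it would be to save the kernel log of the previous boot with `journalctl -k -b -1 > prev-boot.log` and pass the opened file to `summarize_link_events`. A large count of half-duplex negotiations right before each "Link is Down" would match the pattern in the excerpt, though the counts alone do not say whether the cable, the dock, or the driver is responsible.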

Additionally, during a normal shutdown, journalctl -k shows this potentially interesting warning:

May 13 18:07:45 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down
May 13 18:07:49 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
May 13 18:07:57 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Half Duplex, Flow Control: None
May 13 18:07:57 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down
May 13 18:08:01 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
May 13 18:08:09 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Half Duplex, Flow Control: None
May 13 18:08:09 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down
May 13 18:08:13 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: Rx/Tx
May 13 18:08:19 thiccboii kernel: ------------[ cut here ]------------
May 13 18:08:19 thiccboii kernel: WARNING: CPU: 6 PID: 16754 at /usr/src/kernel-modules/nvidia-460.73.01-default/nvidia-drm/nvidia-drm-drv.c:531 nv_drm_master_set+0x22/0x30 [nvidia_drm]
May 13 18:08:19 thiccboii kernel: Modules linked in: rfcomm xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT xt_tcpudp nf_nat_tftp nf_conntrack_tftp bridge stp llc nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast ccm nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct af_packet nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security nvidia_drm(POE) nvidia_modeset(POE) ip_set nfnetlink ebtable_filter ebtables nvidia_uvm(POE) ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter nvidia(POE) cmac algif_hash algif_skcipher af_alg bnep dmi_sysfs uas snd_usb_audio snd_usbmidi_lib usb_storage snd_rawmidi snd_seq_device btusb btrtl btbcm btintel bluetooth uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev ecdh_generic mc ecc
May 13 18:08:19 thiccboii kernel:  snd_hda_codec_hdmi intel_rapl_msr intel_rapl_common iwlmvm snd_hda_codec_realtek mac80211 snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi libarc4 snd_hda_codec ee1004 iTCO_wdt intel_pmc_bxt iTCO_vendor_support mei_hdcp snd_hda_core iwlwifi x86_pkg_temp_thermal snd_hwdep intel_powerclamp coretemp thinkpad_acpi pcspkr cfg80211 joydev platform_profile efi_pstore snd_pcm wmi_bmof intel_wmi_thunderbolt i2c_i801 mei_me intel_lpss_pci ledtrig_audio intel_lpss rfkill snd_timer i2c_smbus mei idma64 intel_pch_thermal thermal snd soundcore ac tiny_power_button acpi_pad nls_iso8859_1 nls_cp437 vfat fat fuse binfmt_misc configfs hid_generic usbhid i915 kvm_intel kvm rtsx_pci_sdmmc crct10dif_pclmul crc32_pclmul mmc_core ghash_clmulni_intel aesni_intel i2c_algo_bit e1000e(OE) drm_kms_helper crypto_simd cryptd syscopyarea sysfillrect sysimgblt fb_sys_fops xhci_pci cec xhci_pci_renesas xhci_hcd rc_core rtsx_pci drm nvme serio_raw usbcore nvme_core wmi battery
May 13 18:08:19 thiccboii kernel:  i2c_hid_acpi i2c_hid video pinctrl_sunrisepoint button vfio_mdev mdev vhost_net tun tap vhost vhost_iotlb vfio_pci vfio_virqfd irqbypass vfio_iommu_type1 vfio btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr bbswitch(O) efivarfs
May 13 18:08:19 thiccboii kernel: CPU: 6 PID: 16754 Comm: plymouthd Tainted: P     U     OE     5.12.0-2-default #1 openSUSE Tumbleweed
May 13 18:08:19 thiccboii kernel: Hardware name: LENOVO 20HK0013US/20HK0013US, BIOS N1TET56W (1.30 ) 02/10/2020
May 13 18:08:19 thiccboii kernel: RIP: 0010:nv_drm_master_set+0x22/0x30 [nvidia_drm]
May 13 18:08:19 thiccboii kernel: Code: f4 2c 44 d7 0f 1f 40 00 0f 1f 44 00 00 48 8b 47 38 48 8b 78 20 48 8b 05 9c 5c 00 00 48 8b 40 28 e8 d3 9f 7e d7 84 c0 74 01 c3 <0f> 0b c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 80 3d 7c
May 13 18:08:19 thiccboii kernel: RSP: 0018:ffffb41680933bd0 EFLAGS: 00010246
May 13 18:08:19 thiccboii kernel: RAX: 0000000000000000 RBX: ffff999cd278d000 RCX: 0000000000000008
May 13 18:08:19 thiccboii kernel: RDX: ffffffffc37a7e58 RSI: 0000000000000292 RDI: ffffffffc37a7e20
May 13 18:08:19 thiccboii kernel: RBP: ffff999f567c19c0 R08: 0000000000000008 R09: ffffb41680933bb8
May 13 18:08:19 thiccboii kernel: R10: 0000000000000000 R11: 0000000000000000 R12: ffff999c0cb25800
May 13 18:08:19 thiccboii kernel: R13: 0000000000000000 R14: ffff999c0cb25800 R15: 000000001370a9a8
May 13 18:08:19 thiccboii kernel: FS:  00007fd15d540740(0000) GS:ffff99a577580000(0000) knlGS:0000000000000000
May 13 18:08:19 thiccboii kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 13 18:08:19 thiccboii kernel: CR2: 00007fd15d8fa000 CR3: 000000017f3d2001 CR4: 00000000003706e0
May 13 18:08:19 thiccboii kernel: Call Trace:
May 13 18:08:19 thiccboii kernel:  drm_new_set_master+0x7a/0x100 [drm]
May 13 18:08:19 thiccboii kernel:  drm_master_open+0x68/0x90 [drm]
May 13 18:08:19 thiccboii kernel:  drm_open+0xf5/0x240 [drm]
May 13 18:08:19 thiccboii kernel:  drm_stub_open+0xab/0x130 [drm]
May 13 18:08:19 thiccboii kernel:  chrdev_open+0xed/0x210
May 13 18:08:19 thiccboii kernel:  ? cdev_device_add+0x90/0x90
May 13 18:08:19 thiccboii kernel:  do_dentry_open+0x14e/0x380
May 13 18:08:19 thiccboii kernel:  path_openat+0xaf6/0x10a0
May 13 18:08:19 thiccboii kernel:  ? release_pages+0x153/0x4a0
May 13 18:08:19 thiccboii kernel:  ? flush_tlb_func_common.constprop.0+0x93/0x1e0
May 13 18:08:19 thiccboii kernel:  ? free_unref_page+0x99/0xb0
May 13 18:08:19 thiccboii kernel:  do_filp_open+0x99/0x140
May 13 18:08:19 thiccboii kernel:  ? __check_object_size+0x136/0x150
May 13 18:08:19 thiccboii kernel:  do_sys_openat2+0x97/0x150
May 13 18:08:19 thiccboii kernel:  __x64_sys_openat+0x54/0x90
May 13 18:08:19 thiccboii kernel:  do_syscall_64+0x33/0x80
May 13 18:08:19 thiccboii kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
May 13 18:08:19 thiccboii kernel: RIP: 0033:0x7fd15d7cbffb
May 13 18:08:19 thiccboii kernel: Code: 25 00 00 41 00 3d 00 00 41 00 74 4b 64 8b 04 25 18 00 00 00 85 c0 75 67 44 89 e2 48 89 ee bf 9c ff ff ff b8 01 01 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 91 00 00 00 48 8b 4c 24 28 64 48 2b 0c 25
May 13 18:08:19 thiccboii kernel: RSP: 002b:00007ffd4e3fa1b0 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
May 13 18:08:19 thiccboii kernel: RAX: ffffffffffffffda RBX: 00007fd15d5406c8 RCX: 00007fd15d7cbffb
May 13 18:08:19 thiccboii kernel: RDX: 0000000000000002 RSI: 000056549ac3d730 RDI: 00000000ffffff9c
May 13 18:08:19 thiccboii kernel: RBP: 000056549ac3d730 R08: 000056549ac3c930 R09: 00007fd15d89ea60
May 13 18:08:19 thiccboii kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
May 13 18:08:19 thiccboii kernel: R13: 00007fd15d8c5da8 R14: 0000000000000000 R15: 000056549ac3d080
May 13 18:08:19 thiccboii kernel: ---[ end trace 24fb17530164c622 ]---
May 13 18:08:19 thiccboii kernel: usb 1-4.3.1: reset high-speed USB device number 13 using xhci_hcd
May 13 18:08:21 thiccboii kernel: wlp4s0: deauthenticating from 3c:37:86:14:73:fa by local choice (Reason: 3=DEAUTH_LEAVING)
May 13 18:08:21 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Up 1000 Mbps Half Duplex, Flow Control: None
May 13 18:08:21 thiccboii kernel: e1000e 0000:00:1f.6 enp0s31f6: NIC Link is Down
May 13 18:08:23 thiccboii kernel: kauditd_printk_skb: 44 callbacks suppressed
May 13 18:08:23 thiccboii kernel: audit: type=1305 audit(1620954503.216:16943): op=set audit_pid=0 old=1577 auid=4294967295 ses=4294967295 subj==unconfined res=1
May 13 18:08:23 thiccboii kernel: audit: type=1131 audit(1620954503.216:16944): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=auditd comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
May 13 18:08:23 thiccboii kernel: audit: type=1131 audit(1620954503.216:16945): pid=1 uid=0 auid=4294967295 ses=4294967295 subj==unconfined msg='unit=systemd-tmpfiles-setup comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
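
When checking whether this warning also shows up on other boots, the long module list and register dump make the trace hard to scan by eye. The sketch below pulls out only the identifying fields (source file, line, function, module, and the process named in the `Comm:` line). It assumes the `WARNING: CPU: ... PID: ... at file:line func+0x.../0x... [module]` layout seen in the trace above; `extract_warnings` is a hypothetical helper name.

```python
import re

# Header line of a kernel WARNING, in the layout shown in the trace above.
WARN_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at (?P<src>\S+):(?P<line>\d+) "
    r"(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(?P<module>\w+)\])?"
)
# The later "CPU: ... Comm: <process>" line names the process that hit the warning.
COMM_RE = re.compile(r"Comm: (?P<comm>\S+)")


def extract_warnings(lines):
    """Return one dict per kernel WARNING found in the given log lines."""
    warnings = []
    for line in lines:
        header = WARN_RE.search(line)
        if header:
            warnings.append(header.groupdict())
            continue
        comm = COMM_RE.search(line)
        # Attach the first Comm: line that follows a header to that warning.
        if comm and warnings and "comm" not in warnings[-1]:
            warnings[-1]["comm"] = comm.group("comm")
    return warnings
```

Applied to the excerpt above, this would report `nv_drm_master_set` in `nvidia_drm`, triggered by `plymouthd`. Note that the trace is a WARNING logged during shutdown rather than an oops or panic, so by itself it does not show that this code path is what hard-locks the machine.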

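Please let me know if I should ask somewhere else. I have already run a RAM test, and I don't think this is one of the CPUs with known C-state issues, since I have been using this machine for over a year. Troubleshooting is painful because a crash takes anywhere from two to twenty hours to occur, but I'd like to know if there is anything else I should try.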

Answer 1

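I have been seeing something similar over the past week. It may have started with kernel-default-5.12.13-1 and NVIDIA driver 460.84, but it did not appear immediately after installing them, so it could be related to other updates (Plasma, Chrome, etc.). It has continued on kernel-default 5.13.0-1.1. It has happened three times, on a desktop that had been stable for quite a long time.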

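Chrome triggered something similar for me a few years ago. I have turned off the GPU acceleration option under Advanced settings in Google chrome-beta 92.0.4515.80-1. So far I have not seen another lockup. However, I am now also on kernel-default 5.13.0-1.2 and chrome-beta 92.0.4515.93-1, so those may have changed the situation.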

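I would normally raise this on the NVIDIA forums (I have found their support staff very helpful in the past). But I am hesitant to do so until I see a pattern or something interesting in the journal or in /var/log/Xorg.0.log. If you have a /var/log/Xorg.0.log from a recent crash, it may contain some clues.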
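
To look for a pattern in /var/log/Xorg.0.log across several crashes, one option is to pull out only the lines that the X server marks as errors `(EE)` or warnings `(WW)`. This is a rough sketch under the assumption that each entry starts with an optional `[ seconds ]` timestamp followed by the marker; `xorg_problems` is a name invented for this example.

```python
import re

# An Xorg log entry: optional "[  12.345] " timestamp, then a two-letter marker.
# Only error (EE) and warning (WW) markers are collected here.
MARKER_RE = re.compile(r"^(?:\[\s*\d+\.\d+\]\s+)?\((EE|WW)\)")


def xorg_problems(lines):
    """Group Xorg log lines by marker: errors under "EE", warnings under "WW"."""
    found = {"EE": [], "WW": []}
    for line in lines:
        match = MARKER_RE.match(line)
        if match:
            found[match.group(1)].append(line.rstrip("\n"))
    return found
```

Running it on the log kept from each crashed session and comparing the `EE` lists would show whether the same message precedes every lockup. Because the machine freezes hard, the final lines may never reach the disk, so an empty result does not rule out the X server or the driver.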
