RTX 2070 + Fedora 31: nvidia_modeset error freezes the computer before login

I am running into an error on the following Linux system, which I describe below:

Fedora 31, kernel 5.5.17-200.fc31.x86_64
GPU: ASUS RTX 2070 SUPER, driver 440.82, plus CUDA 10.1 and cuDNN 7.6.5
CPU: i7 8700
RAM: 32 GB (2 DIMMs)
Motherboard: Aorus Pro

My RTX 2070 runs headless: my monitor is connected to the onboard video output.

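The GPU works perfectly: I can train neural networks with TensorFlow GPU and play games on it. I don't think there is anything wrong with my hardware; it has never frozen once while in use, and temperatures have always been fine. The problem is that roughly one boot in every 3 or 4, the computer freezes before reaching the desktop. The keyboard stops responding: I cannot switch to a different TTY, and the Caps Lock light does not come on when I press the key. All I can do is hold the power button for 7 seconds to force the computer off.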

When this happens, the next time I log in I see an error message in Fedora's problem reporting:

A kernel problem occurred, but your kernel has been tainted (flags:POE). Explanation:
P - Proprietary module has been loaded.
O - Out-of-tree module has been loaded.
E - Unsigned module has been loaded.
Kernel maintainers are unable to diagnose tainted reports. Tainted modules: nvidia_drm,nvidia_modeset,nvidia.
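For anyone reading the flag letters, here is a minimal sketch that spells them out. The table covers only the letters that appear in this report (P, O and E in the message above, plus the W shown in the "Tainted:" field of the trace header); the kernel defines more flags than these, and the function name `decode_taint` is just an illustrative choice.

```python
# Meanings of the single-letter taint flags that appear in this report.
# Only the letters seen here are listed; the kernel defines others.
TAINT_FLAGS = {
    "P": "proprietary module was loaded",
    "O": "out-of-tree module was loaded",
    "E": "unsigned module was loaded",
    "W": "a kernel warning was issued earlier",
}


def decode_taint(flags):
    """Return one description per flag letter, ignoring padding spaces."""
    return [
        f"{letter}: {TAINT_FLAGS.get(letter, 'not in this table')}"
        for letter in flags
        if not letter.isspace()
    ]


if __name__ == "__main__":
    # Flag string copied from the "flags:POE" part of the message.
    for line in decode_taint("POE"):
        print(line)
```

All three flags in the message are set by the NVIDIA modules themselves, so they describe the driver being loaded rather than the failure.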

If I click "Details", I see the full report:

reason   WARNING: CPU: 2 PID: 820451 at mm/vmalloc.c:2282 __vunmap+0x1e9/0x210 [nvidia_modeset]

WARNING: CPU: 2 PID: 820451 at mm/vmalloc.c:2282 __vunmap+0x1e9/0x210
Modules linked in: nvidia_uvm(OE) ipmi_devintf rfcomm ccm xt_CHECKSUM xt_MASQUERADE nf_nat_tftp nf_conntrack_tftp tun bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast xt_CT ip6t_REJECT nf_reject_ipv6 ip6t_rpfilter ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat iptable_mangle iptable_raw iptable_security nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter cmac bnep sunrpc vfat fat intel_rapl_msr intel_rapl_common snd_sof_pci snd_sof_intel_byt snd_sof_intel_ipc snd_sof_xtensa_dsp snd_sof_intel_hda_common snd_soc_hdac_hda snd_sof_intel_hda x86_pkg_temp_thermal intel_powerclamp snd_sof coretemp snd_soc_skl kvm_intel snd_soc_sst_ipc snd_soc_sst_dsp snd_hda_ext_core snd_soc_acpi_intel_match snd_hda_codec_hdmi snd_soc_acpi kvm snd_hda_codec_realtek ucsi_ccg snd_hda_codec_generic typec_ucsi
 irqbypass iTCO_wdt ledtrig_audio iTCO_vendor_support iwlmvm snd_soc_core mei_hdcp typec snd_compress ac97_bus snd_pcm_dmaengine snd_hda_intel snd_intel_dspcfg crct10dif_pclmul mac80211 crc32_pclmul snd_usb_audio snd_hda_codec libarc4 ghash_clmulni_intel snd_hda_core snd_usbmidi_lib intel_cstate intel_uncore iwlwifi intel_rapl_perf btusb snd_rawmidi snd_hwdep btrtl snd_seq btbcm pcspkr btintel wmi_bmof intel_wmi_thunderbolt snd_seq_device i2c_i801 cfg80211 bluetooth joydev snd_pcm mc snd_timer snd mei_me ecdh_generic ecc mei rfkill soundcore i2c_nvidia_gpu ie31200_edac intel_pch_thermal acpi_pad ip_tables nvidia_drm(POE) nvidia_modeset(POE) nvidia(POE) i915 ipmi_msghandler i2c_algo_bit mxm_wmi drm_kms_helper e1000e crc32c_intel nvme drm nvme_core wmi video pinctrl_cannonlake pinctrl_intel fuse
CPU: 2 PID: 820451 Comm: kworker/u24:6 Tainted: P        W  OE     5.5.17-200.fc31.x86_64 #1
Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS PRO WIFI/Z390 AORUS PRO WIFI-CF, BIOS F10 06/05/2019
Workqueue: events_unbound async_run_entry_fn
RIP: 0010:__vunmap+0x1e9/0x210
Code: 41 5d 41 5e e9 78 37 03 00 31 d2 31 f6 48 c7 c7 ff ff ff ff e8 d8 fc ff ff eb b5 48 89 fe 48 c7 c7 e8 36 39 9c e8 f9 51 e3 ff <0f> 0b 5b 5d 41 5c 41 5d 41 5e c3 4c 89 e6 48 c7 c7 10 37 39 9c e8
RSP: 0000:ffffb9e60c6afcd8 EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff9372771d9008 RCX: 0000000000000007
RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffff93729e299cc0
RBP: 0000000000000a20 R08: 00007cae180e423a R09: ffffffff9d25fc64
R10: 000000000000061e R11: 000000000001f074 R12: ffff936cb80e9a20
R13: 0000000000000004 R14: ffff9372771da008 R15: ffff9372771d9008
FS:  0000000000000000(0000) GS:ffff93729e280000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000007cb60a001 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 _nv002417kms+0xea/0x150 [nvidia_modeset]
 ? _nv000328kms+0x2d/0x1d0 [nvidia_modeset]
 ? _nv002227kms+0x2d5/0x6d0 [nvidia_modeset]
 ? nvKmsResume+0x43/0x80 [nvidia_modeset]
 ? nvkms_resume+0x1b/0x40 [nvidia_modeset]
 ? nvidia_resume+0x67/0x70 [nvidia]
 ? pci_pm_thaw+0x80/0x80
 ? nv_pmops_resume+0xf/0x20 [nvidia]
 ? dpm_run_callback+0x4f/0x140
 ? device_resume+0x136/0x200
 ? async_resume+0x19/0x50
 ? async_run_entry_fn+0x39/0x160
 ? process_one_work+0x1b4/0x370
 ? worker_thread+0x50/0x3c0
 ? kthread+0xf9/0x130
 ? process_one_work+0x370/0x370
 ? kthread_park+0x90/0x90
 ? ret_from_fork+0x35/0x40

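I don't really understand what these reports are saying, but it looks like the Nvidia driver module is failing and that something tried to allocate memory (malloc) and failed.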
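To make the trace a little easier to read, here is a rough sketch that pulls the symbol and module out of each call-trace line. The regular expression and the helper name `parse_frames` are my own assumptions about the line layout shown above, not an official format parser. Frames prefixed with "?" are ones the kernel's stack unwinder could not verify. Judging only by the symbol names (`nvKmsResume`, `nvidia_resume`, `pci_pm_thaw`, `dpm_run_callback`), the warning seems to be raised on a power-management resume path, and `__vunmap` is a routine for releasing a mapping rather than allocating one, though I may be misreading it.

```python
import re

# Matches lines such as " ? nvKmsResume+0x43/0x80 [nvidia_modeset]".
# A leading "?" marks a frame the unwinder could not verify.
FRAME_RE = re.compile(
    r"^\s*(?P<unreliable>\?\s+)?(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>\w+)\])?\s*$"
)


def parse_frames(trace):
    """Return (symbol, module, reliable) tuples for each call-trace line."""
    frames = []
    for line in trace.splitlines():
        match = FRAME_RE.match(line)
        if match:
            frames.append((
                match.group("symbol"),
                match.group("module") or "kernel",  # no [module] means core kernel
                match.group("unreliable") is None,
            ))
    return frames


# A few lines copied from the call trace in the report.
SAMPLE = """\
 _nv002417kms+0xea/0x150 [nvidia_modeset]
 ? nvKmsResume+0x43/0x80 [nvidia_modeset]
 ? nvidia_resume+0x67/0x70 [nvidia]
 ? pci_pm_thaw+0x80/0x80
 ? dpm_run_callback+0x4f/0x140
"""

if __name__ == "__main__":
    for symbol, module, reliable in parse_frames(SAMPLE):
        marker = "" if reliable else " (?)"
        print(f"{module:16} {symbol}{marker}")
```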

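Can you help me debug this error? Please let me know if there is any other information about my system, logs, or anything else relevant that I can share.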

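This only happens at boot. Once the system is up it runs flawlessly: I can train models for hours on end and play games, and there has never been a single crash or freeze while running.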

Thanks in advance!!

Edit 2: I just updated my Aorus Pro WIFI Z390 motherboard to the latest BIOS version, 12c. The error still occurs.

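Edit 1: Adding "nomodeset" to my kernel parameters makes the system freeze 100% of the time before reaching the graphical environment. Just in case, here is my grub configuration file (/etc/sysconfig/grub):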

GRUB_TIMEOUT=5
GRUB_DISTRIBUTOR="$(sed 's, release .*$,,g' /etc/system-release)"
GRUB_DEFAULT=saved
GRUB_DISABLE_SUBMENU=true
GRUB_TERMINAL_OUTPUT="console"
GRUB_CMDLINE_LINUX="resume=/dev/mapper/fedora_localhost--live-swap rd.lvm.lv=fedora_localhost-live/root rd.lvm.lv=fedora_localhost-live/swap rhgb quiet rd.driver.blacklist=nouveau"
GRUB_DISABLE_RECOVERY="true"
GRUB_ENABLE_BLSCFG=true
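As a quick way to see which kernel parameters this file sets, here is a small sketch that splits the `GRUB_CMDLINE_LINUX` value into a list. It assumes the value is a single double-quoted string on one line, as in the file above; the helper name `kernel_params` is hypothetical, and the sample text is embedded so the snippet runs without reading /etc/sysconfig/grub.

```python
def kernel_params(grub_text):
    """Return the parameters on the GRUB_CMDLINE_LINUX line as a list."""
    for line in grub_text.splitlines():
        if line.startswith("GRUB_CMDLINE_LINUX="):
            # Drop the variable name, then the surrounding double quotes.
            value = line.split("=", 1)[1].strip().strip('"')
            return value.split()
    return []


# Two lines copied from the configuration file shown above.
SAMPLE = (
    "GRUB_TIMEOUT=5\n"
    'GRUB_CMDLINE_LINUX="resume=/dev/mapper/fedora_localhost--live-swap '
    "rd.lvm.lv=fedora_localhost-live/root rd.lvm.lv=fedora_localhost-live/swap "
    'rhgb quiet rd.driver.blacklist=nouveau"\n'
)

if __name__ == "__main__":
    params = kernel_params(SAMPLE)
    print("nomodeset present:", "nomodeset" in params)
    print("resume target:", [p for p in params if p.startswith("resume=")])
```

With the file as posted, this shows that "nomodeset" is not currently set and that a `resume=` swap device is configured.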
