Is there any hope for my dGPU (Nvidia GTX 880M), which appears to be dead?

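Yesterday I started playing Swordsman (a fun game, by the way), and after a while it crashed. I had no overheating problem (far from it), my laptop is not overclocked, and my not-so-new GTX 880M had been running the game fine, but at that point it appeared to stop working. I shut the computer down and decided to look at it the next day.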

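The next day (today), I noticed the following:

  • The computer takes an unusually long time in the BIOS initialization phase (that is, before the operating system loads)
  • After the OS boots, the dGPU LED stays on even though the dGPU is not in use
  • Windows Device Manager complains that it cannot load the driver
  • On Linux, the dGPU can be powered off through an ACPI call (so the LED turns off), but trying to use it does not work at all

So I reset the BIOS to the so-called "optimized defaults", hoping it would help, but it does not appear to have.

On Linux, when I try to use the GTX 880M, I get the following kernel messages (everything before and after has been removed for clarity):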

[Aug22 16:11] pci 0000:01:00.0: [10de:1198] type 00 class 0x030000
[  +0.000036] pci 0000:01:00.0: reg 0x10: [mem 0xf6000000-0xf6ffffff]
[  +0.000017] pci 0000:01:00.0: reg 0x14: [mem 0xe0000000-0xefffffff 64bit pref]
[  +0.000016] pci 0000:01:00.0: reg 0x1c: [mem 0xf0000000-0xf1ffffff 64bit pref]
[  +0.000012] pci 0000:01:00.0: reg 0x24: [io  0xe000-0xe07f]
[  +0.000012] pci 0000:01:00.0: reg 0x30: [mem 0xf7000000-0xf707ffff pref]
[  +0.000052] pci 0000:01:00.0: Enabling HDA controller
[  +0.000087] pci 0000:01:00.0: 32.000 Gb/s available PCIe bandwidth, limited by 2.5 GT/s x16 link at 0000:00:01.0 (capable of 126.016 Gb/s with 8 GT/s x16 link)
[  +0.000486] pci 0000:01:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
[  +0.000011] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem
[  +0.000160] pci 0000:01:00.1: [10de:0e0a] type 00 class 0x040300
[  +0.000029] pci 0000:01:00.1: reg 0x10: [mem 0x00000000-0x00003fff]
[  +0.000057] pci 0000:01:00.1: Max Payload Size set to 256 (was 128, max 256)
[  +0.000382] pcieport 0000:00:01.0: ASPM: current common clock configuration is broken, reconfiguring
[  +0.009663] pci 0000:01:00.0: BAR 1: assigned [mem 0xe0000000-0xefffffff 64bit pref]
[  +0.000037] pci 0000:01:00.0: BAR 3: assigned [mem 0xf0000000-0xf1ffffff 64bit pref]
[  +0.000018] pci 0000:01:00.0: BAR 0: assigned [mem 0xf6000000-0xf6ffffff]
[  +0.000003] pci 0000:01:00.0: BAR 6: assigned [mem 0xf7000000-0xf707ffff pref]
[  +0.000002] pci 0000:01:00.1: BAR 0: assigned [mem 0xf7080000-0xf7083fff]
[  +0.000002] pci 0000:01:00.0: BAR 5: assigned [io  0xe000-0xe07f]
[  +0.000004] pcieport 0000:00:01.0: PCI bridge to [bus 01]
[  +0.000001] pcieport 0000:00:01.0:   bridge window [io  0xe000-0xefff]
[  +0.000003] pcieport 0000:00:01.0:   bridge window [mem 0xf6000000-0xf70fffff]
[  +0.000002] pcieport 0000:00:01.0:   bridge window [mem 0xe0000000-0xf1ffffff 64bit pref]
[  +0.000176] pci 0000:01:00.1: D0 power state depends on 0000:01:00.0
[  +0.000036] snd_hda_intel 0000:01:00.1: enabling device (0000 -> 0002)
[  +0.000057] snd_hda_intel 0000:01:00.1: Disabling MSI
[  +0.000006] snd_hda_intel 0000:01:00.1: Handle vga_switcheroo audio client
[  +0.041484] IPMI message handler: version 39.2
[  +0.016187] ipmi device interface
[  +0.704043] nvidia: module license 'NVIDIA' taints kernel.
[  +0.000001] Disabling lock debugging due to kernel taint
[  +0.012267] nvidia-nvlink: Nvlink Core is being initialized, major device number 237
[  +0.000320] nvidia 0000:01:00.0: enabling device (0006 -> 0007)
[  +0.000078] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[  +0.099429] NVRM: loading NVIDIA UNIX x86_64 Kernel Module  440.100  Fri May 29 08:45:51 UTC 2020
[  +0.055239] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  440.100  Fri May 29 08:14:04 UTC 2020
[  +0.002640] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[  +0.020768] ACPI Warning: \_SB.PCI0.PEG0.PEGP._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20190816/nsarguments-59)
[ +30.855545] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[  +0.000034] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +0.000492] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[  +0.000122] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
[  +0.071852] nvidia-uvm: Loaded the UVM driver, major device number 235.
[Aug22 16:14] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[  +3.384946] rfkill: input handler enabled
[  +9.980611] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000047] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.167869] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000075] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +0.099416] gnome-shell[6474]: segfault at 20 ip 00007f4d29a83356 sp 00007ffd43c62db0 error 4 in libnvidia-glsi.so.440.100[7f4d29a21000+95000]
[  +0.000004] Code: 48 8b 5c 24 08 48 8b 6c 24 10 48 83 c4 18 c3 0f 1f 44 00 00 48 8b 3d f1 31 24 00 e8 74 31 00 00 89 de 48 89 c7 e8 5a fe ff ff <48> 8b 78 20 e8 61 60 01 00 48 83 f8 01 48 89 45 00 19 c0 83 e0 0f
[  +4.409548] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[  +0.281982] rfkill: input handler disabled
[Aug22 16:15] rfkill: input handler enabled
[  +4.303838] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[  +0.355605] rfkill: input handler disabled
[  +8.056674] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000069] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.172351] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000030] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.171805] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000022] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.167969] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000022] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.171784] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000070] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.168094] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000023] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[Aug22 16:16] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000031] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.168161] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000025] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.171680] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000023] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.171964] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000039] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0

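The next thing I plan to do is unplug all power, remove the battery, and wait an hour, but I am out of ideas. Any suggestions besides praying?

Edit: My laptop is a Clevo P170SM, but I don't think Clevo is to blame. It may be my own fault: Windows froze after the game crashed, and instead of forcing a shutdown I waited for the OS to bug check and reboot on its own. It did, but only after a potentially fatal one-hour wait, and I suspect something burned out in the meantime.

The lesson I learned here: do not trust the operating system to protect your hardware. Most hardware protection is implemented by the device manufacturer in the BIOS, in drivers, or directly in hardware, and those measures sometimes fail.

The operating system is more concerned with protecting your data and keeping the system available. For example, file system journaling protects data, and on Windows, GPU Timeout Detection and Recovery (TDR) provides availability. This is a personal opinion, but even though it is done at the driver level, I consider file system journaling part of the operating system, since most operating systems (if not all) are designed around, and meant to be installed on, a very specific file system format. I suppose Windows could be installed on an ext4 file system, but some features would be missing. But I digress...

Answer 1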

[ +30.855545] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[  +0.000034] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +0.000492] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[  +0.000122] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device
[  +0.099416] gnome-shell[6474]: segfault at 20 ip 00007f4d29a83356 sp 00007ffd43c62db0 error 4 in libnvidia-glsi.so.440.100[7f4d29a21000+95000]

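Judging by the relevant log entries, the operating system is unable to initialize the display adapter. Taken together with the other information you provided, this is usually bad news. Windows' Event Viewer will probably show similar hardware errors.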
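To pull just those entries out of a long kernel log, something like the following rough sketch might help. It only tallies the `RmInitAdapter failed!` lines per device and status code. The sample text is copied from the log in the question; the names `SAMPLE_LOG`, `FAIL_RE`, and `tally_failures` are made up for this illustration, and the script does not try to interpret what the codes mean.

```python
import re
from collections import Counter

# Sample lines copied from the kernel log in the question.
SAMPLE_LOG = """\
[ +30.855545] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[  +0.000034] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +9.980611] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000047] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.167869] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
"""

# Captures the PCI address and the parenthesized status code.
FAIL_RE = re.compile(r"NVRM: GPU (\S+): RmInitAdapter failed! \(([^)]+)\)")


def tally_failures(log_text):
    """Count RmInitAdapter failures per (PCI address, status code) pair."""
    counts = Counter()
    for match in FAIL_RE.finditer(log_text):
        counts[(match.group(1), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    for (device, code), n in tally_failures(SAMPLE_LOG).most_common():
        print(f"{device}  {code}  x{n}")
```

On a real system you would feed it the saved kernel log instead of the sample string.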

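The left column of the log shows how much time elapsed since the previous message. Your PC takes so long to load because initializing the Nvidia hardware fails repeatedly. The first failure takes about 30 seconds, and each subsequent failure takes roughly 8 to 10 seconds.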
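If you want to read those delays off the log without eyeballing it, here is a minimal sketch. It assumes the log uses the bracketed relative-delta format seen in the question (it looks like what `dmesg --reltime` prints, but that is my guess), and the names `SAMPLE_LOG`, `DELTA_RE`, and `failure_delays` are invented for this example. Lines that carry an absolute timestamp such as `[Aug22 16:14]` have no delta and are simply skipped.

```python
import re

# Sample lines copied from the kernel log in the question.
SAMPLE_LOG = """\
[  +0.002640] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ +30.855545] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x24:0x65:1185)
[  +0.000034] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[Aug22 16:14] Lockdown: systemd-logind: hibernation is restricted; see man kernel_lockdown.7
[  +9.980611] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
[  +0.000047] NVRM: GPU 0000:01:00.0: rm_init_adapter failed, device minor number 0
[  +8.167869] NVRM: GPU 0000:01:00.0: RmInitAdapter failed! (0x26:0x65:1227)
"""

# Matches "[  +1.234567] message" and captures the delta and the message.
DELTA_RE = re.compile(r"^\[\s*\+(\d+\.\d+)\]\s+(.*)$")


def failure_delays(log_text):
    """Return the delta in seconds printed in front of each RmInitAdapter failure."""
    delays = []
    for line in log_text.splitlines():
        match = DELTA_RE.match(line)
        if match and "RmInitAdapter failed!" in match.group(2):
            delays.append(float(match.group(1)))
    return delays


if __name__ == "__main__":
    for seconds in failure_delays(SAMPLE_LOG):
        print(f"waited {seconds:.1f} s before the failure was logged")
```

Keep in mind the delta is measured from the previous message, whatever it was, so an unrelated line logged in between shortens the number shown.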

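You didn't mention the brand of your laptop, but you specifically wrote "dGPU", which I assume means a discrete or dedicated Nvidia graphics chip, so you can still use the video output generated by the CPU's integrated graphics.

The GPU may still be able to receive power and respond to basic system management requests (the powering on and off you mentioned), but the rest of the GPU is inaccessible, most likely because of a hardware failure.

I think this is the best assessment that can be made from the information provided.

One other troubleshooting step is to boot from an unmodified Live CD, just to be absolutely sure that the problem is not with your installed operating system.

If you search Google for the GPU-related error messages in the log, you will find links to the Nvidia support forum. Here is what they suggest:

Can you run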

nvidia-bug-report.sh

on your machine and email the output file to linux-bugs [at] nvidia.com? In the rare case that collecting this bug report causes your machine to lock up, please run

nvidia-bug-report.sh --safe-mode

Nvidia support forum
