- Device: HP ProBook 470 G4
- Integrated GPU: Intel HD Graphics 620
- Dedicated GPU: NVIDIA GeForce 930MX
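My laptop just came back from the service center (because of a CPU failure). Before the CPU failed, everything worked fine. I have now installed Ubuntu 20.04 and the proprietary NVIDIA driver.
Note: I have tried every driver version. My GPU supports 390, 418, 430, 435, 440, 450 and 455. One odd thing: when I install 440, APT installs 450 instead. The same happens with 430 and 418, and 435 is replaced by 455. Anyway, here is my problem:
When I boot the laptop, it hangs on a black screen before gdm3 starts. I can't even switch TTYs. Only SSH works. When I pulled the dmesg log, I saw the following: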
[ 16.620560] ACPI Warning: \_SB.PCI0.RP01.PXSX._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20200528/nsarguments-59)
[ 17.126534] r8169 0000:02:00.0 enp2s0: Link is Up - 100Mbps/Full - flow control off
[ 17.126546] IPv6: ADDRCONF(NETDEV_CHANGE): enp2s0: link becomes ready
[ 18.695141] pcieport 0000:00:1c.0: AER: Uncorrected (Non-Fatal) error received: 0000:00:1c.0
[ 18.695154] pcieport 0000:00:1c.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 18.695159] pcieport 0000:00:1c.0: AER: device [8086:9d10] error status/mask=00100000/00010000
[ 18.695161] pcieport 0000:00:1c.0: AER: [20] UnsupReq (First)
[ 18.695164] pcieport 0000:00:1c.0: AER: TLP Header: 34000000 00000010 00000000 00000000
[ 18.695173] nvidia 0000:01:00.0: AER: can't recover (no error_detected callback)
[ 18.695208] pcieport 0000:00:1c.0: AER: device recovery failed
[ 18.699191] NVRM: GPU at PCI:0000:01:00: GPU-9fe5f99e-479c-1100-e75b-dc4310990232
[ 18.699194] NVRM: Xid (PCI:0000:01:00): 79, pid=1521, GPU has fallen off the bus.
[ 18.699197] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[ 18.699206] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.
[ 19.031183] EXT4-fs (dm-3): mounted filesystem with ordered data mode. Opts: (null)
[ 19.423191] irq 16: nobody cared (try booting with the "irqpoll" option)
[ 19.423195] CPU: 3 PID: 0 Comm: swapper/3 Tainted: P OE 5.8.0-050800-generic #202008022230
[ 19.423195] Hardware name: HP HP ProBook 470 G4/8234, BIOS P85 Ver. 01.37 10/19/2020
[ 19.423196] Call Trace:
[ 19.423197] <IRQ>
[ 19.423202] dump_stack+0x70/0x8d
[ 19.423205] __report_bad_irq+0x3a/0xaf
[ 19.423206] note_interrupt.cold+0x8/0x60
[ 19.423208] handle_irq_event+0xaa/0xb1
[ 19.423208] handle_fasteoi_irq+0x7d/0x1c0
[ 19.423210] asm_call_on_stack+0x12/0x20
[ 19.423211] </IRQ>
[ 19.423213] common_interrupt+0xbc/0x160
[ 19.423214] asm_common_interrupt+0x1e/0x40
[ 19.423215] RIP: 0010:poll_idle+0x9b/0xb9
[ 19.423217] Code: 44 89 e8 41 5c 41 5d 41 5e 41 5f 5d c3 4c 89 f7 48 89 de e8 77 71 dd ff 49 89 c6 b8 c9 00 00 00 49 8b 17 83 e2 08 75 b1 f3 90 <83> e8 01 75 f1 65 8b 3d 59 97 04 63 e8 34 f6 51 ff 4c 29 e0 4c 39
[ 19.423217] RSP: 0018:ffffa8f3000ffe10 EFLAGS: 00000246
[ 19.423218] RAX: 0000000000000020 RBX: ffff9b8bc05b7500 RCX: 000000000000001f
[ 19.423219] RDX: 0000000000000000 RSI: ffff9b8bc05b7500 RDI: ffffffff9df6d760
[ 19.423219] RBP: ffffa8f3000ffe38 R08: 0000000485b61e74 R09: 0000000000000001
[ 19.423220] R10: 0000000000000003 R11: ffff9b8bc05ab364 R12: 0000000485b61e74
[ 19.423221] R13: 0000000000000000 R14: 00000000000007d0 R15: ffff9b8bb5300000
[ 19.423223] cpuidle_enter_state+0x81/0x3f0
[ 19.423224] cpuidle_enter+0x2e/0x40
[ 19.423226] cpuidle_idle_call+0x145/0x200
[ 19.423227] do_idle+0x7a/0xe0
[ 19.423228] cpu_startup_entry+0x20/0x30
[ 19.423230] start_secondary+0xe6/0x100
[ 19.423232] secondary_startup_64+0xb6/0xc0
[ 19.423233] handlers:
[ 19.423236] [<00000000750c932b>] i801_isr [i2c_i801]
[ 19.423237] Disabling IRQ #16
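For anyone reading a log like this over SSH, a small filter makes the relevant lines easier to spot. This is only a minimal sketch and not part of the original post: the function names gpu_bus_errors and count_gpu_bus_errors are made up here, and the patterns are simply the prefixes visible in the log above.

```shell
# Keep only PCIe AER lines, NVIDIA driver (NVRM) lines and the
# "nobody cared" IRQ complaint from kernel log text on stdin.
gpu_bus_errors() {
    grep -E 'AER:|NVRM:|nobody cared'
}

# Count how many log lines on stdin mention AER and NVRM.
count_gpu_bus_errors() {
    awk '
        /AER:/  { aer++ }
        /NVRM:/ { nvrm++ }
        END { printf "AER=%d NVRM=%d\n", aer, nvrm }
    '
}

# Typical use over SSH (reading the kernel log may need root):
#   sudo dmesg | gpu_bus_errors
#   sudo dmesg | count_gpu_bus_errors
```

A steadily growing AER count would suggest the errors keep arriving rather than being a one-off at boot.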
I can always run
sudo prime-select intel && sudo systemctl restart gdm3
over SSH to get the display manager working, but the NVIDIA card itself still does not work.
Note: I don't think this means the GPU is faulty. I can get the GPU working by adding some boot parameters. For example, I tried these:
quiet splash rcutree.rcu_idle_gp_delay=1 acpi_osi=! acpi_osi='Windows 2009' pci=nomsi
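Adding the parameters fixed the problem, but only for one boot. On that boot everything worked, even suspend. The GPU was working, so I know it is not faulty. It also works fine in Windows. When I rebooted the laptop, it hung on the black screen again (yes, I updated GRUB to make the change permanent).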
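When a fix appears to last for only one boot, it can help to confirm that the parameters really were on the kernel command line during the failing boot. The following is a rough sketch under that assumption; missing_params is a hypothetical helper name, not an existing tool.

```shell
# Print every expected kernel parameter that is NOT present in the given
# command line string. Arguments: the command line, then the parameters.
missing_params() {
    cmdline=$1
    shift
    for param in "$@"; do
        case " $cmdline " in
            *" $param "*) ;;                 # present as a whole word
            *) printf '%s\n' "$param" ;;     # not found
        esac
    done
}

# On the affected machine, over SSH:
#   missing_params "$(cat /proc/cmdline)" pci=nomsi rcutree.rcu_idle_gp_delay=1
```

No output means every listed parameter was active for the current boot, so the black screen happened despite them.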
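nomsi disables MSI, but it does not solve my problem. The GPU still "falls off the bus", only with a different error message (unable to enable MSI).
Is there a way to disable the PCIe errors so that the NVIDIA driver doesn't crash? I really think it crashes because the kernel keeps sending it error messages. Any help would be appreciated.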
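Edit 1: I tried the irqpoll option, but it fixed nothing. Strangely, everything works fine on Windows. It's only Ubuntu (I may try another distribution if necessary). I can't open the laptop's case, because that would void the repair warranty.
Edit 2: output of lspci -tv: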
-[0000:00]-+-00.0 Intel Corporation Xeon E3-1200 v6/7th Gen Core Processor Host Bridge/DRAM Registers
           +-02.0 Intel Corporation HD Graphics 620
           +-14.0 Intel Corporation Sunrise Point-LP USB 3.0 xHCI Controller
           +-14.2 Intel Corporation Sunrise Point-LP Thermal subsystem
           +-17.0 Intel Corporation Sunrise Point-LP SATA Controller [AHCI mode]
           +-1c.0-[01]----00.0 NVIDIA Corporation GM108M [GeForce 930MX]
           +-1c.4-[02]----00.0 Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
           +-1c.5-[03]----00.0 Intel Corporation Wireless 7265
           +-1d.0-[04]----00.0 Realtek Semiconductor Co., Ltd. RTS522A PCI Express Card Reader
           +-1f.0 Intel Corporation Sunrise Point-LP LPC Controller
           +-1f.2 Intel Corporation Sunrise Point-LP PMC
           +-1f.3 Intel Corporation Sunrise Point-LP HD Audio
           \-1f.4 Intel Corporation Sunrise Point-LP SMBus
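Answer 1
It looks like a hardware problem. I'm not sure what is wrong with the GPU, but I think it is not connected to the motherboard properly. I will try opening the laptop to see what is wrong. If I can't fix it, I'll send the laptop to the service center again.
My findings:
- It never happened before
- It has now started happening in Windows too (Device Manager shows something like error 46)
- It doesn't happen on every boot. Sometimes the GPU works, but it stops working after the next reboot, hibernate or suspend.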
- Even with the Intel GPU selected, I got random PCIe bus errors (more than 100 dmesg messages per second). Removing the GPU from the kernel (by writing 1 to /sys/bus/pci/devices/0000:01:00.0/remove) fixes this without a reboot.
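The sysfs removal mentioned in the last point could be wrapped roughly like this. It is a sketch only: remove_pci_device and the SYSFS_ROOT variable are invented for this example (SYSFS_ROOT exists just so the function can be tried against a scratch directory), and on the real system the write needs root.

```shell
# Detach a PCI device from the running kernel by writing 1 to its sysfs
# "remove" node. SYSFS_ROOT is a made-up override that defaults to /sys.
remove_pci_device() {
    addr=$1
    node="${SYSFS_ROOT:-/sys}/bus/pci/devices/$addr/remove"
    if [ ! -e "$node" ]; then
        printf 'no such device node: %s\n' "$node" >&2
        return 1
    fi
    printf '1\n' > "$node"
}

# On the real system this needs root, for example:
#   sudo sh -c 'echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove'
# A bus rescan asks the kernel to look for devices again without a reboot:
#   sudo sh -c 'echo 1 > /sys/bus/pci/rescan'
```

The removal only lasts until the next rescan or reboot, so it works around the flood of bus errors without addressing whatever causes them.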
Answer 2
Device 1c.0 is causing the problem... and AER (Advanced Error Reporting) is reporting it...
+-1c.0-[01]----00.0 NVIDIA Corporation GM108M [GeForce 930MX]
Although, like you, I suspect a hardware problem, for testing purposes we can try this...
AER
sudo -H gedit /etc/default/grub
# edit this file
Find:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash"
Change it to:
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=noaer"
Save the file.
sudo update-grub
# update GRUB
reboot
# restart the computer
Otherwise, you'll need to send it back to the service center.
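Since the affected machine may only be reachable over SSH, the GRUB edit above could also be done without a graphical editor. This is a hedged sketch rather than part of the original answer: add_grub_param is a hypothetical helper, and it assumes the parameter contains no characters that are special to sed (such as | or &).

```shell
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in a grub
# defaults file unless it is already there. Keeps a .bak copy of the file.
add_grub_param() {
    file=$1
    param=$2
    # Current value between the double quotes.
    current=$(sed -n 's/^GRUB_CMDLINE_LINUX_DEFAULT="\(.*\)"$/\1/p' "$file")
    case " $current " in
        *" $param "*) return 0 ;;   # already present, nothing to do
    esac
    cp "$file" "$file.bak"
    # Rewrite only the matching line; every other line passes through as-is.
    sed "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"\$|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 $param\"|" \
        "$file.bak" > "$file"
}

# On the real system (as root), followed by the same steps as above:
#   add_grub_param /etc/default/grub pci=noaer
#   update-grub
#   reboot
```

Note that pci=noaer only turns off AER reporting in the kernel; if the link to the GPU is physically faulty, the card can still drop off the bus.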