AMD Radeon graphics card lockups: could this be a hardware problem?

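I have had this AMD-based desktop with a Radeon Vega 56 graphics card for about 2.5 years. It has been very stable the whole time, including while gaming, which makes it run like a space heater. Over the past month it has crashed a few times, which isn't great, but I've been busy, so I rebooted and moved on. Today, however, it keeps crashing. The crashes produce logs like this: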

Jan 16 17:05:16 [hostname] kernel: rfkill: input handler disabled
Jan 16 17:05:21 [hostname] kernel: snd_hda_intel 0000:28:00.1: can't change power state from D0 to D3hot (config space inaccessible)
Jan 16 17:05:28 [hostname] kernel: [drm:amdgpu_dm_atomic_commit_tail [amdgpu]] *ERROR* Waiting for fences timed out!
Jan 16 17:05:28 [hostname] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=77, emitted seq=79
Jan 16 17:05:28 [hostname] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process gnome-shell pid 1396 thread gnome-shel:cs0 pid 1453
Jan 16 17:05:28 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset begin!
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xa, input parameter: 0xf1b000, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:05:28 [hostname] kernel: [drm] REG_WAIT timeout 10us * 3000 tries - dce110_stream_encoder_dp_blank line:955
Jan 16 17:05:48 [hostname] kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
Jan 16 17:05:48 [hostname] kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DF8C (len 824, WS 0, PS 0) @ 0xE10C
Jan 16 17:05:48 [hostname] kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DE46 (len 326, WS 0, PS 0) @ 0xDF36
Jan 16 17:05:48 [hostname] kernel: [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!
Jan 16 17:06:08 [hostname] kernel: [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
Jan 16 17:06:08 [hostname] kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing C0B6 (len 62, WS 0, PS 0) @ 0xC0D2
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x1, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x4c, input parameter: 0x3, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xa, input parameter: 0xf1b000, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x42, input parameter: 0x1, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x24, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:08 [hostname] kernel: [drm:dce110_vblank_set [amdgpu]] *ERROR* Failed to get VBLANK!
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x800000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x22, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x25, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x30, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xa, input parameter: 0xf1b000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0xe, input parameter: 0x0, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x10000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x4000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x8000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x8000000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x400, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x1000000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x30f, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x800, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x1000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x2000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x80000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x40, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x10000000, error code: 0xffffffff
Jan 16 17:06:09 [hostname] kernel: amdgpu 0000:28:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
Jan 16 17:06:10 [hostname] kernel: [drm] Timeout wait for RLC serdes 0,0
Jan 16 17:06:10 [hostname] kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* ring_buffer_start = 0000000034d786ac; ring_buffer_end = 00000000c05dc59d; write_frame = 0000000094e0183d
Jan 16 17:06:10 [hostname] kernel: [drm:psp_ring_cmd_submit [amdgpu]] *ERROR* write_frame is pointing to address out of bounds
Jan 16 17:06:10 [hostname] kernel: [drm:psp_suspend [amdgpu]] *ERROR* Failed to unload asd
Jan 16 17:06:10 [hostname] kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <psp> failed -22
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: MODE1 reset
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU mode1 reset
Jan 16 17:06:10 [hostname] kernel: [drm] psp is not working correctly before mode1 reset!
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU mode1 reset failed
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: ASIC reset failed with error, -22 for drm dev, 0000:28:00.0
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset(2) failed
Jan 16 17:06:10 [hostname] kernel: snd_hda_intel 0000:28:00.1: can't change power state from D3cold to D0 (config space inaccessible)
Jan 16 17:06:10 [hostname] kernel: snd_hda_intel 0000:28:00.1: CORB reset timeout#2, CORBRP = 65535
Jan 16 17:06:10 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset end with ret = -22
Jan 16 17:06:20 [hostname] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered
Jan 16 17:06:30 [hostname] kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, but soft recovered

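When this happens, the monitor suddenly reports no signal and goes dark. The system has not actually shut down, though: I can log in over SSH and look at logs, add and remove software, and so on.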
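Since the log is still readable over SSH, the wall of repeated powerplay lines can be condensed into a tally. The following is only a minimal Python sketch: the helper name `summarize_powerplay_failures` is made up for illustration, and the regular expression assumes the exact "Failed message" wording shown in the log above.

```python
import re
from collections import Counter

# Matches the powerplay lines from the log excerpt, e.g.
#   amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0xffffffff
# Assumption: the wording is exactly as in the excerpt; other kernels may differ.
POWERPLAY_RE = re.compile(
    r"\[powerplay\] Failed message: (0x[0-9a-f]+), "
    r"input parameter: (0x[0-9a-f]+), error code: (0x[0-9a-f]+)"
)


def summarize_powerplay_failures(lines):
    """Count failed powerplay messages per message ID and collect the error codes seen.

    Hypothetical helper for illustration; it only tallies, it does not diagnose.
    """
    by_message = Counter()
    error_codes = set()
    for line in lines:
        match = POWERPLAY_RE.search(line)
        if match:
            message_id, _parameter, error_code = match.groups()
            by_message[message_id] += 1
            error_codes.add(error_code)
    return by_message, error_codes


# Small demonstration using lines copied from the log above.
sample = [
    "Jan 16 17:05:28 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0xffffffff",
    "Jan 16 17:06:08 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x9, input parameter: 0xf4, error code: 0xffffffff",
    "Jan 16 17:06:09 [hostname] kernel: amdgpu: [powerplay] Failed message: 0x5, input parameter: 0x800000, error code: 0xffffffff",
    "Jan 16 17:06:10 [hostname] kernel: [drm] Timeout wait for RLC serdes 0,0",
]
counts, codes = summarize_powerplay_failures(sample)
for message_id, count in counts.most_common():
    print(f"message {message_id}: {count} failure(s)")
print("error codes seen:", ", ".join(sorted(codes)))
```

To run it against the real log, the same function can be fed any iterable of lines, for example `sys.stdin` when the kernel log is piped in from `journalctl -k`.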

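I knew there was a new kernel (5.10) and a newer Mesa in updates-testing, so the first thing I did was roll those back, but the problem persisted. In fact, it got worse: at first it happened a few times over several hours, but while I was trying to diagnose it, sometimes it wouldn't even let me log in before crashing. So, the problem occurs with: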

  • kernel 5.10.7
  • kernel 5.9.16

  • mesa-*20.2.6
  • mesa-*20.3.3

I even booted from a Fedora 33 Live image, and although I couldn't ssh in to check, I got the same crash, with the monitor cutting out after less than 5 minutes.

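It's odd for this to start so suddenly. I've done some basic web searching, but most of what I find is old and points to various driver issues and card quirks. It seems that if that were the problem, it would have been happening all along.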

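I also don't think it's running especially hot: I was previously playing Baldur's Gate 3 under Wine (quite a lot, over the holidays) with no problems, even though the fans were definitely running and it was putting out heat like a space heater. Today, I powered it off for half an hour, and it still froze a few minutes after starting up again.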

I tried sudo cat /sys/kernel/debug/dri/0/amdgpu_gpu_recover as suggested here, but that just gave me

Jan 16 21:41:53 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: GPU reset begin!
Jan 16 21:41:53 [hostname] kernel: amdgpu 0000:28:00.0: amdgpu: Bailing on TDR for s_job:ffffffffffffffff, as another already in progress

in the log.

Any insights? Anything I should try?
