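I recently built an all-AMD system with a Ryzen 7 3700X CPU and an RX 5500 XT Phantom D Gaming GPU. I have an Aorus Pro Wifi motherboard and 32GB of Trident Z Neo RAM with XMP enabled.
I am running Ubuntu 20.10 with the 5.6.13-050613-generic kernel.
I keep hitting a problem where the amdgpu driver freezes GNOME and every window on screen (but not the mouse pointer). Only a power cycle fixes it, although SSHing into the machine still works (so the kernel itself has not hung).
Here is an excerpt from the kernel log for this crash: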
635:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: [gfxhub] page fault (src_id:0 ring:40 vmid:0 pasid:0, for process pid 0 thread pid 0)
636:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: in page starting at address 0x0000000000888000 from client 27
637:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00041C50
638:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: MORE_FAULTS: 0x0
639:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: WALKER_ERROR: 0x0
640:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: PERMISSION_FAULTS: 0x5
641:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: MAPPING_ERROR: 0x0
642:May 17 16:29:09 arctic kernel: amdgpu 0000:0b:00.0: RW: 0x1
645:May 17 16:29:19 arctic kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
646:May 17 16:29:19 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=10870, emitted seq=10872
647:May 17 16:29:19 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
648:May 17 16:29:19 arctic kernel: amdgpu 0000:0b:00.0: GPU reset begin!
649:May 17 16:29:21 arctic kernel: amdgpu 0000:0b:00.0: GPU reset succeeded, trying to resume
654:May 17 16:29:21 arctic kernel: amdgpu: [powerplay] SMU is resuming...
655:May 17 16:29:21 arctic kernel: amdgpu: [powerplay] SMU is resumed successfully!
659:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
660:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
661:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
662:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
663:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
664:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
665:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
666:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
667:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
668:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
669:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring sdma0 uses VM inv eng 12 on hub 0
670:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring sdma1 uses VM inv eng 13 on hub 0
671:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring vcn_dec uses VM inv eng 0 on hub 1
672:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring vcn_enc0 uses VM inv eng 1 on hub 1
673:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring vcn_enc1 uses VM inv eng 4 on hub 1
674:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: ring jpeg_dec uses VM inv eng 5 on hub 1
680:May 17 16:29:22 arctic kernel: amdgpu 0000:0b:00.0: GPU reset(1) succeeded!
688:May 17 16:29:22 arctic /usr/lib/gdm3/gdm-x-session[2329]: amdgpu: amdgpu_cs_query_fence_status failed.
689:May 17 16:29:22 arctic gnome-shell[2678]: amdgpu: amdgpu_cs_query_fence_status failed.
709:May 17 16:33:23 arctic kernel: [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
728:May 17 16:39:00 arctic kernel: [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
852:May 17 16:49:44 arctic kernel: [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
917:May 17 20:12:32 arctic kernel: [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
Here is a similar crash on 5.6.13:
May 18 03:41:05 arctic kernel: [drm:mod_hdcp_add_display_topology [amdgpu]] *ERROR* Failed to add display topology, DTM TA is not initialized.
May 18 03:41:05 arctic kernel: amdgpu 0000:0b:00.0: [gfxhub] page fault (src_id:0 ring:40 vmid:0 pasid:0, for process pid 0 thread pid 0)
May 18 03:41:05 arctic kernel: amdgpu 0000:0b:00.0: in page starting at address 0x00000000008fc000 from client 27
May 18 03:41:05 arctic kernel: amdgpu 0000:0b:00.0: GCVM_L2_PROTECTION_FAULT_STATUS:0x00041A50
May 18 03:41:05 arctic kernel: amdgpu 0000:0b:00.0: MORE_FAULTS: 0x0
May 18 03:41:05 arctic kernel: amdgpu 0000:0b:00.0: WALKER_ERROR: 0x0
May 18 03:41:05 arctic kernel: amdgpu 0000:0b:00.0: PERMISSION_FAULTS: 0x5
May 18 03:41:05 arctic kernel: amdgpu 0000:0b:00.0: MAPPING_ERROR: 0x0
May 18 03:41:05 arctic kernel: amdgpu 0000:0b:00.0: RW: 0x1
May 18 03:41:16 arctic kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
May 18 03:41:16 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=6205, emitted seq=6208
May 18 03:41:16 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
May 18 03:41:16 arctic kernel: amdgpu 0000:0b:00.0: GPU reset begin!
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: GPU reset succeeded, trying to resume
May 18 03:41:18 arctic kernel: [drm] PCIE GART of 512M enabled (table at 0x0000008000E10000).
May 18 03:41:18 arctic kernel: [drm] VRAM is lost due to GPU reset!
May 18 03:41:18 arctic kernel: [drm] PSP is resuming...
May 18 03:41:18 arctic kernel: [drm] reserve 0xa00000 from 0x81fe400000 for PSP TMR
May 18 03:41:18 arctic kernel: amdgpu: [powerplay] SMU is resuming...
May 18 03:41:18 arctic kernel: amdgpu: [powerplay] SMU is resumed successfully!
May 18 03:41:18 arctic kernel: [drm] kiq ring mec 2 pipe 1 q 0
May 18 03:41:18 arctic kernel: [drm] VCN decode and encode initialized successfully(under DPG Mode).
May 18 03:41:18 arctic kernel: [drm] JPEG decode initialized successfully.
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.0.0 uses VM inv eng 1 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.1.0 uses VM inv eng 4 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.2.0 uses VM inv eng 5 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.3.0 uses VM inv eng 6 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.0.1 uses VM inv eng 7 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.1.1 uses VM inv eng 8 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.2.1 uses VM inv eng 9 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring comp_1.3.1 uses VM inv eng 10 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring sdma0 uses VM inv eng 12 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring sdma1 uses VM inv eng 13 on hub 0
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring vcn_dec uses VM inv eng 0 on hub 1
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring vcn_enc0 uses VM inv eng 1 on hub 1
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring vcn_enc1 uses VM inv eng 4 on hub 1
May 18 03:41:18 arctic kernel: amdgpu 0000:0b:00.0: ring jpeg_dec uses VM inv eng 5 on hub 1
May 18 03:41:18 arctic kernel: [drm] recover vram bo from shadow start
May 18 03:41:18 arctic kernel: [drm] recover vram bo from shadow done
May 18 03:41:18 arctic kernel: [drm] Skip scheduling IBs!
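Both excerpts print a raw GCVM_L2_PROTECTION_FAULT_STATUS value followed by its decoded fields. As a minimal sketch, the decoding can be reproduced with shell arithmetic, which makes it easy to compare faults from different crashes. The bit positions here are an assumption inferred so that they reproduce the driver's own output above; the function name is only illustrative.

```shell
#!/usr/bin/env bash
# Decode a GCVM_L2_PROTECTION_FAULT_STATUS value into the fields that the
# driver printed in the logs above.
# ASSUMPTION: the bit positions are inferred to match the driver's output
# for this GPU; verify them against the amdgpu register headers before
# relying on them for other hardware.
decode_fault_status() {
    local status=$(( $1 ))
    printf 'MORE_FAULTS: 0x%x\n'       $(( status         & 0x1 ))
    printf 'WALKER_ERROR: 0x%x\n'      $(( (status >> 1)  & 0x7 ))
    printf 'PERMISSION_FAULTS: 0x%x\n' $(( (status >> 4)  & 0xf ))
    printf 'MAPPING_ERROR: 0x%x\n'     $(( (status >> 8)  & 0x1 ))
    printf 'RW: 0x%x\n'                $(( (status >> 18) & 0x1 ))
}

decode_fault_status 0x00041C50
```

Run against 0x00041C50 and 0x00041A50, this yields the same five field values the driver logged for each crash, so the two faults differ only in bits outside these fields.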
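Here are some logs (from different kernel versions; sorry, I'm not sure which log came from which kernel):
I have upgraded from kernel 5.4 to 5.5.19 and then to 5.6.13, but the problem persists.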
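One possible way to work out which kernel a given crash belongs to is to pair each GPU reset with the most recent boot banner in the same log file. This is only a sketch: it assumes the file still contains the kernel's "Linux version ..." line for each boot, and the function name and default path are hypothetical.

```shell
#!/usr/bin/env bash
# For each "GPU reset begin!" line in a syslog-style file, print the kernel
# version announced by the most recent "Linux version ..." boot banner,
# followed by the timestamp of the reset.
# ASSUMPTION: boot banners are present in the file; the function name and
# the default path are illustrative only.
resets_by_kernel() {
    awk '
        /kernel: .*Linux version / {
            for (i = 1; i < NF; i++)
                if ($i == "Linux" && $(i + 1) == "version") {
                    ver = $(i + 2)
                    break
                }
        }
        /GPU reset begin!/ {
            k = (ver == "" ? "unknown" : ver)
            print k ": " $1, $2, $3
        }
    ' "${1:-/var/log/syslog}"
}
```

On systemd-based installs, `journalctl --list-boots` and `journalctl -k -b -1` offer another route, since they separate kernel messages by boot.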
Below is the crash log from a time when the monitor randomly disconnected (kernel 5.6.13):
May 18 02:30:57 arctic kernel: [drm:amdgpu_dm_commit_planes.constprop.0 [amdgpu]] *ERROR* Waiting for fences timed out!
May 18 02:30:57 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=167698, emitted seq=167700
May 18 02:30:57 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2090 thread Xorg:cs0 pid 2091
May 18 02:30:57 arctic kernel: amdgpu 0000:0b:00.0: GPU reset begin!
May 18 02:30:59 arctic kernel: amdgpu: [powerplay] failed send message: DisallowGfxOff (42) param: 0x00000000 response 0xffffffc2
May 18 02:31:02 arctic /usr/lib/gdm3/gdm-x-session[2090]: (II) event12 - Logitech MX Master 3000: SYN_DROPPED event - some input events have been lost.
May 18 02:31:02 arctic kernel: amdgpu: [powerplay] Msg issuing pre-check failed and SMU may be not in the right state!
May 18 02:31:02 arctic /usr/lib/gdm3/gdm-x-session[2090]: (EE) client bug: timer event12 debounce: scheduled expiry is in the past (-194ms), your system is too slow
May 18 02:31:02 arctic kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
May 18 02:31:02 arctic kernel: [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
May 18 02:31:04 arctic kernel: amdgpu: [powerplay] Msg issuing pre-check failed and SMU may be not in the right state!
May 18 02:31:04 arctic kernel: [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62
May 18 02:31:07 arctic kernel: amdgpu: [powerplay] Msg issuing pre-check failed and SMU may be not in the right state!
May 18 02:31:07 arctic kernel: [drm:amdgpu_device_gpu_recover.cold [amdgpu]] *ERROR* ASIC reset failed with error, -62 for drm dev, 0000:0b:00.0
May 18 02:31:07 arctic kernel: amdgpu 0000:0b:00.0: GPU reset(1) failed
May 18 02:31:07 arctic kernel: amdgpu 0000:0b:00.0: GPU reset end with ret = -62
May 18 02:31:12 arctic /usr/lib/gdm3/gdm-x-session[2090]: (EE) client bug: timer event12 debounce short: scheduled expiry is in the past (-5ms), your system is too slow
May 18 02:31:17 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=167700, emitted seq=167700
May 18 02:31:17 arctic kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 2090 thread Xorg:cs0 pid 2091
May 18 02:31:17 arctic kernel: amdgpu 0000:0b:00.0: GPU reset begin!
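The excerpts above show timeouts on sdma1, sdma0 and gfx_0.0.0, so the hangs are not confined to one ring. A rough sketch for tallying this across a whole log file follows; it assumes the messages keep the "ring <name> timeout" wording quoted above, and the function name and default path are hypothetical.

```shell
#!/usr/bin/env bash
# Count amdgpu ring timeouts per ring in a syslog-style file, most
# frequent ring first.
# ASSUMPTION: the message format matches the "ring <name> timeout" lines
# quoted above; the function name and default path are illustrative.
ring_timeouts() {
    grep -o 'ring [^ ]* timeout' "${1:-/var/log/syslog}" \
        | awk '{ count[$2]++ } END { for (r in count) print count[r], r }' \
        | sort -k1,1nr -k2,2
}
```

Calling `ring_timeouts /var/log/syslog` prints one line per ring with its timeout count, which may help when reporting the bug upstream.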
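I have set AMD_DEBUG=nodma,nongg, but that did not help. I could update the motherboard BIOS, although I am only one version behind the latest and that release only offers "memory enhancements". I could also try the proprietary amdgpu-pro driver instead of the open-source amdgpu driver. Beyond that I am out of ideas. I have already tried three different kernels... does anyone have any suggestions?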
$ glxinfo | grep "OpenGL version"
OpenGL version string: 4.6 (Compatibility Profile) Mesa 20.0.6
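Answer 1
I spent the better part of a day trying to fix this. I won't even list everything I tried.
I could run two monitors, but as soon as I plugged in a third the system would freeze (crashes, corrupted graphics memory, half-crashes, kernel panics, you name it).
The key log lines in /var/log/syslog were: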
amdgpu: failed to write reg 28b4 wait reg 28c6
amdgpu: failed to write reg 1a6f4 wait reg 1a706
amdgpu: failed send message: NumOfDisplays (64) param: 0x00000003 response 0xffffffc2
amdgpu: Msg issuing pre-check failed and SMU may be not in the right state!
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma0 timeout, signaled seq=3474, emitted seq=3476
[drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process pid 0 thread pid 0
amdgpu 0000:0a:00.0: amdgpu: GPU reset begin!
...
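In the end I decided to upgrade to the latest 5.7 kernel I could find, since some people seemed to have better luck with newer kernels.
In my case, I downloaded the .deb packages and installed them: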
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.7.19/amd64/linux-headers-5.7.19-050719-generic_5.7.19-050719.202008270830_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.7.19/amd64/linux-image-unsigned-5.7.19-050719-generic_5.7.19-050719.202008270830_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.7.19/amd64/linux-modules-5.7.19-050719-generic_5.7.19-050719.202008270830_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.7.19/amd64/linux-headers-5.7.19-050719_5.7.19-050719.202008270830_all.deb
sudo dpkg -i linux-*.deb
It worked immediately. All four monitors are working.
Hopefully it doesn't all fall apart now that I've posted this!
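The four download commands in this answer differ only in the package name, so they could be generated from the version, build suffix and timestamp. This is a sketch under the assumption that other mainline builds follow the same naming scheme as the 5.7.19 URLs above, which should be checked against the directory listing first; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Print the four mainline-kernel package URLs for one build.
# ASSUMPTION: other builds use the same naming scheme as the 5.7.19 URLs
# in this answer; check the directory listing before using other values.
mainline_urls() {
    local ver="$1" abi="$2" stamp="$3"
    local base="https://kernel.ubuntu.com/~kernel-ppa/mainline/v${ver}/amd64"
    local full="${ver}-${abi}"
    echo "${base}/linux-headers-${full}-generic_${full}.${stamp}_amd64.deb"
    echo "${base}/linux-image-unsigned-${full}-generic_${full}.${stamp}_amd64.deb"
    echo "${base}/linux-modules-${full}-generic_${full}.${stamp}_amd64.deb"
    echo "${base}/linux-headers-${full}_${full}.${stamp}_all.deb"
}

# Usage (downloads, then installs; review the URLs before running):
#   mainline_urls 5.7.19 050719 202008270830 | xargs -n 1 wget
#   sudo dpkg -i linux-*.deb
```

After rebooting, `uname -r` should report the new kernel if the install succeeded and the bootloader picked it up.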