AMD Radeon Instinct MI25 fails to initialize: [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22

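I bought a cheap second-hand MI25 card and put it into a computer running Debian, hoping to do some ROCm-accelerated number crunching (machine learning, for example). The kernel fails to initialize the card, which crashes udev and makes booting take a very long time. The following appears on screen: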

[    6.464978] amdgpu 0000:10:00.0: amdgpu: MEM ECC is active.
[    6.465759] amdgpu 0000:10:00.0: amdgpu: SRAM ECC is not presented.
[    6.466466] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[    6.467209] amdgpu 0000:10:00.0: amdgpu: VRAM: 16368M 0x000000F400000000 - 0x000000F7FEFFFFFF (16368M used)
[    6.468262] amdgpu 0000:10:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    6.469079] amdgpu 0000:10:00.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
[    6.469908] [drm] Detected VRAM RAM=16368M, BAR=16384M
[    6.470728] [drm] RAM width 2048bits HBM
[    6.471573] [drm] amdgpu: 16368M of VRAM memory ready
[    6.472547] [drm] amdgpu: 64236M of GTT memory ready.
[    6.473345] [drm] GART: num cpu pages 131072, num gpu pages 131072
[    6.474611] [drm] PCIE GART of 512M enabled.
[    6.475489] [drm] PTB located at 0x000000F400900000
[    6.476485] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_sos.bin
[    6.477645] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_asd.bin
[    6.478433] amdgpu 0000:10:00.0: amdgpu: PSP runtime database doesn't exist
[    6.479257] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega10_smu
[    6.480187] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_smc.bin
[    6.480973] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_pfp.bin
[    6.481765] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_me.bin
[    6.482596] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_ce.bin
[    6.483340] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_rlc.bin
[    6.484159] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_mec.bin
[    6.485022] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_mec2.bin
[    6.486890] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_uvd.bin
[    6.487735] [drm] Found UVD firmware Version: 66.43 Family ID: 17
[    6.488446] [drm] PSP loading UVD firmware
[    6.490195] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_vce.bin
[    6.491051] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
[    6.491994] [drm] PSP loading VCE firmware
[    6.770117] [drm:psp_hw_start [amdgpu]] *ERROR* PSP load sos failed!
[    6.770973] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[    6.771657] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[    6.772277] amdgpu 0000:10:00.0: amdgpu: amdgpu_device_ip_init failed
[    6.772890] amdgpu 0000:10:00.0: amdgpu: Fatal error during GPU init
[    6.773568] amdgpu 0000:10:00.0: amdgpu: amdgpu: finishing device.
[    6.789336] BUG: kernel NULL pointer dereference, address: 0000000000000090

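How can I get this card to work? I have tried many things, including every permutation of the BIOS settings that seemed even remotely relevant (Above 4G Decoding, CSM/UEFI, and so on). I have also tried newer and older kernels, as well as AMD's custom kernel tree, all to no avail.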

Answer 1

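(I found my own solution; I apologize if answering my own question is bad form. I just don't want anyone else to suffer through this, and I hate the idea of people throwing away perfectly good silicon.)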

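First, before you do anything, note that this card is passively cooled. That means you must fit your own fan and a "shroud" (air funnel). Do not power the card on without adequate cooling! The magic smoke may come out!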

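Solution

After a lot of debugging and desperate attempts, I found that the crash happens because the firmware on the card takes too long to initialize, for whatever reason, so the driver times out and crashes. My solution was simply to patch the Linux kernel so that the delay the driver waits after issuing each PSP bootloader command goes from 20 ms to 500 ms: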

--- drivers/gpu/drm/amd/amdgpu/psp_v3_1.c       2022-10-01 00:00:00.000000000 -0100
+++ drivers/gpu/drm/amd/amdgpu/psp_v3_1.c       2022-10-01 00:00:00.000000000 -0100
@@ -107,21 +107,21 @@
        psp_copy_fw(psp, psp->sys.start_addr, psp->sys.size_bytes);
 
        /* Provide the sys driver to bootloader */
        WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
               (uint32_t)(psp->fw_pri_mc_addr >> 20));
        psp_gfxdrv_command_reg = PSP_BL__LOAD_SYSDRV;
        WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_35,
               psp_gfxdrv_command_reg);
 
        /* there might be handshake issue with hardware which needs delay */
-       mdelay(20);
+       mdelay(500);
 
        ret = psp_wait_for(psp, SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_35),
                           0x80000000, 0x80000000, false);
 
        return ret;
 }
 
 static int psp_v3_1_bootloader_load_sos(struct psp_context *psp)
 {
        int ret;
@@ -146,21 +146,21 @@
        psp_copy_fw(psp, psp->sos.start_addr, psp->sos.size_bytes);
 
        /* Provide the PSP secure OS to bootloader */
        WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
               (uint32_t)(psp->fw_pri_mc_addr >> 20));
        psp_gfxdrv_command_reg = PSP_BL__LOAD_SOSDRV;
        WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_35,
               psp_gfxdrv_command_reg);
 
        /* there might be handshake issue with hardware which needs delay */
-       mdelay(20);
+       mdelay(500);
        ret = psp_wait_for(psp, SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_81),
                           RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81),
                           0, true);
        return ret;
 }
 
 static int psp_v3_1_ring_init(struct psp_context *psp,
                              enum psp_ring_type ring_type)
 {
        int ret = 0;
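
To tell whether a given boot hit this failure (for example, when comparing an unpatched and a patched kernel), the kernel log can be scanned for the PSP error lines quoted in the question. The Python sketch below is only an illustration: `find_psp_failures` and the pattern list are hypothetical names, the patterns are copied from the log in the question, and other kernel versions may word these messages differently.

```python
import re

# Failure signatures copied from the log in the question.
# Assumption: other kernel versions may phrase these messages differently.
PSP_FAILURE_PATTERNS = [
    re.compile(r"PSP load sos failed"),
    re.compile(r"PSP firmware loading failed"),
    re.compile(r"hw_init of IP block <psp> failed"),
]


def find_psp_failures(log_text):
    """Return the log lines that match one of the PSP failure signatures."""
    hits = []
    for line in log_text.splitlines():
        if any(pattern.search(line) for pattern in PSP_FAILURE_PATTERNS):
            hits.append(line.strip())
    return hits


# A few lines taken verbatim from the log in the question.
SAMPLE = """\
[    6.491994] [drm] PSP loading VCE firmware
[    6.770117] [drm:psp_hw_start [amdgpu]] *ERROR* PSP load sos failed!
[    6.770973] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[    6.771657] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[    6.772277] amdgpu 0000:10:00.0: amdgpu: amdgpu_device_ip_init failed
"""

for failure in find_psp_failures(SAMPLE):
    print(failure)
```

On a real system, the same function could be fed the saved output of `dmesg` instead of `SAMPLE`; an empty result suggests the PSP firmware load did not fail in the way shown above.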

With this patch applied, the card boots successfully. Two more things to keep in mind when using this card:

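  1. Do not use the system-provided OpenCL implementation. In my case it crashed the card and forced a GPU reset. To be safe, I disabled it by renaming mesa.icd in /etc/OpenCL/vendors/.
  2. For some reason, my GPU's BIOS limits power to 110 W instead of the rated 300 W. There is a workaround, which also involves a kernel patch.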
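
The renaming in item 1 can be sketched in Python as follows. This is a minimal illustration, not the exact command used in the answer: `disable_icd` and the `.disabled` suffix are hypothetical choices, and the sketch assumes the OpenCL ICD loader only picks up files whose names end in `.icd`. The demo runs against a temporary directory so it does not touch the real system.

```python
import tempfile
from pathlib import Path


def disable_icd(vendors_dir, icd_name="mesa.icd", suffix=".disabled"):
    """Rename an ICD file so that its name no longer ends in .icd.

    Returns the new path, or None if the ICD file was not found.
    The suffix is an arbitrary choice made for this sketch.
    """
    src = Path(vendors_dir) / icd_name
    if not src.is_file():
        return None
    dst = src.with_name(src.name + suffix)
    src.rename(dst)
    return dst


# Demo in a throwaway directory standing in for /etc/OpenCL/vendors/.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "mesa.icd").write_text("placeholder\n")
    new_path = disable_icd(tmp)
    print(new_path.name)  # mesa.icd.disabled
```

On a real system the call would be `disable_icd("/etc/OpenCL/vendors")`, run with root privileges; renaming the file back restores the Mesa OpenCL implementation.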
