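I bought a cheap second-hand MI25 card and put it in a machine running Debian, hoping to do some ROCm-accelerated number crunching (machine learning, for example). The kernel fails to initialize the card, which crashes udev and makes booting take a very long time. The following appears on screen: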
[ 6.464978] amdgpu 0000:10:00.0: amdgpu: MEM ECC is active.
[ 6.465759] amdgpu 0000:10:00.0: amdgpu: SRAM ECC is not presented.
[ 6.466466] [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
[ 6.467209] amdgpu 0000:10:00.0: amdgpu: VRAM: 16368M 0x000000F400000000 - 0x000000F7FEFFFFFF (16368M used)
[ 6.468262] amdgpu 0000:10:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[ 6.469079] amdgpu 0000:10:00.0: amdgpu: AGP: 267419648M 0x000000F800000000 - 0x0000FFFFFFFFFFFF
[ 6.469908] [drm] Detected VRAM RAM=16368M, BAR=16384M
[ 6.470728] [drm] RAM width 2048bits HBM
[ 6.471573] [drm] amdgpu: 16368M of VRAM memory ready
[ 6.472547] [drm] amdgpu: 64236M of GTT memory ready.
[ 6.473345] [drm] GART: num cpu pages 131072, num gpu pages 131072
[ 6.474611] [drm] PCIE GART of 512M enabled.
[ 6.475489] [drm] PTB located at 0x000000F400900000
[ 6.476485] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_sos.bin
[ 6.477645] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_asd.bin
[ 6.478433] amdgpu 0000:10:00.0: amdgpu: PSP runtime database doesn't exist
[ 6.479257] amdgpu: [powerplay] hwmgr_sw_init smu backed is vega10_smu
[ 6.480187] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_smc.bin
[ 6.480973] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_pfp.bin
[ 6.481765] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_me.bin
[ 6.482596] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_ce.bin
[ 6.483340] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_rlc.bin
[ 6.484159] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_mec.bin
[ 6.485022] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_mec2.bin
[ 6.486890] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_uvd.bin
[ 6.487735] [drm] Found UVD firmware Version: 66.43 Family ID: 17
[ 6.488446] [drm] PSP loading UVD firmware
[ 6.490195] amdgpu 0000:10:00.0: firmware: direct-loading firmware amdgpu/vega10_vce.bin
[ 6.491051] [drm] Found VCE firmware Version: 57.6 Binary ID: 4
[ 6.491994] [drm] PSP loading VCE firmware
[ 6.770117] [drm:psp_hw_start [amdgpu]] *ERROR* PSP load sos failed!
[ 6.770973] [drm:psp_hw_init [amdgpu]] *ERROR* PSP firmware loading failed
[ 6.771657] [drm:amdgpu_device_fw_loading [amdgpu]] *ERROR* hw_init of IP block <psp> failed -22
[ 6.772277] amdgpu 0000:10:00.0: amdgpu: amdgpu_device_ip_init failed
[ 6.772890] amdgpu 0000:10:00.0: amdgpu: Fatal error during GPU init
[ 6.773568] amdgpu 0000:10:00.0: amdgpu: amdgpu: finishing device.
[ 6.789336] BUG: kernel NULL pointer dereference, address: 0000000000000090
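How can I get this card working? I have tried many things, including permuting every BIOS setting that seemed even remotely relevant (Above 4G Decoding, CSM/UEFI, and so on). I also tried newer and older kernels, as well as AMD's custom kernel tree, all to no avail.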
Answer 1
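(I found my own solution; apologies if answering my own question is bad form. I just don't want anyone else to go through this pain, and I hate the idea of people throwing away perfectly good silicon.)
First, before doing anything else, note that this card is passively cooled. That means you must fit your own fan and a "shroud" (air funnel). Do not power the card on without adequate cooling! The magic smoke may escape!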
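Solution
After a lot of debugging and desperate attempts, I found that the crash happens because the firmware on the card takes too long to initialize, for whatever reason, so the driver times out and crashes. My solution was simply to patch the Linux kernel so that the delay after each PSP bootloader command is raised from 20 ms to 500 ms: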
--- drivers/gpu/drm/amd/amdgpu/psp_v3_1.c 2022-10-01 00:00:00.000000000 -0100
+++ drivers/gpu/drm/amd/amdgpu/psp_v3_1.c 2022-10-01 00:00:00.000000000 -0100
@@ -107,21 +107,21 @@
 	psp_copy_fw(psp, psp->sys.start_addr, psp->sys.size_bytes);
 	/* Provide the sys driver to bootloader */
 	WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
 		(uint32_t)(psp->fw_pri_mc_addr >> 20));
 	psp_gfxdrv_command_reg = PSP_BL__LOAD_SYSDRV;
 	WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_35,
 		psp_gfxdrv_command_reg);
 	/* there might be handshake issue with hardware which needs delay */
-	mdelay(20);
+	mdelay(500);
 	ret = psp_wait_for(psp, SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_35),
 		0x80000000, 0x80000000, false);
 	return ret;
 }
 static int psp_v3_1_bootloader_load_sos(struct psp_context *psp)
 {
 	int ret;
@@ -146,21 +146,21 @@
 	psp_copy_fw(psp, psp->sos.start_addr, psp->sos.size_bytes);
 	/* Provide the PSP secure OS to bootloader */
 	WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_36,
 		(uint32_t)(psp->fw_pri_mc_addr >> 20));
 	psp_gfxdrv_command_reg = PSP_BL__LOAD_SOSDRV;
 	WREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_35,
 		psp_gfxdrv_command_reg);
 	/* there might be handshake issue with hardware which needs delay */
-	mdelay(20);
+	mdelay(500);
 	ret = psp_wait_for(psp, SOC15_REG_OFFSET(MP0, 0, mmMP0_SMN_C2PMSG_81),
 		RREG32_SOC15(MP0, 0, mmMP0_SMN_C2PMSG_81),
 		0, true);
 	return ret;
 }
 static int psp_v3_1_ring_init(struct psp_context *psp,
 	enum psp_ring_type ring_type)
 {
 	int ret = 0;
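To illustrate why a longer fixed delay can rescue a handshake like the one patched above, here is a minimal user-space sketch in C. It is not driver code: struct fake_dev, fake_mdelay(), fake_ready() and load_with_delay() are hypothetical names invented for this illustration, the clock is simulated rather than real, and the figures used in the usage note (firmware needing 300 ms, a 100 ms polling budget) are assumptions, not measurements from the card or values taken from amdgpu.

```c
#include <stdbool.h>

/* Hypothetical simulated device: it reports "ready" once the simulated
 * clock reaches ready_at_ms. Nothing here touches real hardware. */
struct fake_dev {
    unsigned now_ms;      /* simulated clock, in milliseconds */
    unsigned ready_at_ms; /* moment the firmware signals completion */
};

/* Stand-in for a busy-wait delay: only advances the simulated clock. */
static void fake_mdelay(struct fake_dev *d, unsigned ms)
{
    d->now_ms += ms;
}

/* Stand-in for reading the status register. */
static bool fake_ready(const struct fake_dev *d)
{
    return d->now_ms >= d->ready_at_ms;
}

/* Fixed delay followed by a bounded poll, the general shape of the
 * patched code path. Returns 0 on success, -1 if the polling budget
 * runs out before the device reports ready. */
int load_with_delay(unsigned firmware_ms, unsigned delay_ms,
                    unsigned poll_budget_ms)
{
    struct fake_dev d = { .now_ms = 0, .ready_at_ms = firmware_ms };

    fake_mdelay(&d, delay_ms);              /* the value the patch changes */

    for (unsigned waited = 0; waited <= poll_budget_ms; waited++) {
        if (fake_ready(&d))
            return 0;
        fake_mdelay(&d, 1);                 /* poll once per simulated ms */
    }
    return -1;                              /* gave up: simulated timeout */
}
```

Under those assumed numbers, load_with_delay(300, 20, 100) returns -1 because polling stops at 120 ms, while load_with_delay(300, 500, 100) returns 0 because the device is already ready when polling starts. The real driver's polling budget and the card's actual firmware start-up time may differ.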
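With this patch applied, the card initializes successfully. Two more things to keep in mind when using this card:
- Do not use the OpenCL implementation provided by the system. In my case it crashed the card and forced a GPU reset. To be safe, I disabled it by renaming mesa.icd in /etc/OpenCL/vendors/.
- For some reason, my GPU's BIOS limits power to 110 W instead of the rated 300 W. The workaround for that also involves a kernel patch.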