radeon error: GPU lockup: ring 0 stalled for more than x msec

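I have a freshly installed machine running Debian Buster. The GPU is a Radeon FirePro W2100. After a few hours of use, the machine suddenly freezes, the display switches to "white noise", and the machine becomes unusable.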

In the logs, I see many errors like these:

kernel: radeon 0000:65:00.0: ring 0 stalled for more than 10240msec
kernel: radeon 0000:65:00.0: GPU lockup (current fence id 0x0000000000039bff last fence id 0x0000000000039c42 on ring 0)
kernel: radeon 0000:65:00.0: failed to get a new IB (-35)
kernel: [drm:ffffffff816219d0] *ERROR* Couldn't update BO_VA (-35)
kernel: radeon 0000:65:00.0: failed to get a new IB (-35)

And then:

kernel: radeon 0000:65:00.0: ring 0 stalled for more than 10032msec
kernel: radeon 0000:65:00.0: GPU lockup (current fence id 0x0000000000039bff last fence id 0x0000000000039c42 on ring 0)

What do these errors mean, and how can I fix the problem?

Is this a hardware or a software issue?

Answer 1

I get radeon 0000:04:00.0: ring 0 stalled for more than 10240msec on my [AMD/ATI] RV620 GL [FirePro 2450] after running the Opera web browser for a few minutes under Ubuntu 20.04.5 LTS. Firefox and every other program run without problems; only Opera triggers it.

[128524.943553] radeon 0000:04:00.0: ring 0 stalled for more than 10240msec
[128524.943565] radeon 0000:04:00.0: GPU lockup (current fence id 0x000000000029caf6 last fence id 0x000000000029cafc on ring 0)
[128524.955392] radeon 0000:04:00.0: Saved 185 dwords of commands on ring 0.
[128524.955409] radeon 0000:04:00.0: GPU softreset: 0x00000009
[128524.955413] radeon 0000:04:00.0:   R_008010_GRBM_STATUS      = 0xA2303030
[128524.955417] radeon 0000:04:00.0:   R_008014_GRBM_STATUS2     = 0x00000003
[128524.955420] radeon 0000:04:00.0:   R_000E50_SRBM_STATUS      = 0x200010C0
[128524.955423] radeon 0000:04:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[128524.955426] radeon 0000:04:00.0:   R_008678_CP_STALLED_STAT2 = 0x00008002
[128524.955429] radeon 0000:04:00.0:   R_00867C_CP_BUSY_STAT     = 0x00008086
[128524.955432] radeon 0000:04:00.0:   R_008680_CP_STAT          = 0x80018645
[128524.955435] radeon 0000:04:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[128525.013038] radeon 0000:04:00.0: R_008020_GRBM_SOFT_RESET=0x00007FEF
[128525.013097] radeon 0000:04:00.0: SRBM_SOFT_RESET=0x00000100
[128525.015187] radeon 0000:04:00.0:   R_008010_GRBM_STATUS      = 0xA0003030
[128525.015191] radeon 0000:04:00.0:   R_008014_GRBM_STATUS2     = 0x00000003
[128525.015195] radeon 0000:04:00.0:   R_000E50_SRBM_STATUS      = 0x200080C0
[128525.015198] radeon 0000:04:00.0:   R_008674_CP_STALLED_STAT1 = 0x00000000
[128525.015201] radeon 0000:04:00.0:   R_008678_CP_STALLED_STAT2 = 0x00000000
[128525.015204] radeon 0000:04:00.0:   R_00867C_CP_BUSY_STAT     = 0x00000000
[128525.015207] radeon 0000:04:00.0:   R_008680_CP_STAT          = 0x80100000
[128525.015210] radeon 0000:04:00.0:   R_00D034_DMA_STATUS_REG   = 0x44C83D57
[128525.015220] radeon 0000:04:00.0: GPU reset succeeded, trying to resume
[128525.031584] [drm] PCIE gen 2 link speeds already enabled
[128525.034184] [drm] PCIE GART of 512M enabled (table at 0x0000000000142000).
[128525.034222] radeon 0000:04:00.0: WB enabled
[128525.034224] radeon 0000:04:00.0: fence driver on ring 0 use gpu addr 0x0000000010000c00
[128525.034579] radeon 0000:04:00.0: fence driver on ring 5 use gpu addr 0x00000000000521d0
[128525.034797] debugfs: File 'radeon_ring_gfx' in directory '0' already present!
[128525.066237] [drm] ring test on 0 succeeded in 1 usecs
[128525.066242] debugfs: File 'radeon_ring_uvd' in directory '0' already present!
[128525.240884] [drm] ring test on 5 succeeded in 1 usecs
[128525.240893] [drm] UVD initialized successfully.
[128535.695467] radeon 0000:04:00.0: ring 0 stalled for more than 10456msec
[128535.695479] radeon 0000:04:00.0: GPU lockup (current fence id 0x000000000029caf8 last fence id 0x000000000029cafc on ring 0)
[128535.697433] [drm:r600_ib_test [radeon]] *ERROR* radeon: fence wait failed (-35).
[128535.697551] [drm:radeon_ib_ring_tests [radeon]] *ERROR* radeon: failed testing IB on GFX ring (-35).
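
The (-35) at the end of several of these messages is a negated Linux errno value. Here is a minimal Python sketch for decoding such codes. The table is hardcoded from the Linux errno headers (35 is EDEADLK on x86 and most other architectures) so the result does not depend on the machine running the script, and the name decode_kernel_error is hypothetical. As far as I know, radeon returns -EDEADLK from a fence wait once it has flagged a lockup, so these lines are probably a consequence of the stall rather than a separate fault; treat that as an assumption.

```python
import re

# Linux errno values (asm-generic headers). Hardcoded on purpose: Python's
# errno module reflects the host OS, where 35 can mean something else.
LINUX_ERRNO = {
    12: ("ENOMEM", "Out of memory"),
    16: ("EBUSY", "Device or resource busy"),
    22: ("EINVAL", "Invalid argument"),
    35: ("EDEADLK", "Resource deadlock avoided"),
    110: ("ETIMEDOUT", "Connection timed out"),
}

# Matches a trailing "(-NN)" or "(-NN)." at the end of a kernel message.
ERR_CODE = re.compile(r"\((-\d+)\)\.?\s*$")


def decode_kernel_error(line):
    """Return (code, name, description) for a trailing (-NN), or None."""
    match = ERR_CODE.search(line)
    if match is None:
        return None
    code = -int(match.group(1))
    name, text = LINUX_ERRNO.get(code, ("UNKNOWN", "not in this table"))
    return code, name, text


if __name__ == "__main__":
    sample = "radeon 0000:04:00.0: failed to get a new IB (-35)"
    print(decode_kernel_error(sample))
```

Lines without a trailing error code, such as the "GPU lockup (current fence id ...)" message, return None.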

Answer 2

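This may actually be a hardware failure. I got this on my PC with an AMD ATI Radeon HD 8670 GPU while gaming on Arch Linux with kernel 6.3.1-zen1-1-zen. It's an HP EliteDesk, FYI. I tried downgrading the kernel to the latest LTS and to the LTS before that (5.10 iirc), but it still crashed after a few minutes of gaming.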

I happen to have a Dell home server running the same OS and kernel (Arch with zen), and it has an AMD ATI Radeon HD 8570 GPU. It is essentially the same card, but with slightly less onboard DDR5 iirc.

Well, I swapped the graphics cards (the 8570 is now in the HP motherboard and the 8670 is in the Dell), and the 8570 has had no problems at all while gaming.

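So... with all the same hardware/software/firmware/drivers, the 8570 works and the 8670 does not. All I did was swap the cards; no driver reinstall or anything. I should also note that the game used to work fine on the 8670, so I think the card simply died one day.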

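I know hardware failures are rare, but if this isn't one, I don't know what is. Sorry to be the bearer of possibly bad news. As for me, I don't game on the home server, so this swap works out fine.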

Here is one of the dmesg logs from the 8670 crashing on the HP:

...
[32776.529276] radeon 0000:0b:00.0: ring 0 stalled for more than 28224msec
[32776.529282] radeon 0000:0b:00.0: GPU lockup (current fence id 0x0000000000108667 last fence id 0x00000000001086ba on ring 0)
[32776.673264] radeon 0000:0b:00.0: ring 3 stalled for more than 28228msec
[32776.673268] radeon 0000:0b:00.0: GPU lockup (current fence id 0x00000000000380db last fence id 0x0000000000038154 on ring 3)
[32777.033251] radeon 0000:0b:00.0: ring 0 stalled for more than 28728msec
[32777.033259] radeon 0000:0b:00.0: GPU lockup (current fence id 0x0000000000108667 last fence id 0x00000000001086bb on ring 0)
[32777.177236] radeon 0000:0b:00.0: ring 3 stalled for more than 28732msec
[32777.177240] radeon 0000:0b:00.0: GPU lockup (current fence id 0x00000000000380db last fence id 0x0000000000038156 on ring 3)
[32777.537217] radeon 0000:0b:00.0: ring 0 stalled for more than 29232msec
[32777.537221] radeon 0000:0b:00.0: GPU lockup (current fence id 0x0000000000108667 last fence id 0x00000000001086bc on ring 0)
[32777.681206] radeon 0000:0b:00.0: ring 3 stalled for more than 29236msec
[32777.681209] radeon 0000:0b:00.0: GPU lockup (current fence id 0x00000000000380db last fence id 0x0000000000038159 on ring 3)
[32778.041191] radeon 0000:0b:00.0: ring 0 stalled for more than 29736msec
[32778.041194] radeon 0000:0b:00.0: GPU lockup (current fence id 0x0000000000108667 last fence id 0x00000000001086bd on ring 0)
[32778.185183] radeon 0000:0b:00.0: ring 3 stalled for more than 29740msec
[32778.185186] radeon 0000:0b:00.0: GPU lockup (current fence id 0x00000000000380db last fence id 0x000000000003815a on ring 3)
[32779.776047] BUG: unable to handle page fault for address: ffffbdd0c13e9ffc
[32779.776052] #PF: supervisor read access in kernel mode
[32779.776054] #PF: error_code(0x0000) - not-present page
[32779.776055] PGD 100000067 P4D 100000067 PUD 0 
[32779.776058] Oops: 0000 [#1] PREEMPT SMP NOPTI
[32779.776061] CPU: 8 PID: 157222 Comm: openmw Tainted: G S                 6.1.12-zen1-1-zen #1 f86a89fe584efe7bcf920c69db3728bed4671799
[32779.776064] Hardware name: HP HP EliteDesk 705 G5 SFF/8618, BIOS R09 Ver. 02.02.02 11/15/2019
[32779.776065] RIP: 0010:radeon_ring_backup+0xc2/0x160 [radeon]
[32779.776196] Code: 49 c1 e6 02 4c 89 f7 e8 9c cc ab f5 49 89 45 00 48 89 c2 48 85 c0 74 5f 48 8b 4b 10 41 8d 47 01 45 89 ff 23 43 5c 4a 8d 34 b9 <8b> 36 89 32 41 83 fc 01 74 29 ba 04 00 00 00 eb 04 48 8b 4b 10 8d
[32779.776197] RSP: 0018:ffffbdcccfc5bbd8 EFLAGS: 00010246
[32779.776199] RAX: 0000000000000000 RBX: ffff9460e434d620 RCX: ffffbdccc13ea000
[32779.776201] RDX: ffff9465dbd00000 RSI: ffffbdd0c13e9ffc RDI: 00000000000392d7
[32779.776202] RBP: ffff9460e434d600 R08: 00000000000392d0 R09: 0000000000000006
[32779.776203] R10: fffff6a4d96f4000 R11: 000000000000577f R12: 000000000003dd71
[32779.776204] R13: ffffbdcccfc5bc50 R14: 00000000000f75c4 R15: 00000000ffffffff
[32779.776205] FS:  00007fbd98eb96c0(0000) GS:ffff94677ec00000(0000) knlGS:0000000000000000
[32779.776207] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[32779.776208] CR2: ffffbdd0c13e9ffc CR3: 0000000490706000 CR4: 0000000000350ee0
[32779.776210] Call Trace:
[32779.776212]  <TASK>
[32779.776213]  radeon_gpu_reset+0xf7/0x2f0 [radeon de372908aa1ea62ea129bf192d817412c67e128b]
[32779.776243]  radeon_gem_wait_idle_ioctl+0xb8/0x100 [radeon de372908aa1ea62ea129bf192d817412c67e128b]
[32779.776273]  ? radeon_gem_busy_ioctl+0xb0/0xb0 [radeon de372908aa1ea62ea129bf192d817412c67e128b]
[32779.776302]  drm_ioctl_kernel+0xcd/0x170
[32779.776306]  drm_ioctl+0x1eb/0x450
[32779.776308]  ? radeon_gem_busy_ioctl+0xb0/0xb0 [radeon de372908aa1ea62ea129bf192d817412c67e128b]
[32779.776337]  radeon_drm_ioctl+0x4d/0x80 [radeon de372908aa1ea62ea129bf192d817412c67e128b]
[32779.776364]  __x64_sys_ioctl+0x94/0xd0
[32779.776369]  do_syscall_64+0x5f/0x90
[32779.776373]  ? do_syscall_64+0x6b/0x90
[32779.776375]  ? syscall_exit_to_user_mode+0x2c/0x1d0
[32779.776378]  ? syscall_exit_to_user_mode+0x2c/0x1d0
[32779.776380]  ? do_syscall_64+0x6b/0x90
[32779.776382]  ? syscall_exit_to_user_mode+0x2c/0x1d0
[32779.776384]  ? do_syscall_64+0x6b/0x90
[32779.776385]  ? do_syscall_64+0x6b/0x90
[32779.776387]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[32779.776390] RIP: 0033:0x7fbdb591553f
[32779.776418] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00 00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d 00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[32779.776420] RSP: 002b:00007fbd98eb80f0 EFLAGS: 00200246 ORIG_RAX: 0000000000000010
[32779.776422] RAX: ffffffffffffffda RBX: 00007fbd7d74eb80 RCX: 00007fbdb591553f
[32779.776423] RDX: 00007fbd98eb8190 RSI: 0000000040086464 RDI: 0000000000000010
[32779.776425] RBP: 00007fbd98eb8190 R08: 0000000000000000 R09: ffffffffffffffff
[32779.776426] R10: 0000000000000000 R11: 0000000000200246 R12: 0000000040086464
[32779.776427] R13: 0000000000000010 R14: 000055d27885abd0 R15: 000055d278a375d8
[32779.776429]  </TASK>
[32779.776430] Modules linked in: rfcomm xt_nat veth nft_chain_nat xt_MASQUERADE nf_nat nf_conntrack_netlink br_netfilter bridge stp llc rpcsec_gss_krb5 rpcrdma rdma_cm iw_cm nfsv4 ib_cm dns_resolver ib_core nfs fscache wireguard netfs curve25519_x86_64 libchacha20poly1305 chacha_x86_64 poly1305_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel overlay cmac algif_hash algif_skcipher af_alg bnep isofs cdrom amdgpu gpu_sched drm_buddy squashfs vfat fat iwlmvm mac80211 snd_hda_codec_conexant snd_hda_codec_generic libarc4 ledtrig_audio snd_hda_codec_hdmi intel_rapl_msr radeon snd_hda_intel intel_rapl_common btusb edac_mce_amd btrtl snd_intel_dspcfg btbcm snd_intel_sdw_acpi drm_ttm_helper kvm_amd snd_hda_codec btintel iwlwifi snd_hda_core hp_wmi btmtk ttm snd_hwdep sparse_keymap kvm platform_profile wmi_bmof sp5100_tco bluetooth snd_pcm irqbypass r8169 ucsi_acpi drm_display_helper video cfg80211 psmouse rapl typec_ucsi pcspkr snd_timer realtek k10temp i2c_piix4 ecdh_generic cec
[32779.776479]  ipmi_devintf typec snd mdio_devres soundcore ipmi_msghandler ip6t_REJECT rfkill libphy roles nf_reject_ipv6 joydev wmi mousedev gpio_amdpt xt_hl gpio_generic acpi_cpufreq ip6_tables ip6t_rt mac_hid ipt_REJECT nf_reject_ipv4 xt_LOG nf_log_syslog xt_multiport nft_limit xt_limit xt_addrtype xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables libcrc32c nfnetlink nfsd auth_rpcgss nfs_acl lockd grace sg crypto_user sunrpc loop fuse ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 dm_crypt cbc encrypted_keys trusted asn1_encoder tee usbhid uas usb_storage dm_mod crct10dif_pclmul crc32_pclmul crc32c_intel serio_raw polyval_clmulni atkbd polyval_generic gf128mul libps2 ghash_clmulni_intel vivaldi_fmap sha512_ssse3 nvme aesni_intel crypto_simd nvme_core ccp cryptd xhci_pci i8042 xhci_pci_renesas nvme_common serio
[32779.776522] CR2: ffffbdd0c13e9ffc
[32779.776523] ---[ end trace 0000000000000000 ]---
[32779.776524] RIP: 0010:radeon_ring_backup+0xc2/0x160 [radeon]
[32779.776554] Code: 49 c1 e6 02 4c 89 f7 e8 9c cc ab f5 49 89 45 00 48 89 c2 48 85 c0 74 5f 48 8b 4b 10 41 8d 47 01 45 89 ff 23 43 5c 4a 8d 34 b9 <8b> 36 89 32 41 83 fc 01 74 29 ba 04 00 00 00 eb 04 48 8b 4b 10 8d
[32779.776555] RSP: 0018:ffffbdcccfc5bbd8 EFLAGS: 00010246
[32779.776557] RAX: 0000000000000000 RBX: ffff9460e434d620 RCX: ffffbdccc13ea000
[32779.776558] RDX: ffff9465dbd00000 RSI: ffffbdd0c13e9ffc RDI: 00000000000392d7
[32779.776559] RBP: ffff9460e434d600 R08: 00000000000392d0 R09: 0000000000000006
[32779.776560] R10: fffff6a4d96f4000 R11: 000000000000577f R12: 000000000003dd71
[32779.776561] R13: ffffbdcccfc5bc50 R14: 00000000000f75c4 R15: 00000000ffffffff
[32779.776562] FS:  00007fbd98eb96c0(0000) GS:ffff94677ec00000(0000) knlGS:0000000000000000
[32779.776563] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[32779.776565] CR2: ffffbdd0c13e9ffc CR3: 0000000490706000 CR4: 0000000000350ee0
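
To make the lockup lines easier to read, here is a small Python sketch that groups them by ring. It assumes "current fence id" is the last fence the GPU completed and "last fence id" is the last one the driver submitted, so their difference is the amount of work still pending; that reading, and the name summarize_lockups, are my assumptions. If a ring's current fence id never changes between reports, as in the log above, the GPU has made no progress on that ring.

```python
import re

# Matches the radeon "GPU lockup" message and captures both fence ids and the ring.
LOCKUP = re.compile(
    r"radeon (?P<dev>[0-9a-f:.]+): GPU lockup "
    r"\(current fence id 0x(?P<cur>[0-9a-f]+) "
    r"last fence id 0x(?P<last>[0-9a-f]+) on ring (?P<ring>\d+)\)"
)


def summarize_lockups(log_text):
    """Return {ring: {"reports", "current", "max_pending"}} for a dmesg dump."""
    summary = {}
    for match in LOCKUP.finditer(log_text):
        ring = int(match.group("ring"))
        current = int(match.group("cur"), 16)
        last = int(match.group("last"), 16)
        entry = summary.setdefault(
            ring, {"reports": 0, "current": set(), "max_pending": 0}
        )
        entry["reports"] += 1
        entry["current"].add(current)  # one distinct value means no progress
        entry["max_pending"] = max(entry["max_pending"], last - current)
    return summary


if __name__ == "__main__":
    import sys

    # Usage: dmesg | python3 this_script.py
    for ring, info in sorted(summarize_lockups(sys.stdin.read()).items()):
        stuck = "stuck" if len(info["current"]) == 1 else "advancing"
        print(f"ring {ring}: {info['reports']} reports, "
              f"up to {info['max_pending']} fences pending, {stuck}")
```

This only summarizes what the driver already printed; it cannot tell a failing card from a driver or application bug, which is what the card swap above was for.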
