AMD GPU (RX6600) hangs on Jammy 22.04.1

The system goes to a black screen several times a week. I can still connect over SSH, but the display stays blank until I do a hard reboot.

The dmesg logs contain these AMD-related entries:

[    9.574643] amdgpu 0000:2f:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[    9.589995] amdgpu 0000:2f:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[    9.589998] amdgpu 0000:2f:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    9.589999] amdgpu 0000:2f:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    9.589999] amdgpu 0000:2f:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    9.590000] amdgpu 0000:2f:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    9.590001] amdgpu 0000:2f:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    9.590001] amdgpu 0000:2f:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    9.590002] amdgpu 0000:2f:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    9.590002] amdgpu 0000:2f:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    9.590003] amdgpu 0000:2f:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[    9.590003] amdgpu 0000:2f:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[    9.590004] amdgpu 0000:2f:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[    9.590004] amdgpu 0000:2f:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
[    9.590005] amdgpu 0000:2f:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
[    9.590006] amdgpu 0000:2f:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
[    9.590006] amdgpu 0000:2f:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
[    9.591822] [drm] Initialized amdgpu 3.42.0 20150101 for 0000:2f:00.0 on minor 0

[405336.815466] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[405339.384808] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[405342.936641] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: status=3
[405343.207409] [drm:dc_dmub_srv_wait_idle [amdgpu]] *ERROR* Error waiting for DMUB idle: 

[405382.220639] amdgpu 0000:2f:00.0: amdgpu: Failed to disable gfxoff!

[405397.798055] amdgpu 0000:2f:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[405397.798141] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KGQ disable failed
[405398.078473] amdgpu 0000:2f:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_2.1.0 test failed (-110)
[405398.078540] [drm:gfx_v10_0_hw_fini [amdgpu]] *ERROR* KCQ disable failed
[405403.035730] amdgpu 0000:2f:00.0: amdgpu: SMU: I'm not done with your previous command!
[405403.035735] amdgpu 0000:2f:00.0: amdgpu: Failed to disable smu features.
[405403.035738] amdgpu 0000:2f:00.0: amdgpu: Fail to disable dpm features!
[405403.035739] [drm:amdgpu_device_ip_suspend_phase2 [amdgpu]] *ERROR* suspend of IP block <smu> failed -62
[405403.048480] [drm] free PSP TMR buffer
[405404.144031] [drm] psp gfx command DESTROY_TMR(0x7) failed and response status is (0x80000306)
[405404.165017] amdgpu 0000:2f:00.0: amdgpu: MODE1 reset
[405404.165020] amdgpu 0000:2f:00.0: amdgpu: GPU mode1 reset
[405404.165099] amdgpu 0000:2f:00.0: amdgpu: GPU smu mode1 reset
[405409.277640] amdgpu 0000:2f:00.0: amdgpu: SMU: I'm not done with your previous command!
[405409.277644] amdgpu 0000:2f:00.0: amdgpu: GPU mode1 reset failed
[405409.277749] amdgpu 0000:2f:00.0: amdgpu: ASIC reset failed with error, -62 for drm dev, 0000:2f:00.0
[405420.341892] amdgpu 0000:2f:00.0: amdgpu: GPU reset succeeded, trying to resume
[405420.342197] [drm] PCIE GART of 512M enabled (table at 0x0000008001FA4000).
[405420.342234] [drm] VRAM is lost due to GPU reset!
[405420.343561] [drm] PSP is resuming...
[405421.458583] [drm] failed to load ucode SMC(0x18)
[405421.458601] [drm] psp gfx command LOAD_IP_FW(0x6) failed and response status is (0x80000306)
[405421.458606] [drm] reserve 0xa00000 from 0x81fe000000 for PSP TMR
[405421.695431] amdgpu 0000:2f:00.0: amdgpu: RAS: optional ras ta ucode is not available
[405421.708606] amdgpu 0000:2f:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[405421.708609] amdgpu 0000:2f:00.0: amdgpu: SMU is resuming...
[405421.708613] amdgpu 0000:2f:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000013, smu fw version = 0x003b2900 (59.41.0)
[405421.708615] amdgpu 0000:2f:00.0: amdgpu: SMU driver if version not matched
[405426.947166] amdgpu 0000:2f:00.0: amdgpu: SMU: I'm not done with your previous command!
[405426.947171] amdgpu 0000:2f:00.0: amdgpu: Failed to SetDriverDramAddr!
[405426.947172] amdgpu 0000:2f:00.0: amdgpu: Failed to setup smc hw!
[405426.947173] [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <smu> failed -62
[405426.947276] amdgpu 0000:2f:00.0: amdgpu: GPU reset(2) failed
[405426.965191] snd_hda_intel 0000:2f:00.1: refused to change power state from D3hot to D0
[405427.069838] snd_hda_intel 0000:2f:00.1: CORB reset timeout#2, CORBRP = 65535
[405427.069851] amdgpu 0000:2f:00.0: amdgpu: GPU reset end with ret = -62
[405437.169457] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring sdma1 timeout, signaled seq=244959, emitted seq=244959
[405437.169575] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process  pid 0 thread  pid 0
[405437.169662] amdgpu 0000:2f:00.0: amdgpu: GPU reset begin!
[405629.935225] INFO: task VizCompositorTh:5424 blocked for more than 120 seconds.
[405629.935230]       Not tainted 5.15.0-47-generic #51-Ubuntu
[405629.935231] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[405629.935232] task:VizCompositorTh state:D stack:    0 pid: 5424 ppid:  5332 flags:0x00004002
[405629.935235] Call Trace:
[405629.935236]  <TASK>
[405629.935238]  __schedule+0x23d/0x590
[405629.935242]  schedule+0x4e/0xc0
[405629.935243]  schedule_timeout+0x103/0x140
[405629.935245]  ? kmem_cache_free+0x26c/0x290
[405629.935247]  dma_fence_default_wait+0x1c8/0x1f0
[405629.935250]  ? dma_fence_free+0x30/0x30
[405629.935251]  dma_fence_wait_timeout+0xbf/0xe0
[405629.935254]  drm_sched_entity_fini+0xd7/0x250 [gpu_sched]
[405629.935257]  drm_sched_entity_destroy+0x20/0x30 [gpu_sched]
[405629.935259]  amdgpu_vm_fini+0x2d6/0x4c0 [amdgpu]
[405629.935352]  ? idr_destroy+0x81/0xd0
[405629.935354]  amdgpu_driver_postclose_kms+0x179/0x240 [amdgpu]
[405629.935421]  ? idr_destroy+0x81/0xd0
[405629.935424]  drm_file_free.part.0+0x1da/0x230 [drm]
[405629.935435]  drm_close_helper.isra.0+0x65/0x70 [drm]
[405629.935445]  drm_release+0x6a/0x120 [drm]
[405629.935454]  __fput+0x9f/0x260
[405629.935457]  ____fput+0xe/0x20
[405629.935458]  task_work_run+0x6d/0xb0
[405629.935460]  do_exit+0x21b/0x3c0
[405629.935462]  do_group_exit+0x3b/0xb0
[405629.935463]  get_signal+0x150/0x900
[405629.935465]  arch_do_signal_or_restart+0xde/0x100
[405629.935467]  ? schedule+0x4e/0xc0
[405629.935468]  exit_to_user_mode_loop+0xc4/0x160
[405629.935470]  exit_to_user_mode_prepare+0xa0/0xb0
[405629.935472]  irqentry_exit_to_user_mode+0x9/0x20
[405629.935474]  irqentry_exit+0x1d/0x30
[405629.935475]  common_interrupt+0x55/0xa0
[405629.935476]  asm_common_interrupt+0x26/0x40
[405629.935477] RIP: 0033:0x55c8c9266d64
[405629.935479] RSP: 002b:00007f4b1320bca0 EFLAGS: 00000206
[405629.935480] RAX: 0000000000000000 RBX: 0000288c00ad1000 RCX: 00007f4b1320cc28
[405629.935481] RDX: 000000000720c851 RSI: 00007f4b1320cc28 RDI: 0000288c00ad1000
[405629.935482] RBP: 00007f4b1320c4d0 R08: 0000000000000000 R09: 000055c8c91501e0
[405629.935482] R10: 0000000000000000 R11: aaaaaaaaaaaaaaaa R12: 0000288c00ad1000
[405629.935483] R13: 0000000000000000 R14: 0000000000000000 R15: 000000000720c851
[405629.935485]  </TASK>
[405629.935561] INFO: task kworker/14:1:544425 blocked for more than 120 seconds.
[405629.935563]       Not tainted 5.15.0-47-generic #51-Ubuntu
[405629.935563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[405629.935564] task:kworker/14:1    state:D stack:    0 pid:544425 ppid:     2 flags:0x00004000
[405629.935566] Workqueue: events drm_sched_job_timedout [gpu_sched]
[405629.935569] Call Trace:
[405629.935569]  <TASK>
[405629.935570]  __schedule+0x23d/0x590
[405629.935571]  schedule+0x4e/0xc0
[405629.935572]  schedule_timeout+0x103/0x140
[405629.935573]  ? task_rq_lock+0x5f/0x160
[405629.935575]  dma_fence_default_wait+0x1c8/0x1f0
[405629.935576]  ? dma_fence_free+0x30/0x30
[405629.935578]  dma_fence_wait_timeout+0xbf/0xe0
[405629.935579]  drm_sched_stop+0xfc/0x170 [gpu_sched]
[405629.935581]  amdgpu_device_gpu_recover.cold+0x861/0x8ff [amdgpu]
[405629.935698]  amdgpu_job_timedout+0x153/0x180 [amdgpu]
[405629.935794]  drm_sched_job_timedout+0x6f/0x120 [gpu_sched]
[405629.935796]  process_one_work+0x22b/0x3d0
[405629.935798]  worker_thread+0x53/0x420
[405629.935799]  ? process_one_work+0x3d0/0x3d0
[405629.935800]  kthread+0x12a/0x150
[405629.935801]  ? set_kthread_struct+0x50/0x50
[405629.935802]  ret_from_fork+0x22/0x30
[405629.935805]  </TASK>
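
In case it helps anyone reading a capture this long, here is a minimal sketch of a filter that keeps only the amdgpu lines that look like failures and pulls out the SMU interface versions the driver prints. It assumes the log was saved as plain text with the usual `[seconds.micros]` timestamps; the function names `amdgpu_errors` and `smu_interface_versions` are made up for this example and are not part of any tool.

```python
import re

# A dmesg line with a "[   seconds.micros]" timestamp prefix.
LINE_RE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")

# The SMU interface version report that amdgpu prints on init/resume.
SMU_RE = re.compile(
    r"smu driver if version = (0x[0-9a-fA-F]+), "
    r"smu fw if version = (0x[0-9a-fA-F]+)"
)


def amdgpu_errors(dmesg_text):
    """Return (timestamp, message) pairs for amdgpu lines that look like failures.

    "Looks like a failure" is a heuristic chosen for this sketch:
    the message mentions amdgpu and contains "*ERROR*" or "fail".
    """
    hits = []
    for line in dmesg_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue  # skip blank or truncated lines without a timestamp
        msg = m.group("msg")
        if "amdgpu" not in msg:
            continue
        if "*ERROR*" in msg or "fail" in msg.lower():
            hits.append((float(m.group("ts")), msg))
    return hits


def smu_interface_versions(dmesg_text):
    """Return (driver_if, firmware_if) as ints, or None if the line is absent."""
    m = SMU_RE.search(dmesg_text)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16)
```

Run against the log above, the second function would report the two interface versions from the line the driver follows with "SMU driver if version not matched" (0x0f for the driver, 0x13 for the firmware). Whether that mismatch has anything to do with the hang is not something this sketch can tell.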
