PCI errors reported for RTX A6000

Whenever the machine boots or the GPU is under load, I run into a problem with an NVIDIA RTX A6000.

dmesg reports "AER: buffer overflow in recovery for" on three separate PCI addresses:

41:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
41:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
40:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge

AER also reports that these errors were corrected. However, the messages show that snd_hda_intel 0000:41:00.1 is affected as well.

[    5.301395] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301397] nvidia 0000:41:00.0:    [ 0] RxErr                  (First)
[    5.301399] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301401] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301402] snd_hda_intel 0000:41:00.1:    [ 0] RxErr                  (First)
[    5.301403] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301405] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301405] nvidia 0000:41:00.0:    [ 0] RxErr                  (First)
[    5.301406] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301407] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301408] snd_hda_intel 0000:41:00.1:    [ 0] RxErr                  (First)
[    5.301409] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301410] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301411] nvidia 0000:41:00.0:    [ 0] RxErr                  (First)
[    5.301411] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301413] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301414] nvidia 0000:41:00.0:    [ 0] RxErr                  (First)
[    5.301414] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301416] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301416] snd_hda_intel 0000:41:00.1:    [ 0] RxErr                  (First)
[    5.301417] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301418] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301419] nvidia 0000:41:00.0:    [ 0] RxErr                  (First)
[    5.301420] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301421] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301422] snd_hda_intel 0000:41:00.1:    [ 0] RxErr                  (First)
[    5.301422] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301424] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301424] nvidia 0000:41:00.0:    [ 0] RxErr                  (First)
[    5.301425] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301426] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301427] snd_hda_intel 0000:41:00.1:    [ 0] RxErr                  (First)
[    5.301428] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301429] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301430] nvidia 0000:41:00.0:    [ 0] RxErr                  (First)
[    5.301430] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301432] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301432] snd_hda_intel 0000:41:00.1:    [ 0] RxErr                  (First)
[    5.301433] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301435] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301435] nvidia 0000:41:00.0:    [ 0] RxErr                  (First)
[    5.301436] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[    5.301437] pcieport 0000:40:01.1: AER: aer_status: 0x00001000, aer_mask: 0x00000000
[    5.301438] pcieport 0000:40:01.1:    [12] Timeout               
[    5.301439] pcieport 0000:40:01.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
[    5.301440] pcieport 0000:40:01.1: AER: aer_status: 0x00001000, aer_mask: 0x00000000
[    5.301441] pcieport 0000:40:01.1:    [12] Timeout               
[    5.301442] pcieport 0000:40:01.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
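
For anyone reading along, the aer_status values in the log can be decoded by bit position. The following is a minimal sketch, not part of the original post. It assumes the status words are Correctable Error Status values (the names RxErr and Timeout printed by the kernel suggest this), and the function names and the three-line SAMPLE excerpt are illustrative only.

```python
import re
from collections import Counter

# Correctable Error Status bit positions from the PCIe specification,
# paired with the short names the Linux AER driver prints.
CORRECTABLE_BITS = {
    0: "RxErr (Receiver Error)",
    6: "BadTLP (Bad TLP)",
    7: "BadDLLP (Bad DLLP)",
    8: "Rollover (REPLAY_NUM Rollover)",
    12: "Timeout (Replay Timer Timeout)",
    13: "NonFatalErr (Advisory Non-Fatal Error)",
    14: "CorrIntErr (Corrected Internal Error)",
    15: "HeaderOF (Header Log Overflow)",
}

# Matches lines such as:
# [    5.301395] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: ...
STATUS_LINE = re.compile(
    r"\]\s+(?P<driver>\S+)\s+"
    r"(?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]):"
    r"\s+AER:\s+aer_status:\s+0x(?P<status>[0-9a-fA-F]+)"
)


def decode_status(status):
    """Return the names of all known bits set in a correctable status word."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if status & (1 << bit)]


def summarize(log_text):
    """Count decoded errors per (PCI address, error name) in dmesg text."""
    counts = Counter()
    for line in log_text.splitlines():
        match = STATUS_LINE.search(line)
        if match:
            for name in decode_status(int(match.group("status"), 16)):
                counts[(match.group("addr"), name)] += 1
    return counts


# Three lines copied from the log above, one per device.
SAMPLE = """\
[    5.301395] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301401] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[    5.301437] pcieport 0000:40:01.1: AER: aer_status: 0x00001000, aer_mask: 0x00000000
"""

if __name__ == "__main__":
    for (addr, name), count in sorted(summarize(SAMPLE).items()):
        print(addr, name, count)
```

Read this way, both GPU functions report receiver errors at the physical layer, while the bridge above them reports replay timer timeouts at the data link layer. Feeding the full output of dmesg to summarize() would give per-device totals.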

Corrected PCI addresses

Corrected-error messages exist for all 3 PCI addresses listed above. Here is an example:

[   10.419954] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
[   10.419957] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[   10.419958] {3}[Hardware Error]: event severity: corrected
[   10.419959] {3}[Hardware Error]:  Error 0, type: corrected
[   10.419960] {3}[Hardware Error]:   section_type: PCIe error
[   10.419960] {3}[Hardware Error]:   port_type: 4, root port
[   10.419961] {3}[Hardware Error]:   version: 0.2
[   10.419961] {3}[Hardware Error]:   command: 0x0407, status: 0x0010
[   10.419962] {3}[Hardware Error]:   device_id: 0000:40:01.1
[   10.419963] {3}[Hardware Error]:   slot: 0
[   10.419964] {3}[Hardware Error]:   secondary_bus: 0x41
[   10.419964] {3}[Hardware Error]:   vendor_id: 0x1022, device_id: 0x1483
[   10.419965] {3}[Hardware Error]:   class_code: 060400
[   10.419966] {3}[Hardware Error]:   bridge: secondary_status: 0x0000, control: 0x0012
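
The command, status and class_code fields in this record are standard PCI configuration-space values, so they can be decoded in the same bitwise fashion. This is a rough sketch added for illustration. The bit names follow the PCI configuration header layout, and the helper names decode_bits and split_class_code are made up for this example.

```python
# Bit names for the PCI Command register (configuration offset 0x04).
COMMAND_BITS = {
    0: "I/O Space Enable",
    1: "Memory Space Enable",
    2: "Bus Master Enable",
    3: "Special Cycle Enable",
    4: "Memory Write and Invalidate",
    5: "VGA Palette Snoop",
    6: "Parity Error Response",
    8: "SERR# Enable",
    9: "Fast Back-to-Back Enable",
    10: "Interrupt Disable",
}

# Bit names for the PCI Status register (configuration offset 0x06).
STATUS_BITS = {
    3: "Interrupt Status",
    4: "Capabilities List",
    5: "66 MHz Capable",
    7: "Fast Back-to-Back Capable",
    8: "Master Data Parity Error",
    11: "Signaled Target Abort",
    12: "Received Target Abort",
    13: "Received Master Abort",
    14: "Signaled System Error",
    15: "Detected Parity Error",
}


def decode_bits(value, names):
    """Return the names of all known bits that are set in value."""
    return [name for bit, name in sorted(names.items()) if value & (1 << bit)]


def split_class_code(class_code):
    """Split a 24-bit class code into (base class, subclass, prog-if)."""
    return (class_code >> 16) & 0xFF, (class_code >> 8) & 0xFF, class_code & 0xFF


if __name__ == "__main__":
    # Values taken from the hardware error record above.
    print("command:", decode_bits(0x0407, COMMAND_BITS))
    print("status: ", decode_bits(0x0010, STATUS_BITS))
    print("class:  ", [hex(part) for part in split_class_code(0x060400)])
```

With the values from the record, the status register has only the Capabilities List bit set and none of the parity or abort bits, and the class code 06/04/00 identifies a PCI-to-PCI bridge. That is consistent with a link-level error that the hardware corrected.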

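Testing and attempts to resolve the messages

Working with the vendor, I have tried a number of things to narrow down the cause.

  • Removing the GPU makes the problem go away completely.
  • Updated the firmware of the two Western Digital SN850X NVMe SSDs in the machine.
  • Installed the system on a different SSD model, on the guess that the WD SN850X was at fault.
  • PNY has confirmed that there is no BIOS update for the A6000 GPU*. The BIOS has been updated.
  • Running Windows does not seem to reveal any specific problem.
  • Kernel 6.2 has been tested to make sure all components are supported.
  • To keep power-state switching on the PCI lanes from causing problems, ASPM has been turned off in the grub boot menu. The BIOS has no ASPM control for the GPU, only for storage.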
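
As a side note on the ASPM item, it can be worth confirming that the setting actually reached the running kernel. The sketch below is an assumption-laden illustration: the post does not name the exact parameter, so it assumes ASPM was disabled with the kernel parameter pcie_aspm=off, and it reads the ASPM policy from /sys/module/pcie_aspm/parameters/policy, where the active policy is shown in square brackets. The function names are hypothetical.

```python
from pathlib import Path

CMDLINE_PATH = Path("/proc/cmdline")
POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def aspm_disabled_on_cmdline(cmdline):
    """True if the kernel command line contains the pcie_aspm=off parameter."""
    return "pcie_aspm=off" in cmdline.split()


def active_aspm_policy(policy_text):
    """Return the bracketed entry, e.g. 'powersave' from
    'default performance [powersave] powersupersave', or None if absent."""
    for token in policy_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


if __name__ == "__main__":
    # Both files are Linux-specific; skip quietly when they are missing.
    if CMDLINE_PATH.exists():
        disabled = aspm_disabled_on_cmdline(CMDLINE_PATH.read_text())
        print("pcie_aspm=off on command line:", disabled)
    if POLICY_PATH.exists():
        print("active ASPM policy:", active_aspm_policy(POLICY_PATH.read_text()))
```

If the parameter is missing from /proc/cmdline after a reboot, the grub change was probably not written to the generated configuration.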

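A student ran some compute jobs on this machine and did not report any specific problem when using the GPU. In addition, FurMark on Windows and GPUburn on Ubuntu appear to run fine, which seems to indicate that the errors are being corrected.

I am still keen to understand the problem better, to make sure these AER messages will not affect future work on the machine, since it will be used for computation. At the moment it is hard to tell whether this is a software problem in the OS or a hardware problem with the card.

Thanks in advance!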

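Answer 1

This is not a fix, but I managed to move the GPU to a different PCI slot, and the errors no longer appear. GPUburn tests appear to run fine.

I have reported the issue to the motherboard manufacturer, as this seems to be some obscure PCI addressing problem (ASUS TeK / Pro WS WRX80E-SAGE SE WIFI).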
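
To check that the errors really stay away over time, one option is to compare the per-device AER counters before and after a load test. This is a sketch under assumptions: recent kernels expose correctable-error counts in /sys/bus/pci/devices/<address>/aer_dev_correctable as "name count" lines, but whether those counters are filled in when the firmware handles the errors first (as the APEI records in the question suggest) would need to be verified on the machine. The addresses below are the ones from the question and will differ after moving the card, and the helper names are made up.

```python
from pathlib import Path

# Addresses from the question; adjust them after moving the GPU to another slot.
DEVICES = ["0000:41:00.0", "0000:41:00.1", "0000:40:01.1"]


def parse_aer_counters(text):
    """Parse 'name count' lines (e.g. 'RxErr 3') into a dict of integers."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def counter_delta(before, after):
    """Return only the counters that grew between two snapshots."""
    return {name: after[name] - before.get(name, 0)
            for name in after
            if after[name] > before.get(name, 0)}


def read_device_counters(address):
    """Read the correctable counters for one device, or None if unavailable."""
    path = Path("/sys/bus/pci/devices") / address / "aer_dev_correctable"
    if not path.exists():
        return None
    return parse_aer_counters(path.read_text())


if __name__ == "__main__":
    for address in DEVICES:
        print(address, read_device_counters(address))
```

Taking one snapshot before a GPUburn run and one after, then passing both to counter_delta(), would show whether any correctable errors accumulated during the test.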
