Whenever the machine boots or the GPU is under load, I run into a problem with an NVIDIA RTX A6000.
dmesg reports "AER: buffer overflow in recovery for"
three separate PCI addresses:
41:00.1 Audio device: NVIDIA Corporation GA102 High Definition Audio Controller (rev a1)
41:00.0 VGA compatible controller: NVIDIA Corporation GA102GL [RTX A6000] (rev a1)
40:01.1 PCI bridge: Advanced Micro Devices, Inc. [AMD] Starship/Matisse GPP Bridge
AER also reports that these errors were corrected. However, the messages show that snd_hda_intel 0000:41:00.1
is affected as well.
[ 5.301395] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301397] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301399] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301401] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301402] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301403] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301405] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301405] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301406] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301407] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301408] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301409] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301410] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301411] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301411] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301413] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301414] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301414] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301416] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301416] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301417] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301418] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301419] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301420] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301421] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301422] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301422] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301424] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301424] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301425] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301426] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301427] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301428] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301429] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301430] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301430] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301432] snd_hda_intel 0000:41:00.1: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301432] snd_hda_intel 0000:41:00.1: [ 0] RxErr (First)
[ 5.301433] snd_hda_intel 0000:41:00.1: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301435] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000
[ 5.301435] nvidia 0000:41:00.0: [ 0] RxErr (First)
[ 5.301436] nvidia 0000:41:00.0: AER: aer_layer=Physical Layer, aer_agent=Receiver ID
[ 5.301437] pcieport 0000:40:01.1: AER: aer_status: 0x00001000, aer_mask: 0x00000000
[ 5.301438] pcieport 0000:40:01.1: [12] Timeout
[ 5.301439] pcieport 0000:40:01.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
[ 5.301440] pcieport 0000:40:01.1: AER: aer_status: 0x00001000, aer_mask: 0x00000000
[ 5.301441] pcieport 0000:40:01.1: [12] Timeout
[ 5.301442] pcieport 0000:40:01.1: AER: aer_layer=Data Link Layer, aer_agent=Transmitter ID
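To see at a glance which device reports which error and how often, the log above can be tallied. The following is a minimal sketch, not a definitive tool. It assumes every aer_status value belongs to a corrected error (as in the excerpt), and it uses the short bit names that the Linux AER driver prints for correctable errors. The function names are made up for illustration.

```python
import re
from collections import Counter

# Bit positions in the PCIe Correctable Error Status register, with the
# short names the Linux AER driver prints. Assumption: the status words
# fed in here all come from corrected errors, as in the excerpt above.
CORRECTABLE_BITS = {
    0: "RxErr",
    6: "BadTLP",
    7: "BadDLLP",
    8: "Rollover",
    12: "Timeout",
    13: "NonFatalErr",
    14: "CorrIntErr",
    15: "HeaderOF",
}

# Matches e.g. "nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: ..."
STATUS_RE = re.compile(
    r"(?P<addr>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"AER: aer_status: 0x(?P<status>[0-9a-f]{8})",
    re.IGNORECASE,
)


def decode_status(status):
    """Return the names of all correctable-error bits set in a status word."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items())
            if status & (1 << bit)]


def tally_aer(log_text):
    """Count (PCI address, error name) pairs found in dmesg-style text."""
    counts = Counter()
    for match in STATUS_RE.finditer(log_text):
        status = int(match.group("status"), 16)
        for name in decode_status(status):
            counts[(match.group("addr"), name)] += 1
    return counts


# Two lines copied from the log above serve as sample input.
sample = (
    "[ 5.301395] nvidia 0000:41:00.0: AER: aer_status: 0x00000001, aer_mask: 0x00000000\n"
    "[ 5.301437] pcieport 0000:40:01.1: AER: aer_status: 0x00001000, aer_mask: 0x00000000\n"
)
for (addr, name), count in sorted(tally_aer(sample).items()):
    print(f"{addr} {name}: {count}")
```

On the real machine, the saved output of dmesg could be passed to tally_aer in place of the sample string. In this excerpt, the two functions of the card (41:00.0 and 41:00.1) only report receiver errors at the physical layer, while the bridge at 40:01.1 reports replay timer timeouts at the data link layer.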
Corrected PCI addresses
Corrected-error messages exist for all three PCI addresses listed earlier. Here is an example:
[ 10.419954] {3}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 512
[ 10.419957] {3}[Hardware Error]: It has been corrected by h/w and requires no further action
[ 10.419958] {3}[Hardware Error]: event severity: corrected
[ 10.419959] {3}[Hardware Error]: Error 0, type: corrected
[ 10.419960] {3}[Hardware Error]: section_type: PCIe error
[ 10.419960] {3}[Hardware Error]: port_type: 4, root port
[ 10.419961] {3}[Hardware Error]: version: 0.2
[ 10.419961] {3}[Hardware Error]: command: 0x0407, status: 0x0010
[ 10.419962] {3}[Hardware Error]: device_id: 0000:40:01.1
[ 10.419963] {3}[Hardware Error]: slot: 0
[ 10.419964] {3}[Hardware Error]: secondary_bus: 0x41
[ 10.419964] {3}[Hardware Error]: vendor_id: 0x1022, device_id: 0x1483
[ 10.419965] {3}[Hardware Error]: class_code: 060400
[ 10.419966] {3}[Hardware Error]: bridge: secondary_status: 0x0000, control: 0x0012
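Tests and attempts to resolve the messages
Working with the vendor, I have tried a number of things to rule out possible causes.
- Removing the GPU makes the problem go away completely.
- Updated the firmware of the two NVMe SSDs (Western Digital SN850X) in the machine.
- Installed the system on a different SSD model, in case the WD SN850X was at fault.
- PNY has confirmed that there is no BIOS update for the A6000 GPU. The system BIOS has been updated.
- Running Windows did not reveal any specific problem.
- Kernel 6.2 was tested to make sure all components are supported.
- To rule out power-state switching on the PCIe lanes as the cause, ASPM was turned off in the GRUB boot menu. The BIOS has no ASPM control for the GPU, only for storage.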
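The ASPM step in the list above can be double-checked from the running system. The sketch below assumes ASPM was disabled with the kernel parameter pcie_aspm=off (the post does not say which option was used) and that the kernel exposes its ASPM policy in /sys/module/pcie_aspm/parameters/policy, with the active entry in square brackets. The function names are hypothetical.

```python
from pathlib import Path


def aspm_disabled_on_cmdline(cmdline):
    """True if the kernel command line contains the pcie_aspm=off parameter."""
    return "pcie_aspm=off" in cmdline.split()


def active_aspm_policy(policy_text):
    """Return the bracketed entry, e.g. 'powersave' from
    'default performance [powersave] powersupersave', or None."""
    for word in policy_text.split():
        if word.startswith("[") and word.endswith("]"):
            return word[1:-1]
    return None


def report():
    """Print the ASPM-related state of the running Linux system."""
    cmdline_file = Path("/proc/cmdline")
    if cmdline_file.exists():
        disabled = aspm_disabled_on_cmdline(cmdline_file.read_text())
        print("pcie_aspm=off on kernel command line:", disabled)
    policy_file = Path("/sys/module/pcie_aspm/parameters/policy")
    if policy_file.exists():
        print("active ASPM policy:", active_aspm_policy(policy_file.read_text()))
```

Calling report() on the affected machine shows whether the GRUB change actually reached the kernel. If it did and the AER messages still appear, ASPM is unlikely to be the cause.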
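A student has run some compute jobs on this machine and reported no specific problems while using the GPU. In addition, FurMark on Windows and GPUburn on Ubuntu appear to run fine, which seems to suggest that the errors are being corrected.
I would still like to understand the problem better, to make sure these AER messages will not affect future work on the machine, since it will be used for computation. At the moment it is hard to tell whether this is a software problem in the OS or a hardware problem with the card.
Thanks in advance!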
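Answer 1
This is not a fix, but I managed to move the GPU to a different PCI slot, and the errors no longer appear. The GPUburn test appears to run fine.
I reported the problem to the motherboard manufacturer, since this looks like some obscure PCI address issue (ASUS TeK / Pro WS WRX80E-SAGE SE WIFI).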
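One way to confirm that the new slot stays clean over time is to read the per-device AER counters and link status from sysfs instead of watching dmesg. The sketch below assumes a kernel recent enough to expose the aer_dev_correctable attribute and the link speed/width attributes under /sys/bus/pci/devices. The function names are made up for illustration, and the GPU will have a different PCI address after the move.

```python
from pathlib import Path


def parse_aer_stats(text):
    """Parse 'name count' lines such as 'Receiver Error 2' into a dict."""
    stats = {}
    for line in text.splitlines():
        parts = line.rsplit(None, 1)  # the name may contain spaces
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0].strip()] = int(parts[1])
    return stats


def read_device(addr, sysfs_root="/sys/bus/pci/devices"):
    """Collect link status and correctable-error counters for one device.

    Attributes that are missing (older kernel, no AER capability, or a
    non-Linux system) are reported as None.
    """
    dev = Path(sysfs_root) / addr
    info = {}
    for name in ("current_link_speed", "current_link_width",
                 "max_link_speed", "max_link_width"):
        attr = dev / name
        info[name] = attr.read_text().strip() if attr.exists() else None
    aer = dev / "aer_dev_correctable"
    info["correctable"] = parse_aer_stats(aer.read_text()) if aer.exists() else None
    return info
```

Running read_device with the GPU's new address (as shown by lspci) before and after a GPUburn run, and comparing the counters, would show whether corrected errors are still accumulating silently. A current link speed or width below the maximum under load might also point to a signal problem in the slot, though that is only a heuristic.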