Why do I have to remove/rescan my PCIe device before the kernel assigns its BAR memory?

When I cold boot my SuperMicro server, the kernel dmesg log shows that it finds my PCIe cards and the MEM space their BARs require...

[  +0.000036] pci 0000:19:00.0: [10de:1eb1] type 00 class 0x030000
[  +0.000022] pci 0000:19:00.0: reg 0x10: [mem 0xc3000000-0xc3ffffff]
[  +0.000011] pci 0000:19:00.0: reg 0x14: [mem 0x3807e0000000-0x3807efffffff 64bit pref]
[  +0.000010] pci 0000:19:00.0: reg 0x1c: [mem 0x3807f0000000-0x3807f1ffffff 64bit pref]
[  +0.000006] pci 0000:19:00.0: reg 0x24: [io  0x7000-0x707f]
[  +0.000006] pci 0000:19:00.0: reg 0x30: [mem 0xc4000000-0xc407ffff pref]
[  +0.000006] pci 0000:19:00.0: enabling Extended Tags
[  +0.000149] pci 0000:19:00.0: PME# supported from D0 D3hot
[  +0.000075] pci 0000:19:00.1: [10de:10f8] type 00 class 0x040300
[  +0.000018] pci 0000:19:00.1: reg 0x10: [mem 0xc4080000-0xc4083fff]
[  +0.000048] pci 0000:19:00.1: enabling Extended Tags
[  +0.000169] pci 0000:19:00.2: [10de:1ad8] type 00 class 0x0c0330
[  +0.000022] pci 0000:19:00.2: reg 0x10: [mem 0x3807f2000000-0x3807f203ffff 64bit pref]
[  +0.000019] pci 0000:19:00.2: reg 0x1c: [mem 0x3807f2040000-0x3807f204ffff 64bit pref]
[  +0.000020] pci 0000:19:00.2: enabling Extended Tags
[  +0.000134] pci 0000:19:00.2: PME# supported from D0 D3hot
[  +0.000041] pci 0000:19:00.3: [10de:1ad9] type 00 class 0x0c8000
[  +0.000018] pci 0000:19:00.3: reg 0x10: [mem 0xc4084000-0xc4084fff]
[  +0.000047] pci 0000:19:00.3: enabling Extended Tags
[  +0.000134] pci 0000:19:00.3: PME# supported from D0 D3hot

And for the second device:

[  +0.000073] pci 0000:1a:00.0: [10ee:d03c] type 00 class 0x120000
[  +0.000020] pci 0000:1a:00.0: reg 0x10: [mem 0xc0000000-0xc1ffffff]
[  +0.000008] pci 0000:1a:00.0: reg 0x14: [mem 0xc2000000-0xc200ffff]
[  +0.000043] pci 0000:1a:00.0: enabling Extended Tags

The log then says that it cannot allocate MEM space for several of the PCIe cards in the system:

[  +0.000134] pci 0000:19:00.0: BAR 3: assigned [mem 0x380410000000-0x380411ffffff 64bit pref]
[  +0.000135] pci 0000:19:00.0: BAR 0: no space for [mem size 0x01000000]
[  +0.000093] pci 0000:19:00.0: BAR 0: failed to assign [mem size 0x01000000]
[  +0.000096] pci 0000:19:00.0: BAR 6: no space for [mem size 0x00080000 pref]
[  +0.000095] pci 0000:19:00.0: BAR 6: failed to assign [mem size 0x00080000 pref]
[  +0.000124] pci 0000:19:00.2: BAR 0: assigned [mem 0x380412000000-0x38041203ffff 64bit pref]
[  +0.000135] pci 0000:19:00.2: BAR 3: assigned [mem 0x380412040000-0x38041204ffff 64bit pref]
[  +0.000134] pci 0000:19:00.1: BAR 0: no space for [mem size 0x00004000]
[  +0.000093] pci 0000:19:00.1: BAR 0: failed to assign [mem size 0x00004000]
[  +0.000096] pci 0000:19:00.3: BAR 0: no space for [mem size 0x00001000]
[  +0.000094] pci 0000:19:00.3: BAR 0: failed to assign [mem size 0x00001000]
[  +0.000096] pci 0000:19:00.0: BAR 5: assigned [io  0x4000-0x407f]
[  +0.000108] pci 0000:19:00.0: BAR 0: no space for [mem size 0x01000000]
[  +0.000094] pci 0000:19:00.0: BAR 0: failed to assign [mem size 0x01000000]
[  +0.000096] pci 0000:19:00.0: BAR 6: no space for [mem size 0x00080000 pref]
[  +0.000095] pci 0000:19:00.0: BAR 6: failed to assign [mem size 0x00080000 pref]
[  +0.000124] pci 0000:19:00.1: BAR 0: no space for [mem size 0x00004000]
[  +0.000094] pci 0000:19:00.1: BAR 0: failed to assign [mem size 0x00004000]
[  +0.000095] pci 0000:19:00.3: BAR 0: no space for [mem size 0x00001000]
[  +0.000094] pci 0000:19:00.3: BAR 0: failed to assign [mem size 0x00001000]

The second device:

[  +0.000131] pci 0000:1a:00.0: BAR 0: no space for [mem size 0x02000000]
[  +0.000094] pci 0000:1a:00.0: BAR 0: failed to assign [mem size 0x02000000]
[  +0.000095] pci 0000:1a:00.0: BAR 1: no space for [mem size 0x00010000]
[  +0.000094] pci 0000:1a:00.0: BAR 1: failed to assign [mem size 0x00010000]
[  +0.000095] pci 0000:1a:00.0: BAR 0: no space for [mem size 0x02000000]
[  +0.000094] pci 0000:1a:00.0: BAR 0: failed to assign [mem size 0x02000000]
[  +0.000095] pci 0000:1a:00.0: BAR 1: no space for [mem size 0x00010000]
[  +0.000093] pci 0000:1a:00.0: BAR 1: failed to assign [mem size 0x00010000]
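
Each failure appears more than once in the log above, which makes it hard to see at a glance which BARs are affected. As a minimal sketch, the function below reads kernel log text on standard input, strips the bracketed timestamp, and prints each distinct "failed to assign" line once. The name `failed_bars` is a hypothetical helper, not an existing tool.

```shell
#!/bin/sh
# failed_bars: hypothetical helper, reads kernel log text on stdin.
failed_bars() {
    # Keep only the assignment failures, drop the leading "[ timestamp ] "
    # prefix, then collapse repeated lines into one entry each.
    grep 'failed to assign' \
        | sed -e 's/^\[[^]]*\] *//' \
        | sort -u
}

# Usage (reading the kernel log usually needs root):
#   sudo dmesg | failed_bars
```

Run against the log above, this should leave one line per failing BAR instead of two.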

Looking at the missing addresses for one of the devices:

$ sudo lspci -vd 10ee:
1a:00.0 Processing accelerators: Xilinx Corporation Device d03c (rev 02)
Subsystem: Xilinx Corporation Device 000e
Physical Slot: 9
Flags: bus master, fast devsel, latency 0, IRQ 26, NUMA node 0
Memory at <ignored> (32-bit, non-prefetchable)
Memory at <ignored> (32-bit, non-prefetchable)
Capabilities: [40] Power Management version 3
Capabilities: [48] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [1c0] #19
Kernel driver in use: xclmgmt
Kernel modules: xclmgmt
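
The `Memory at <ignored>` lines are the visible symptom here. A rough sketch for checking this without reading the output by eye: the function below counts regions in `lspci -v` output that have no address. The name `unassigned_regions` is hypothetical, and matching `<unassigned>` as well is an assumption on my part that lspci may print it in similar cases; only `<ignored>` appears in the output above.

```shell
#!/bin/sh
# unassigned_regions: hypothetical helper, reads `lspci -v` output on stdin
# and prints how many memory or I/O regions have no address assigned.
unassigned_regions() {
    # grep -c prints the count; it exits non-zero when the count is 0,
    # so "|| true" keeps the function's exit status at success.
    grep -cE ' at <(ignored|unassigned)>' || true
}

# Usage:
#   sudo lspci -vd 10ee: | unassigned_regions
```

A non-zero count after boot would indicate that the device still needs the remove/rescan step.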

Now, I can simply remove the root port, as many questions/answers on this site tell me to do, and then rescan:

$ echo 1 | sudo tee /sys/bus/pci/devices/0000:ROOTPORT:0.0/remove > /dev/null
$ echo 1 | sudo tee /sys/bus/pci/rescan > /dev/null
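
For repeated use, the two writes above can be wrapped in a small function. This is only a sketch of the same workaround, not a fix for the underlying allocation problem. The function name `pci_remove_rescan` and its optional second argument (an alternative sysfs root, there only so the function can be dry-run against a scratch directory) are hypothetical, and the address in the usage comment is a placeholder.

```shell
#!/bin/sh
# pci_remove_rescan PORT [SYSFS_ROOT]
#   PORT       full address of the root port (domain:bus:device.function)
#   SYSFS_ROOT defaults to /sys; override only for dry runs (hypothetical)
pci_remove_rescan() {
    port=$1
    sysfs=${2:-/sys}
    remove_node="$sysfs/bus/pci/devices/$port/remove"

    if [ ! -e "$remove_node" ]; then
        echo "no such PCI device: $port" >&2
        return 1
    fi

    # Detach the root port together with everything below it...
    echo 1 > "$remove_node" || return 1
    # ...then ask the kernel to enumerate the bus again.
    echo 1 > "$sysfs/bus/pci/rescan"
}

# Usage (must run as root; 0000:17:00.0 is a placeholder address):
#   pci_remove_rescan 0000:17:00.0
```

The root port for a given card can be read off the tree printed by `lspci -tv`.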


$ sudo lspci -vd 10ee:
1a:00.0 Processing accelerators: Xilinx Corporation Device d03c (rev 02)
Subsystem: Xilinx Corporation Device 000e
Physical Slot: 9
Flags: bus master, fast devsel, latency 0, IRQ 33, NUMA node 0
Memory at ac000000 (32-bit, non-prefetchable) [size=32M]
Memory at ab000000 (32-bit, non-prefetchable) [size=64K]
Capabilities: [40] Power Management version 3
Capabilities: [48] MSI: Enable- Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [1c0] #19
Kernel driver in use: xclmgmt
Kernel modules: xclmgmt

These steps work every time.

I don't understand why this is necessary...

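The two cards come from two different vendors (NVIDIA, Xilinx) and use different drivers. Is there a configuration in Ubuntu that can force this on every boot? Is the BIOS failing to pass the enumeration data on correctly?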
