Supermicro: new NVMe drive is not detected correctly

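We have a Supermicro SuperServer 2029U-TN24R4T that currently holds 8 U.2 NVMe drives (Samsung PM1725a 1.6 TB), running CentOS 7 with kernel 5.0.10-1.el7.elrepo.x86_64. After adding a new drive (PM1725b 1.6 TB), it shows up in /dev for a few seconds (but only as nvme8, not nvme8n1 as one would expect) and then "disappears". This is reproducible in different SSD trays of the chassis, and even with a drive of exactly the same model we are currently using (the new drives are a newer model). Adding the drive produces the following in the kernel log: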

Jul 5 21:54:25 nvme02 kernel: pciehp 10002:02:05.0:pcie204: Slot(37): Card present
Jul 5 21:54:27 nvme02 kernel: pcieport 10002:02:05.0: Data Link Layer Link Active not set in 1000 msec
Jul 5 21:54:27 nvme02 kernel: pciehp 10002:02:05.0:pcie204: Failed to check link status
Jul 5 21:54:31 nvme02 kernel: pciehp 10002:02:08.0:pcie204: Slot(136): Card present
Jul 5 21:54:31 nvme02 kernel: pciehp 10002:02:08.0:pcie204: Slot(136): Link Up
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 13: no space for [io size 0x1000]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 13: failed to assign [io size 0x1000]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 13: no space for [io size 0x1000]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 13: failed to assign [io size 0x1000]
Jul 5 21:54:31 nvme02 kernel: pci 10002:07:00.0: BAR 6: assigned [mem 0xc2400000-0xc240ffff pref]
Jul 5 21:54:31 nvme02 kernel: pci 10002:07:00.0: BAR 0: assigned [mem 0xc2410000-0xc2413fff 64bit]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: PCI bridge to [bus 07]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: bridge window [mem 0xc2400000-0xc24fffff]
Jul 5 21:54:31 nvme02 kernel: nvme nvme8: pci function 10002:07:00.0
Jul 5 21:54:31 nvme02 kernel: nvme 10002:07:00.0: enabling device (0000 -> 0002)
Jul 5 21:54:31 nvme02 kernel: pciehp 10002:02:08.0:pcie204: Slot(136): Attention button pressed
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:00:00.0: can't derive routing for PCI INT A
Jul 5 21:54:31 nvme02 kernel: pciehp 10002:02:08.0:pcie204: Slot(136): Powering off due to button press
Jul 5 21:54:31 nvme02 kernel: nvme 10002:07:00.0: PCI INT A: not connected
Jul 5 21:54:31 nvme02 libvirtd: 2019-07-05 19:54:31.593+0000: 15899: error : virPCIDeviceNew:1774 : internal error: dev->name buffer overflow: 10002:07:00.0
Jul 5 21:54:34 nvme02 ipmievd: Unknown sensor ff
Jul 5 21:54:40 nvme02 kernel: nvme nvme8: failed to mark controller CONNECTING
Jul 5 21:54:40 nvme02 kernel: nvme nvme8: Removing after probe failure status: 0
Jul 5 21:54:44 nvme02 ipmievd: Unknown sensor ff

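The BIOS is only one release behind, and the changelog does not mention this issue. IPMI lists the new drive without any problems, and the locate function works as well. I assume a reboot might help, but the disks must be (and normally are) hot-swappable, although we have not tested that yet because we have not had any disk failures. Given the behavior described, we do not want to pull a production disk just for testing.

Any ideas would be greatly appreciated.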

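Answer 1

If you suspect a hardware problem, it seems worth calling the manufacturer.

Can you try this with a more stable kernel revision, or are you tied to a specific OS and kernel combination?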

Answer 2

Given this:

Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 13: no space for [io size 0x1000]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 13: failed to assign [io size 0x1000]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 15: no space for [mem size 0x00200000 64bit pref]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 13: no space for [io size 0x1000]
Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 13: failed to assign [io size 0x1000]

Try adding pci=realloc to the kernel command line.
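To see which bridges and BARs are affected in a longer log, the "failed to assign" lines can be collected per PCI address. The following is only a minimal sketch, assuming syslog-style lines formatted like the ones in the question; the function name failed_bar_assignments and the regular expression are illustrative, not part of any existing tool.

```python
import re
from collections import defaultdict

# Matches kernel lines such as:
#   "... kernel: pcieport 10002:02:08.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]"
# The pattern is an assumption based on the log format shown in the question.
BAR_FAIL = re.compile(
    r"kernel: \S+ (?P<dev>[0-9a-f:.]+): BAR (?P<bar>\d+): "
    r"failed to assign \[(?P<res>[^\]]+)\]"
)


def failed_bar_assignments(lines):
    """Return {pci_address: {(bar_number, resource_description), ...}}."""
    failures = defaultdict(set)
    for line in lines:
        match = BAR_FAIL.search(line)
        if match:
            failures[match.group("dev")].add(
                (int(match.group("bar")), match.group("res"))
            )
    return dict(failures)


# Three lines copied from the log in the question.
sample = [
    "Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 15: failed to assign [mem size 0x00200000 64bit pref]",
    "Jul 5 21:54:31 nvme02 kernel: pcieport 10002:02:08.0: BAR 13: failed to assign [io size 0x1000]",
    "Jul 5 21:54:31 nvme02 kernel: pci 10002:07:00.0: BAR 0: assigned [mem 0xc2410000-0xc2413fff 64bit]",
]

for dev, bars in failed_bar_assignments(sample).items():
    for bar, res in sorted(bars):
        print(f"{dev}: BAR {bar} not assigned ({res})")
```

Duplicate log lines collapse into a single entry because the results are stored in a set, so the repeated messages in the excerpt are reported once per BAR.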

Answer 3

Try adding the following boot options to grub:

pci=realloc,noats pcie_aspm=off pcie_ports=dpc_native nvme_core.default_ps_max_latency_us=0

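Don't forget to regenerate the GRUB configuration after making the change: update-grub on Debian-based systems, or grub2-mkconfig -o /boot/grub2/grub.cfg on CentOS 7 (the output path differs on UEFI installs).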
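After rebooting, it may be worth confirming that the options actually reached the kernel by comparing them against the contents of /proc/cmdline. A minimal sketch follows; the helper name missing_boot_params is hypothetical, and the example command line (BOOT_IMAGE and root values) is made up for illustration.

```python
def missing_boot_params(cmdline, wanted):
    """Return the entries of `wanted` that do not appear as whole tokens in `cmdline`."""
    present = set(cmdline.split())
    return [param for param in wanted if param not in present]


# The options suggested in this answer.
WANTED = [
    "pci=realloc,noats",
    "pcie_aspm=off",
    "pcie_ports=dpc_native",
    "nvme_core.default_ps_max_latency_us=0",
]

# Illustrative command line only; on a live system, read it with
# open("/proc/cmdline").read() instead.
example = (
    "BOOT_IMAGE=/vmlinuz-5.0.10-1.el7.elrepo.x86_64 root=/dev/sda1 ro "
    "pci=realloc,noats pcie_aspm=off"
)

print(missing_boot_params(example, WANTED))
```

The comparison is on exact tokens, so pci=realloc alone would not count as pci=realloc,noats; that is deliberate, since a partially applied option list is the case this check is meant to catch.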
