Conflict between the Linux kernel and NVMe drives. Faulty power saving mode enabled? ("nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" did not help)

2024-6-12

I have two almost identical servers. Both have:

Biostar GTA690 motherboard
Intel CPU
The same amount of RAM and the same disk layout
PVE distribution (a Debian-based hypervisor), kernel 5.19.17-1-pve

I bought three Seagate FireCuda 530 (2TB) drives for each server. All six drives, across both servers, fail in the same way, and I don't know why.

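Before anything else, I tried to upgrade the NVMe SSD firmware, but when I went to download it I found it was the same version as the one already installed (SU6SM003, released 1 March 2022). So I take it that the drives are already on the latest version.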

Also, the drives are completely unformatted and "empty". I was running raw fio tests on them to measure performance when I saw them fail.

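They fail under every kind of test pattern (read, write, randread and randwrite). Two of the drives are attached to Gen4 M.2 slots and the third to a Gen3 slot. The drive in the Gen3 slot seems to survive longer, even in tests where PCIe Gen3 is not the bottleneck for data transfer. I monitored temperatures during the benchmarks and never saw them reach 60 °C. The warning and critical temperatures are 90 and 95 °C, so even though I am using the default motherboard heatsinks, I don't think this is a temperature problem.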

Here is a sample of kernel.log from a failure of nvme0:

Feb 17 10:42:58 pve-02 kernel: [   63.521693] pcieport 0000:00:06.0: AER: Corrected error received: 0000:00:06.0
Feb 17 10:42:58 pve-02 kernel: [   63.521703] pcieport 0000:00:06.0: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
Feb 17 10:42:58 pve-02 kernel: [   63.521704] pcieport 0000:00:06.0:   device [8086:464d] error status/mask=00000001/00002000
Feb 17 10:42:58 pve-02 kernel: [   63.521706] pcieport 0000:00:06.0:    [ 0] RxErr
Feb 17 10:43:29 pve-02 kernel: [   95.188263] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Feb 17 10:43:29 pve-02 kernel: [   95.188269] nvme nvme0: Does your device have a faulty power saving mode enabled?
Feb 17 10:43:29 pve-02 kernel: [   95.188270] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Feb 17 10:43:29 pve-02 kernel: [   95.244775] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Feb 17 10:43:29 pve-02 kernel: [   95.244881] nvme nvme0: Removing after probe failure status: -19
Feb 17 10:43:29 pve-02 kernel: [   95.268964] nvme0n1: detected capacity change from 3907029168 to 0
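
For readers who want to interpret the `error status/mask=00000001/00002000` line, here is a minimal Python sketch that decodes the two values against the bit layout of the PCIe AER Correctable Error Status register. The helper names (`parse_status_mask`, `decode_correctable`) are made up for illustration and are not part of any tool. It only restates what the kernel already printed as `[ 0] RxErr`.

```python
import re

# Bit positions of the PCIe AER Correctable Error Status / Mask registers.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def parse_status_mask(line):
    """Extract (status, mask) as integers from an AER 'error status/mask=' line."""
    match = re.search(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})", line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


def decode_correctable(value):
    """Return the names of all known bits set in a correctable-error register value."""
    return [name for bit, name in sorted(CORRECTABLE_BITS.items()) if value & (1 << bit)]


if __name__ == "__main__":
    sample = ("pcieport 0000:00:06.0:   device [8086:464d] "
              "error status/mask=00000001/00002000")
    status, mask = parse_status_mask(sample)
    print("reported:", decode_correctable(status))
    print("masked:  ", decode_correctable(mask))
```

In this log the only reported bit is the Receiver Error, a physical-layer event on the root port at 0000:00:06.0, and the only masked bit is Advisory Non-Fatal Error.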

The same happens with nvme1 and nvme2, and it is the same on both servers.

I tried setting nvme_core.default_ps_max_latency_us=0 pcie_aspm=off, as the kernel log suggests. I am using systemd-boot, so I added it to the options line in /boot/efi/loader/entries/proxmox-5.19.17-1-pve.conf. After rebooting, nvme get-feature /dev/nvme0 -f 0x0c -H returns:

get-feature:0xc (Autonomous Power State Transition), Current value:00000000
    Autonomous Power State Transition Enable (APSTE): Disabled
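
Besides the get-feature output, it can help to confirm that both options really reached the running kernel. The following is a small sketch, not a definitive check: it parses a kernel command line string (on a live system, the contents of /proc/cmdline) and reports whether the two options from the kernel hint are present with the expected values. The function names are hypothetical, and the parser deliberately ignores quoted values containing spaces.

```python
from pathlib import Path


def parse_cmdline(cmdline):
    """Split a kernel command line into a {key: value} dict (value is None for bare flags)."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def check_nvme_power_options(cmdline):
    """Report whether the two options suggested by the nvme driver are set as expected."""
    params = parse_cmdline(cmdline)
    return {
        "nvme_core.default_ps_max_latency_us=0":
            params.get("nvme_core.default_ps_max_latency_us") == "0",
        "pcie_aspm=off": params.get("pcie_aspm") == "off",
    }


if __name__ == "__main__":
    cmdline_path = Path("/proc/cmdline")
    if cmdline_path.exists():
        for option, ok in check_nvme_power_options(cmdline_path.read_text()).items():
            print(f"{option}: {'present' if ok else 'MISSING'}")
```

If either option shows as missing, the boot entry that was edited is probably not the one the machine actually booted from.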

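APST was enabled before, so I hoped the errors would now be gone, but they are not. The drives keep failing, although the log is now slightly different: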

Feb 20 10:50:38 pve-02 kernel: [ 1117.637355] nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
Feb 20 10:50:38 pve-02 kernel: [ 1117.637377] nvme nvme0: Does your device have a faulty power saving mode enabled?
Feb 20 10:50:38 pve-02 kernel: [ 1117.637384] nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Feb 20 10:50:38 pve-02 kernel: [ 1117.694007] nvme0: Admin Cmd(0x6), I/O Error (sct 0x3 / sc 0x71)
Feb 20 10:50:38 pve-02 kernel: [ 1117.733561] nvme 0000:01:00.0: enabling device (0000 -> 0002)
Feb 20 10:50:38 pve-02 kernel: [ 1117.733671] nvme nvme0: Removing after probe failure status: -19
Feb 20 10:50:38 pve-02 kernel: [ 1117.761606] nvme0n1: detected capacity change from 3907029168 to 0

The pcieport errors are gone, but now there is this new entry: Admin Cmd(0x6), I/O Error (sct 0x3 / sc 0x71), and God knows what it means.
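
As a reading aid for that line, here is a minimal Python sketch that decodes the admin opcode, status code type (sct) and status code (sc) using the values defined in the NVMe base specification. The lookup tables are intentionally partial, and `decode_admin_error` is a hypothetical helper name, not a kernel or nvme-cli function.

```python
import re

# Partial lookup tables taken from the NVMe base specification.
ADMIN_OPCODES = {
    0x00: "Delete I/O Submission Queue",
    0x01: "Create I/O Submission Queue",
    0x02: "Get Log Page",
    0x04: "Delete I/O Completion Queue",
    0x05: "Create I/O Completion Queue",
    0x06: "Identify",
    0x08: "Abort",
    0x09: "Set Features",
    0x0A: "Get Features",
}
STATUS_CODE_TYPES = {
    0x0: "Generic Command Status",
    0x1: "Command Specific Status",
    0x2: "Media and Data Integrity Errors",
    0x3: "Path Related Status",
    0x7: "Vendor Specific",
}
PATH_STATUS_CODES = {
    0x00: "Internal Path Error",
    0x60: "Controller Pathing Error",
    0x70: "Host Pathing Error",
    0x71: "Command Aborted By Host",
}


def decode_admin_error(line):
    """Decode 'Admin Cmd(0x..), I/O Error (sct 0x.. / sc 0x..)' into readable names."""
    match = re.search(
        r"Admin Cmd\(0x([0-9a-fA-F]+)\), I/O Error \(sct 0x([0-9a-fA-F]+) / sc 0x([0-9a-fA-F]+)\)",
        line,
    )
    if match is None:
        return None
    opcode, sct, sc = (int(group, 16) for group in match.groups())
    # Only the Path Related status codes are tabulated in this sketch.
    sc_name = PATH_STATUS_CODES.get(sc, "unknown") if sct == 0x3 else "not decoded here"
    return (
        ADMIN_OPCODES.get(opcode, "unknown"),
        STATUS_CODE_TYPES.get(sct, "unknown"),
        sc_name,
    )


if __name__ == "__main__":
    print(decode_admin_error("nvme0: Admin Cmd(0x6), I/O Error (sct 0x3 / sc 0x71)"))
```

Read this way, the entry says that an Identify command was aborted by the host, which would be consistent with the driver cancelling outstanding commands while it resets a controller that has stopped responding, rather than with a separate new fault.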

Here is the SMART data for nvme0 on server 2:

root@pve-02:~# smartctl -a /dev/nvme0
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.19.17-1-pve] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Seagate FireCuda 530 ZP2000GM30013
Serial Number:                      7VR033KY
Firmware Version:                   SU6SM003
PCI Vendor/Subsystem ID:            0x1bb1
IEEE OUI Identifier:                0x6479a7
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      1
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            6479a7 6d3f00fdd1
Local Time is:                      Mon Feb 20 11:46:09 2023 CET
Firmware Updates (0x18):            4 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005d):     Comp DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x08):         Telmtry_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     90 Celsius
Critical Comp. Temp. Threshold:     95 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.80W       -        -    0  0  0  0        0       0
 1 +     2.90W       -        -    1  1  1  1        0       0
 2 +     2.80W       -        -    2  2  2  2        0       0
 3 -   0.0250W       -        -    3  3  3  3     2500    7500
 4 -   0.0050W       -        -    4  4  4  4    10500   65000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         2
 1 -    4096       0         1

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        28 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    0%
Data Units Read:                    9,323,369 [4.77 TB]
Data Units Written:                 2,755,621 [1.41 TB]
Host Read Commands:                 125,896,434
Host Write Commands:                271,550,259
Controller Busy Time:               26
Power Cycles:                       2,348
Power On Hours:                     287
Unsafe Shutdowns:                   2,312
Media and Data Integrity Errors:    0
Error Information Log Entries:      88
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 63 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0         88     0  0x0008  0x4004  0x028            0     0     -
  1         87     0  0x1014  0x4004      -            0     0     -
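
A couple of the counters above are easier to judge after some arithmetic. The sketch below, with hypothetical helper names, converts NVMe data units to terabytes (the NVMe specification defines one data unit as 1000 blocks of 512 bytes) and computes the share of power cycles that ended in an unsafe shutdown. The input numbers are copied from the smartctl output.

```python
BYTES_PER_DATA_UNIT = 1000 * 512  # one NVMe data unit = 1000 x 512-byte blocks


def data_units_to_tb(units):
    """Convert an NVMe 'Data Units' counter to decimal terabytes."""
    return units * BYTES_PER_DATA_UNIT / 1e12


def unsafe_shutdown_ratio(unsafe_shutdowns, power_cycles):
    """Fraction of power cycles that were recorded as unsafe shutdowns."""
    return unsafe_shutdowns / power_cycles


if __name__ == "__main__":
    print(f"read:    {data_units_to_tb(9_323_369):.2f} TB")
    print(f"written: {data_units_to_tb(2_755_621):.2f} TB")
    print(f"unsafe shutdowns: {unsafe_shutdown_ratio(2_312, 2_348):.1%} of power cycles")
```

The conversions agree with the 4.77 TB and 1.41 TB that smartctl prints, and the ratio shows that nearly every recorded power cycle on this drive was an unsafe shutdown, which may be worth keeping in mind when reading the 88 error log entries.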

I don't know what to do next. I feel like I have hit a dead end, so any help would be appreciated.

As a sort of appendix, here is a more detailed description of the fio tests I ran, in case it is useful to someone.

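First, I tested in 4k random write mode, trying to measure the drives' maximum write IOPS. The Gen3 drive lasted much longer, even though all three drives delivered 400k IOPS (initially; the figure slowly declines over time). I repeated this many times and the result was always the same. I don't understand why, given that all three drives perform identically.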

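Then I reduced the iodepth and jobs settings to 1. With that, IOPS started at 140k. In this mode the Gen4 drives also failed (at 56 and 57 °C), but the Gen3 drive ran for 4 hours without errors. Again, the Gen3 and Gen4 drives showed the same bandwidth and IOPS.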

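Then I switched to maximum 4k IOPS. The Gen4 drives reached 1M IOPS and the Gen3 drive 700k. The Gen4 drives died within seconds, and the Gen3 drive within two minutes at most. Temperatures never exceeded 55 °C.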

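Finally I switched to maximum 4M sequential reads. The Gen4 drives reached 1700 IOPS and the Gen3 drive 850 IOPS. Here one Gen4 drive crashed first, then the Gen3 drive, and finally the other Gen4 drive. This last test was the most different from the others, lasting 15 to 20 minutes.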
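
To put the IOPS figures from these tests on a common scale, the following sketch converts them to throughput and compares them with the theoretical bandwidth of the PCIe link. It assumes that every M.2 slot is wired with four lanes (not stated above), that 4k and 4M mean 4 KiB and 4 MiB, and it only accounts for 128b/130b line encoding, so real-world ceilings are somewhat lower. All function names are illustrative.

```python
KIB = 1024
MIB = 1024 * 1024


def throughput_gb_s(iops, block_bytes):
    """Throughput in decimal GB/s for a given IOPS rate and block size."""
    return iops * block_bytes / 1e9


def link_gb_s(gt_per_s, lanes=4):
    """Theoretical PCIe Gen3/Gen4 bandwidth in GB/s with 128b/130b encoding."""
    return gt_per_s * lanes * (128 / 130) / 8


GEN3_X4 = link_gb_s(8)    # Gen3 runs at 8 GT/s per lane
GEN4_X4 = link_gb_s(16)   # Gen4 runs at 16 GT/s per lane

# (description, IOPS, block size in bytes, link ceiling in GB/s)
TESTS = [
    ("4k randwrite, all drives", 400_000, 4 * KIB, GEN3_X4),
    ("4k, iodepth=1 jobs=1", 140_000, 4 * KIB, GEN3_X4),
    ("4k max IOPS, Gen4", 1_000_000, 4 * KIB, GEN4_X4),
    ("4k max IOPS, Gen3", 700_000, 4 * KIB, GEN3_X4),
    ("4M seq read, Gen4", 1_700, 4 * MIB, GEN4_X4),
    ("4M seq read, Gen3", 850, 4 * MIB, GEN3_X4),
]

if __name__ == "__main__":
    for name, iops, block, ceiling in TESTS:
        rate = throughput_gb_s(iops, block)
        print(f"{name:26s} {rate:5.2f} GB/s ({rate / ceiling:4.0%} of link)")
```

Under these assumptions, only the 4M sequential reads come close to saturating the link on either generation, while the 4k tests that killed the drives fastest sit well below it. This matches the observation that the failures do not track raw bandwidth.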

Then I gave up, because none of this makes sense to me, so any help is welcome.
