How to isolate I/O issues to the NVMe SSD or the hardware?

Hardware:

  • Samsung 980 PRO M.2 NVMe SSD (MZ-V8P2T0BW) (2TB)
  • Beelink GTR6, with the SSD in the NVMe slot

Since the hardware arrived, I have installed Ubuntu Server on it along with a bunch of services (mostly docker, a DB, Kafka and the like).

After 2-3 days of uptime (the record is almost a week, but it's usually 2-3 days), I typically start getting buffer I/O errors on the nvme slot (which is also the boot drive): (screenshot 1)

If I'm quick enough I can still log in via SSH, but the system becomes increasingly unstable and commands start failing with I/O errors. When I do manage to log in, it seems to think there is no NVMe SSD attached at all: (screenshot 2)

Another instance of the buffer I/O errors on the nvme slot: (screenshot 3)
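
Those errors scroll by fast, but the same messages can be pulled back out of the kernel log after the next reboot (assuming journald is keeping persistent logs; -b -1 selects the previous boot):

# journalctl -k -b -1 | grep -iE 'nvme|i/o error'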

So, trying to check everything I could, I ran FSCK at boot to see if anything obvious turned up - which it commonly does after a hard reset:

# cat  /run/initramfs/fsck.log
Log of fsck -C -f -y -V -t ext4 /dev/mapper/ubuntu--vg-ubuntu--lv
Fri Dec 30 17:26:21 2022

fsck from util-linux 2.37.2
[/usr/sbin/fsck.ext4 (1) -- /dev/mapper/ubuntu--vg-ubuntu--lv] fsck.ext4 -f -y -C0 /dev/mapper/ubuntu--vg-ubuntu--lv
e2fsck 1.46.5 (30-Dec-2021)
/dev/mapper/ubuntu--vg-ubuntu--lv: recovering journal
Clearing orphaned inode 524449 (uid=1000, gid=1000, mode=0100664, size=6216)
Pass 1: Checking inodes, blocks, and sizes
Inode 6947190 extent tree (at level 1) could be shorter.  Optimize? yes

Inode 6947197 extent tree (at level 1) could be shorter.  Optimize? yes

Inode 6947204 extent tree (at level 1) could be shorter.  Optimize? yes

Inode 6947212 extent tree (at level 1) could be shorter.  Optimize? yes

Inode 6947408 extent tree (at level 1) could be shorter.  Optimize? yes

Inode 6947414 extent tree (at level 1) could be shorter.  Optimize? yes

Inode 6947829 extent tree (at level 1) could be shorter.  Optimize? yes

Inode 6947835 extent tree (at level 1) could be shorter.  Optimize? yes

Inode 6947841 extent tree (at level 1) could be shorter.  Optimize? yes

Pass 1E: Optimizing extent trees
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Free blocks count wrong (401572584, counted=405399533).
Fix? yes

Free inodes count wrong (121360470, counted=121358242).
Fix? yes


/dev/mapper/ubuntu--vg-ubuntu--lv: ***** FILE SYSTEM WAS MODIFIED *****
/dev/mapper/ubuntu--vg-ubuntu--lv: 538718/121896960 files (0.2% non-contiguous), 82178067/487577600 blocks
fsck exited with status code 1

Fri Dec 30 17:26:25 2022
----------------
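
For reference, this full check does not run on every boot; to force it on the next boot, one option (assuming systemd's fsck handling on Ubuntu, and a one-off: the parameters should be removed again afterwards) is to add fsck.mode=force fsck.repair=yes to the kernel command line, where "..." stands for whatever options are already set:

# vi /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="... fsck.mode=force fsck.repair=yes"
# update-grub
# reboot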

Running smart-log doesn't seem to show anything relevant, except for the number of unsafe shutdowns (i.e. the number of times this has happened so far)...

# nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                        : 0
temperature                             : 32 C (305 Kelvin)
available_spare                         : 100%
available_spare_threshold               : 10%
percentage_used                         : 0%
endurance group critical warning summary: 0
data_units_read                         : 8,544,896
data_units_written                      : 5,175,904
host_read_commands                      : 39,050,379
host_write_commands                     : 191,366,905
controller_busy_time                    : 1,069
power_cycles                            : 21
power_on_hours                          : 142
unsafe_shutdowns                        : 12
media_errors                            : 0
num_err_log_entries                     : 0
Warning Temperature Time                : 0
Critical Composite Temperature Time     : 0
Temperature Sensor 1                    : 32 C (305 Kelvin)
Temperature Sensor 2                    : 36 C (309 Kelvin)
Thermal Management T1 Trans Count       : 0
Thermal Management T2 Trans Count       : 0
Thermal Management T1 Total Time        : 0
Thermal Management T2 Total Time        : 0

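nvme-cli can also dump the controller's persistent error log, and many drives support an on-device self-test (assuming the 980 PRO and the installed nvme-cli version expose it; code 2 = extended test). Both are worth re-checking right after a failure, since media_errors and num_err_log_entries were 0 here:

# nvme error-log /dev/nvme0
# nvme device-self-test /dev/nvme0 --self-test-code=2
# nvme self-test-log /dev/nvme0
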
I have contacted support, and their initial suggestion, along with a list of questions, was to ask whether I had tried reinstalling the OS. I did that too: formatted the drive and reinstalled the OS (Ubuntu Server 22 LTS).

Since then the problem stayed away for 4 days, and then finally showed up again in the form of a kernel panic: (screenshot)

What can I do to determine whether the problem is the SSD itself or the hardware it's plugged into (the GTR6)? I have to return the SSD by the 31st, so I'd like to pin down the most likely culprit as early as possible...
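
One way I can think of to separate the two is to run the same sustained load with the SSD in a different machine (or a USB enclosure): if the device only drops off the bus in the GTR6, the slot or BIOS is the prime suspect. A non-destructive read pass over the whole drive (badblocks is read-only by default, though backups are still wise) would look something like:

# badblocks -b 4096 -sv /dev/nvme0n1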

I got more worried after seeing reports of other people having serious health drops with the Samsung 990 Pro: https://www.reddit.com/r/hardware/comments/10jkwwh/samsung_990_pro_ssd_with_rapid_health_drops/

Edit: I do realize those reported problems are with the 990 Pro, not my 980 Pro, though!

Edit2: Someone over on the Overclockers forum kindly suggested HD Sentinel, and it does show a health metric, which seems fine:

# ./hdsentinel-019c-x64       
Hard Disk Sentinel for LINUX console 0.19c.9986 (c) 2021 [email protected]
Start with -r [reportfile] to save data to report, -h for help

Examining hard disk configuration ...
HDD Device  0: /dev/nvme0             
HDD Model ID : Samsung SSD 980 PRO 2TB
HDD Serial No: S69ENL0T905031A
HDD Revision : 5B2QGXA7
HDD Size     : 1907729 MB
Interface    : NVMe
Temperature  : 41 °C
Highest Temp.: 41 °C
Health       : 99 %
Performance  : 100 %
Power on time: 21 days, 12 hours
Est. lifetime: more than 1000 days
Total written: 8.30 TB
  The status of the solid state disk is PERFECT. Problematic or weak sectors were not found. 
  The health is determined by SSD specific S.M.A.R.T. attribute(s):  Available Spare (Percent), Percentage Used
    No actions needed.

Finally, nothing I've tried so far (e.g. smart-log) seems to show anything like this health metric. How can I check it on Ubuntu?
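
From what I can tell, smartmontools reads the same NVMe health attributes that HD Sentinel appears to base its percentage on (Available Spare and Percentage Used), assuming the package is installed:

# apt install smartmontools
# smartctl -a /dev/nvme0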

Thanks!

Answer 1

I had the same problem with the device disappearing... It is usually there after boot, but somehow it gives the kernel (or the driver) a reason to think it is going away.

When I ran a full block check under Windows it took over 14 hours and reported 0% bad blocks... My drive was only a month old, so I expected the hardware to still be good; it had to be a driver or motherboard/BIOS interaction problem...

Example output:

[  646.205010] nvme nvme1: I/O 526 QID 2 timeout, aborting
[  646.205039] nvme nvme1: I/O 213 QID 5 timeout, aborting
[  646.264489] nvme nvme1: Abort status: 0x0
[  646.351285] nvme nvme1: Abort status: 0x0
[  676.924830] nvme nvme1: I/O 526 QID 2 timeout, reset controller
[  697.972569] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[  697.993956] pcieport 10000:e0:1b.0: can't derive routing for PCI INT A
[  697.993965] nvme 10000:e2:00.0: PCI INT A: no GSI
[  709.369577] wlp45s0: AP e0:cc:7a:98:7d:d4 changed bandwidth, new config is 2432.000 MHz, width 2 (2442.000/0 MHz)
[  718.496375] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[  718.496381] nvme nvme1: Removing after probe failure status: -19
[  739.020199] nvme nvme1: Device not ready; aborting reset, CSTS=0x1
[  739.020477] nvme1n1: detected capacity change from 2000409264 to 0
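
A workaround that is often suggested for these "Device not ready; aborting reset" timeouts is disabling NVMe power-state transitions and PCIe link power management via kernel parameters, on the assumption that APST/ASPM is the trigger (it was not conclusive in my case); "..." stands for the options already present:

# vi /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="... nvme_core.default_ps_max_latency_us=0 pcie_aspm=off"
# update-grub
# reboot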

I have since tried this: echo 10000:e2:00.0 > /sys/bus/pci/drivers/nvme/bind

When I do that, the "lost" device is enumerated correctly by lspci (10000:e2:00.0 Non-Volatile memory controller: ADATA Technology Co., Ltd. Device 5766 (rev 01))

But it does not show up in lsblk, and I don't know how to proceed from here...

dmesg output after re-binding the drive:

[14893.259570] nvme nvme2: pci function 10000:e2:00.0
[14893.259678] pcieport 10000:e0:1b.0: can't derive routing for PCI INT A
[14893.259685] nvme 10000:e2:00.0: PCI INT A: no GSI
[14913.760764] nvme nvme2: Device not ready; aborting reset, CSTS=0x1
[14913.760771] nvme nvme2: Removing after probe failure status: -19
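
Another thing that can be tried at this point (a sketch I did not get to verify before swapping hardware) is a full PCI remove-and-rescan rather than just re-binding the driver, so the kernel re-probes the device from scratch:

# echo 1 > /sys/bus/pci/devices/10000:e2:00.0/remove
# echo 1 > /sys/bus/pci/rescan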

In the end I got a new module, put it in the same slot (replacing the old one), and everything works.

Conclusion: [most likely] a bad NVMe stick. It happens, and I'd bet it's the same with your setup.
