Can an SSD fail without reporting bad sectors?
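Last night I accidentally filled up my SSD, and then started getting persistent errors at boot. I quickly freed up more than 40 GB, but I am still having problems:
- Boot tries to recover the journal, then freezes (subsequent reboots usually do not need journal recovery)
- The SSD starts failing 1-30 minutes after boot
- The terminal shows input/output errors for any command that is not a shell builtin
- Plasma crashes
- The network connection drops
- ctrl+alt+f2 hangs (no chance to log in)
I booted a live disk and checked the SMART status and ran badblocks, but everything looks fine.
SMART test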
[liveuser@localhost ~]$ sudo smartctl -H /dev/nvme0n1p3
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.5.5-300.fc24.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
[liveuser@localhost ~]$ sudo nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 49 C
available_spare : 100%
available_spare_threshold : 10%
percentage_used : 0%
data_units_read : 6,371,635
data_units_written : 5,739,422
host_read_commands : 45,594,657
host_write_commands : 67,766,367
controller_busy_time : 193
power_cycles : 124
power_on_hours : 478
unsafe_shutdowns : 20
media_errors : 0
num_err_log_entries : 1
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1 : 49 C
Temperature Sensor 2 : 58 C
Temperature Sensor 3 : 0 C
Temperature Sensor 4 : 0 C
Temperature Sensor 5 : 0 C
Temperature Sensor 6 : 0 C
Temperature Sensor 7 : 0 C
Temperature Sensor 8 : 0 C
[liveuser@localhost ~]$ sudo nvme smart-log-add /dev/nvme0n1
NVMe Status:INVALID_LOG_PAGE(2109)
[liveuser@localhost nvmetest]$ sudo smartctl -a /dev/nvme0n1p3
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.5.5-300.fc24.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO Plus 1TB
Serial Number: S59ANM0R108267E
Firmware Version: 2B2QEXM7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 1,000,204,886,016 [1.00 TB]
Namespace 1 Utilization: 897,457,152,000 [897 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Thu Jun 3 14:32:39 2021 EDT
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL *Other*
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.80W - - 0 0 0 0 0 0
1 + 6.00W - - 1 1 1 1 0 0
2 + 3.40W - - 2 2 2 2 0 0
3 - 0.0700W - - 3 3 3 3 210 1200
4 - 0.0100W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning: 0x00
Temperature: 62 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 10,303,901 [5.27 TB]
Data Units Written: 5,869,634 [3.00 TB]
Host Read Commands: 53,748,047
Host Write Commands: 75,706,248
Controller Busy Time: 278
Power Cycles: 138
Power On Hours: 508
Unsafe Shutdowns: 28
Media and Data Integrity Errors: 0
Error Information Log Entries: 2
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 62 Celsius
Temperature Sensor 2: 81 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
Badblocks
[liveuser@localhost ~]$ sudo badblocks -v /dev/nvme0n1p3 > badsectors
Checking blocks 0 to 972042950
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)
[liveuser@localhost ~]$ sudo badblocks -v /dev/nvme0n1p2 > badsectors.swap
Checking blocks 0 to 4194303
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)
[liveuser@localhost ~]$ ls -l
total 32
-rw-rw-r--. 1 liveuser liveuser 0 Jun 1 22:59 badsectors
-rw-rw-r--. 1 liveuser liveuser 0 Jun 1 23:38 badsectors.swap
Filesystem check
[liveuser@localhost e2fsck]$ sudo ./e2fsck -p /dev/nvme0n1p3
/dev/nvme0n1p3: recovering journal
/dev/nvme0n1p3: Clearing orphaned inode 10395651 (uid=1000, gid=1000, mode=040700, size=4096)
/dev/nvme0n1p3: Clearing orphaned inode 10377883 (uid=1000, gid=1000, mode=0100600, size=4194304)
/dev/nvme0n1p3: Clearing orphaned inode 9963223 (uid=1000, gid=1000, mode=0100644, size=45036)
/dev/nvme0n1p3: clean, 781268/60760064 files, 217936225/243010737 blocks
[liveuser@localhost e2fsck]$ sudo ./e2fsck -p /dev/nvme0n1p3
/dev/nvme0n1p3: clean, 781268/60760064 files, 217936225/243010737 blocks
[liveuser@localhost e2fsck]$ sudo ./e2fsck -f /dev/nvme0n1p3
e2fsck 1.46.2 (28-Feb-2021)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p3: 781268/60760064 files (1.0% non-contiguous), 217936225/243010737 blocks
[liveuser@localhost e2fsck]$ sudo ./e2fsck -c /dev/nvme0n1p3
e2fsck 1.46.2 (28-Feb-2021)
Checking for bad blocks (read-only test): 0.00% done, 0:00 elapsed. (0/0/0 errdone
/dev/nvme0n1p3: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p3: ***** FILE SYSTEM WAS MODIFIED *****
/dev/nvme0n1p3: 781268/60760064 files (1.0% non-contiguous), 217936225/243010737 blocks
[liveuser@localhost e2fsck]$ sudo ./e2fsck -cvf /dev/nvme0n1p3
e2fsck 1.46.2 (28-Feb-2021)
Checking for bad blocks (read-only test): 0.00% done, 0:00 elapsed. (0/0/0 errdone
/dev/nvme0n1p3: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p3: ***** FILE SYSTEM WAS MODIFIED *****
781268 inodes used (1.29%, out of 60760064)
6973 non-contiguous files (0.9%)
470 non-contiguous directories (0.1%)
# of inodes with ind/dind/tind blocks: 0/0/0
Extent depth histogram: 747824/2073
217936225 blocks used (89.68%, out of 243010737)
0 bad blocks
21 large files
684923 regular files
58236 directories
0 character device files
0 block device files
1 fifo
1957 links
38095 symbolic links (31358 fast symbolic links)
4 sockets
------------
783216 files
I tried checking journalctl, but it seems that either no errors were reported in time or they were not flushed to disk before the failure.
[liveuser@localhost ~]$ sudo nvme list
Node SN Model Namespace Usage Format FW Rev
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1 S59ANM0R108267E Samsung SSD 970 EVO Plus 1TB 1 976.58 GB / 1.00 TB 512 B + 0 B 2B2QEXM7
[liveuser@localhost ~]$ sudo fstrim --verbose nvme
nvme: 95.5 GiB (102578487296 bytes) trimmed
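Answer 1
After running for a week without problems, I believe my issue is related to the APST bug reported here: https://bugzilla.kernel.org/show_bug.cgi?id=195039
I have disabled APST with the kernel boot parameter below and now have no problems (but obviously I need to find a better solution than disabling it entirely):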
nvme_core.default_ps_max_latency_us=0
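A minimal sketch for confirming that the parameter actually reached the kernel. The function name `apst_latency_from_cmdline` is made up for illustration; it only parses a kernel command line (by default `/proc/cmdline`) and changes nothing. The `grubby` line in the comments is an assumption that applies to Fedora-style systems, so adapt it to your bootloader.

```shell
#!/bin/sh
# Hypothetical helper: print the value of nvme_core.default_ps_max_latency_us
# found in a kernel command line file (default: /proc/cmdline).
# Prints nothing and returns 1 if the parameter is absent.
apst_latency_from_cmdline() {
    cmdline_file="${1:-/proc/cmdline}"
    # Split the command line into one argument per line, then keep
    # the value of our parameter (the last one wins if repeated).
    value=$(tr ' ' '\n' < "$cmdline_file" \
        | sed -n 's/^nvme_core\.default_ps_max_latency_us=//p' \
        | tail -n 1)
    [ -n "$value" ] || return 1
    printf '%s\n' "$value"
}

# Example usage on a running system:
#   apst_latency_from_cmdline && echo "parameter is on the command line"
# The value the driver is using can also be read from sysfs:
#   cat /sys/module/nvme_core/parameters/default_ps_max_latency_us
# One way to persist the parameter (assumption: the system uses grubby):
#   sudo grubby --update-kernel=ALL --args="nvme_core.default_ps_max_latency_us=0"
```

A value of 0 means no power-state transition latency is tolerated, which is what disables APST here.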
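I got more information by booting a live disk, mounting the drive, chrooting into it, and running an IO stress test (I used Phoronix) until the drive started failing. I was then able to exit the chroot and read dmesg, where I found a large number of errors similar to this: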
EXT4-fs error (device nvme0n1p3): ext4_find_entry
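To pick such lines out of a long kernel log, a small filter along these lines may help. This is only a sketch: the function name `filter_storage_errors` is hypothetical, and apart from the `EXT4-fs error` message quoted above, the patterns are my assumptions about what NVMe-related kernel messages typically look like, so widen them as needed.

```shell
#!/bin/sh
# Hypothetical helper: keep only kernel log lines that look like
# ext4 or NVMe trouble. Reads log text on stdin.
# The patterns are illustrative assumptions, not an exhaustive list.
filter_storage_errors() {
    # "|| true" keeps the exit status 0 when nothing matches.
    grep -E 'EXT4-fs error|nvme nvme[0-9]+:|I/O error, dev nvme' || true
}

# Example usage from the live environment, outside the chroot:
#   sudo dmesg | filter_storage_errors
# If the journal is persistent, kernel messages from the previous boot
# can be filtered the same way:
#   sudo journalctl -k -b -1 | filter_storage_errors
```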
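Answer 2
Can an SSD fail without reporting bad sectors?
Yes. Anomalies in error reporting (such as SMART) correlate with higher drive failure rates, but not every failure is preceded by an error report.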
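Even when the overall health result is PASSED, individual counters can still move. In the question's own output, unsafe shutdowns rose from 20 to 28 and the error log entry count from 1 to 2 between the two readings. As a rough sketch, a counter can be pulled out of saved `nvme smart-log` output and compared across snapshots. The function name `smart_field` is hypothetical, and the parser assumes the `name : value` layout shown in the question, which may differ between nvme-cli versions.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one field from `nvme smart-log`
# text supplied on stdin, e.g. "unsafe_shutdowns : 20" prints 20.
# Assumes the "name : value" layout shown in the question.
smart_field() {
    awk -F':' -v key="$1" '
        {
            name = $1
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim the field name
            if (name == key) {
                val = $2
                gsub(/[ \t,]/, "", val)         # drop spaces and thousands separators
                print val
                exit
            }
        }'
}

# Example usage: save a snapshot now and another one later, then compare.
#   sudo nvme smart-log /dev/nvme0n1 > smart.before
#   ... run the workload ...
#   sudo nvme smart-log /dev/nvme0n1 > smart.after
#   smart_field unsafe_shutdowns < smart.before
#   smart_field unsafe_shutdowns < smart.after
```

A rising counter does not prove the drive is failing, but it is a cheap signal to watch alongside the kernel log.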