SSD failing with no bad sectors?

Is it possible for an SSD to fail without reporting any bad sectors?

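Last night I accidentally filled up my SSD and then started getting persistent errors at boot. I quickly freed up more than 40 GB, but I am still having problems: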

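  • Journal recovery is attempted at boot but freezes (subsequent reboots usually do not need journal recovery)
  • The SSD starts failing 1-30 minutes after boot
  • The terminal reports input/output errors for any command that is not a shell builtin
  • Plasma crashes
  • The network connection drops
  • Ctrl+Alt+F2 hangs (no chance to log in)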

I booted a live disk and checked the SMART status and bad blocks, but everything looks fine.

SMART tests

[liveuser@localhost ~]$ sudo smartctl -H /dev/nvme0n1p3
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.5.5-300.fc24.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

[liveuser@localhost ~]$ sudo nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 49 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 0%
data_units_read                     : 6,371,635
data_units_written                  : 5,739,422
host_read_commands                  : 45,594,657
host_write_commands                 : 67,766,367
controller_busy_time                : 193
power_cycles                        : 124
power_on_hours                      : 478
unsafe_shutdowns                    : 20
media_errors                        : 0
num_err_log_entries                 : 1
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 49 C
Temperature Sensor 2                : 58 C
Temperature Sensor 3                : 0 C
Temperature Sensor 4                : 0 C
Temperature Sensor 5                : 0 C
Temperature Sensor 6                : 0 C
Temperature Sensor 7                : 0 C
Temperature Sensor 8                : 0 C

[liveuser@localhost ~]$ sudo nvme smart-log-add /dev/nvme0n1
NVMe Status:INVALID_LOG_PAGE(2109)


[liveuser@localhost nvmetest]$ sudo smartctl -a /dev/nvme0n1p3
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-4.5.5-300.fc24.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO Plus 1TB
Serial Number:                      S59ANM0R108267E
Firmware Version:                   2B2QEXM7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 1,000,204,886,016 [1.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Utilization:            897,457,152,000 [897 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Thu Jun  3 14:32:39 2021 EDT
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.80W       -        -    0  0  0  0        0       0
 1 +     6.00W       -        -    1  1  1  1        0       0
 2 +     3.40W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3      210    1200
 4 -   0.0100W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:                   0x00
Temperature:                        62 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    10,303,901 [5.27 TB]
Data Units Written:                 5,869,634 [3.00 TB]
Host Read Commands:                 53,748,047
Host Write Commands:                75,706,248
Controller Busy Time:               278
Power Cycles:                       138
Power On Hours:                     508
Unsafe Shutdowns:                   28
Media and Data Integrity Errors:    0
Error Information Log Entries:      2
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               62 Celsius
Temperature Sensor 2:               81 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

Bad blocks

[liveuser@localhost ~]$ sudo badblocks -v /dev/nvme0n1p3 > badsectors
Checking blocks 0 to 972042950
Checking for bad blocks (read-only test): done                                                
Pass completed, 0 bad blocks found. (0/0/0 errors)

[liveuser@localhost ~]$ sudo badblocks -v /dev/nvme0n1p2 > badsectors.swap
Checking blocks 0 to 4194303
Checking for bad blocks (read-only test): done                                                
Pass completed, 0 bad blocks found. (0/0/0 errors)

[liveuser@localhost ~]$ ls -l
total 32
-rw-rw-r--. 1 liveuser liveuser    0 Jun  1 22:59 badsectors
-rw-rw-r--. 1 liveuser liveuser    0 Jun  1 23:38 badsectors.swap

File system check


[liveuser@localhost e2fsck]$ sudo ./e2fsck -p /dev/nvme0n1p3
/dev/nvme0n1p3: recovering journal
/dev/nvme0n1p3: Clearing orphaned inode 10395651 (uid=1000, gid=1000, mode=040700, size=4096)
/dev/nvme0n1p3: Clearing orphaned inode 10377883 (uid=1000, gid=1000, mode=0100600, size=4194304)
/dev/nvme0n1p3: Clearing orphaned inode 9963223 (uid=1000, gid=1000, mode=0100644, size=45036)
/dev/nvme0n1p3: clean, 781268/60760064 files, 217936225/243010737 blocks
[liveuser@localhost e2fsck]$ sudo ./e2fsck -p /dev/nvme0n1p3
/dev/nvme0n1p3: clean, 781268/60760064 files, 217936225/243010737 blocks


[liveuser@localhost e2fsck]$ sudo ./e2fsck -f /dev/nvme0n1p3
e2fsck 1.46.2 (28-Feb-2021)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/nvme0n1p3: 781268/60760064 files (1.0% non-contiguous), 217936225/243010737 blocks


[liveuser@localhost e2fsck]$ sudo ./e2fsck -c /dev/nvme0n1p3
e2fsck 1.46.2 (28-Feb-2021)
Checking for bad blocks (read-only test):   0.00% done, 0:00 elapsed. (0/0/0 errdone                                                
/dev/nvme0n1p3: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/nvme0n1p3: ***** FILE SYSTEM WAS MODIFIED *****
/dev/nvme0n1p3: 781268/60760064 files (1.0% non-contiguous), 217936225/243010737 blocks


[liveuser@localhost e2fsck]$ sudo ./e2fsck -cvf /dev/nvme0n1p3
e2fsck 1.46.2 (28-Feb-2021)
Checking for bad blocks (read-only test):   0.00% done, 0:00 elapsed. (0/0/0 errdone                                                
/dev/nvme0n1p3: Updating bad block inode.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information

/dev/nvme0n1p3: ***** FILE SYSTEM WAS MODIFIED *****

      781268 inodes used (1.29%, out of 60760064)
        6973 non-contiguous files (0.9%)
         470 non-contiguous directories (0.1%)
             # of inodes with ind/dind/tind blocks: 0/0/0
             Extent depth histogram: 747824/2073
   217936225 blocks used (89.68%, out of 243010737)
           0 bad blocks
          21 large files

      684923 regular files
       58236 directories
           0 character device files
           0 block device files
           1 fifo
        1957 links
       38095 symbolic links (31358 fast symbolic links)
           4 sockets
------------
      783216 files

I tried checking journalctl, but it does not seem to record any errors in time, or it fails to flush them to disk before the failure.

[liveuser@localhost ~]$ sudo nvme list
Node             SN                   Model                                    Namespace Usage                      Format           FW Rev 
---------------- -------------------- ---------------------------------------- --------- -------------------------- ---------------- --------
/dev/nvme0n1     S59ANM0R108267E      Samsung SSD 970 EVO Plus 1TB             1         976.58  GB /   1.00  TB    512   B +  0 B   2B2QEXM7



[liveuser@localhost ~]$ sudo fstrim --verbose nvme
nvme: 95.5 GiB (102578487296 bytes) trimmed

Answer 1

After running for a week without problems, I believe my issue was related to the APST bug reported here: https://bugzilla.kernel.org/show_bug.cgi?id=195039

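I disabled APST with the kernel boot parameter below and have had no problems since (though a better solution than disabling it entirely is clearly needed):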

nvme_core.default_ps_max_latency_us=0
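
To confirm after a reboot that the parameter actually took effect, the value the running kernel uses can be read back from sysfs. Below is a minimal Python sketch. It assumes the `nvme_core` module exposes the parameter at `/sys/module/nvme_core/parameters/default_ps_max_latency_us`, the usual place for module parameters. The helper name `describe_apst_setting` is made up for this illustration.

```python
from pathlib import Path

# Assumed sysfs location of the nvme_core module parameter.
PARAM_PATH = Path("/sys/module/nvme_core/parameters/default_ps_max_latency_us")


def describe_apst_setting(raw: str) -> str:
    """Interpret the raw text of default_ps_max_latency_us."""
    latency_us = int(raw.strip())
    if latency_us == 0:
        return "APST disabled (max latency 0 us)"
    return f"APST enabled; power states with latency up to {latency_us} us may be used"


if __name__ == "__main__":
    if PARAM_PATH.exists():
        print(describe_apst_setting(PARAM_PATH.read_text()))
    else:
        print(f"{PARAM_PATH} not found; is nvme_core loaded?")
```

If the script reports anything other than a disabled state, the boot parameter was probably not picked up by the bootloader configuration.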

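I got more information by booting a live disk, mounting the drive, chrooting into it, and running an I/O stress test (I used Phoronix) until the drive started failing. I was then able to exit the chroot and read dmesg, where I found a large number of errors similar to this: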

EXT4-fs error (device nvme0n1p3): ext4_find_entry
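
Because the kernel log can be long, it may help to filter it for storage-related lines while the stress test runs. Below is a rough Python sketch. The patterns and the function name `find_storage_errors` are assumptions for illustration, not an exhaustive list of what the kernel can print.

```python
import re

# Illustrative patterns only; adjust them for your own device names.
STORAGE_ERROR = re.compile(r"EXT4-fs error|I/O error|nvme\d+", re.IGNORECASE)


def find_storage_errors(lines):
    """Return the log lines that match any storage-related pattern."""
    return [line.rstrip("\n") for line in lines if STORAGE_ERROR.search(line)]


# The first line is the error quoted above; the second is a made-up
# unrelated line to show that it gets filtered out.
sample = [
    "EXT4-fs error (device nvme0n1p3): ext4_find_entry",
    "unrelated message",
]
print(find_storage_errors(sample))
# prints ['EXT4-fs error (device nvme0n1p3): ext4_find_entry']
```

To use it on real data, save the log with `dmesg > dmesg.txt` from the live environment and pass `open("dmesg.txt")` to the function.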

Answer 2

Is it possible for an SSD to fail without reporting any bad sectors?

Yes. Anomalies in error reporting (such as SMART) correlate with higher drive failure rates, but not every failure is preceded by a reported error.
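
As an illustration, the counters in the question's `nvme smart-log` output can be checked mechanically. The Python sketch below parses that `key : value` format and lists the nonzero counters. The choice of counters in `WATCHED` is an assumption for illustration, not an official set of failure predictors, and both function names are hypothetical.

```python
import re

# Counters worth a second look when nonzero. This selection is an assumption
# made for illustration, not an official list of failure predictors.
WATCHED = ("critical_warning", "media_errors", "num_err_log_entries", "unsafe_shutdowns")


def parse_smart_log(text):
    """Parse `nvme smart-log` style 'key : value' lines into {key: int}."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        match = re.match(r"\d[\d,]*", value.strip())
        if sep and match:
            fields[key.strip()] = int(match.group().replace(",", ""))
    return fields


def nonzero_counters(fields):
    """Return the watched counters whose value is greater than zero."""
    return {name: fields[name] for name in WATCHED if fields.get(name, 0) > 0}


# Excerpt of the smart-log output from the question.
excerpt = """\
critical_warning                    : 0
available_spare                     : 100%
unsafe_shutdowns                    : 20
media_errors                        : 0
num_err_log_entries                 : 1
"""
print(nonzero_counters(parse_smart_log(excerpt)))
# prints {'num_err_log_entries': 1, 'unsafe_shutdowns': 20}
```

On the question's data, `media_errors` and `critical_warning` are both zero even though the system was throwing I/O errors, which fits the point above: a clean SMART report does not rule out a problem, and nonzero counters like these are hints rather than proof.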
