RAID array shows "critical medium error" but smartctl says the disks are healthy - what should I do next?

2024-6-5

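I have a RAID-1 array of SSDs (Samsung 970 EVO Plus) that is logging errors in /var/log/syslog, yet smartctl reports that the drives are fine. I have done a fair amount of diagnosis (below) and would like to know what else I can do. Is something actually going wrong, and if so, what is the best way to handle it? (This is on Kubuntu 18.04.6 LTS.)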

The array is as follows:

$ cat /proc/mdstat
md1 : active raid1 nvme0n1p3[0] nvme1n1p3[2]
      1919724608 blocks super 1.2 [2/2] [UU]
      bitmap: 5/15 pages [20KB], 65536KB chunk

According to mdadm, it looks healthy:

$ sudo mdadm --detail /dev/md1
/dev/md1:
           Version : 1.2
     Creation Time : Sat Feb 29 12:33:09 2020
        Raid Level : raid1
        Array Size : 1919724608 (1830.79 GiB 1965.80 GB)
     Used Dev Size : 1919724608 (1830.79 GiB 1965.80 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Fri Dec 31 14:04:55 2021
             State : clean 
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

              Name : kubuntu:1
              UUID : 7c84adca:31e96bad:b1be03ae:d7d0349d
            Events : 41087

    Number   Major   Minor   RaidDevice State
       0     259        3        0      active sync   /dev/nvme0n1p3
       2     259        7        1      active sync   /dev/nvme1n1p3

However, read errors have started appearing in /var/log/syslog, in groups of three lines:

Dec 31 12:32:56  kernel: [662973.969218] blk_update_request: critical medium error, dev nvme1n1, sector 2769948928 op 0x0:(READ) flags 0x0 phys_seg 9 prio class 0
Dec 31 12:32:56  kernel: [662973.969222] md/raid1:md1: nvme1n1p3: rescheduling sector 2702369024
Dec 31 12:32:56  kernel: [662973.978792] md/raid1:md1: redirecting sector 2702369024 to other mirror: nvme0n1p3

Dec 31 12:43:11  kernel: [663588.474940] blk_update_request: critical medium error, dev nvme0n1, sector 1815443200 op 0x0:(READ) flags 0x0 phys_seg 33 prio class 0
Dec 31 12:43:11  kernel: [663588.474943] md/raid1:md1: nvme0n1p3: rescheduling sector 1747863296
Dec 31 12:43:11  kernel: [663588.499466] md/raid1:md1: redirecting sector 1747863296 to other mirror: nvme0n1p3

Sometimes these are followed by:

kernel: [313519.337578] md/raid1:md1: read error corrected (8 sectors at 1367197592 on nvme1n1p3)
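
As a sanity check on the triplets above, here is a rough Python sketch that compares the raw device sector in each "critical medium error" line with the md1 sector in the "rescheduling" line that follows it. The log lines are copied from above; the regular expressions and the function name sector_offsets are my own. If the difference is the same on both drives, it is presumably just the partition start plus the md data offset, though I have not checked that against the partition table.

```python
import re

# Log lines copied from the syslog excerpt above (timestamps removed).
LOG = """\
blk_update_request: critical medium error, dev nvme1n1, sector 2769948928 op 0x0:(READ) flags 0x0 phys_seg 9 prio class 0
md/raid1:md1: nvme1n1p3: rescheduling sector 2702369024
blk_update_request: critical medium error, dev nvme0n1, sector 1815443200 op 0x0:(READ) flags 0x0 phys_seg 33 prio class 0
md/raid1:md1: nvme0n1p3: rescheduling sector 1747863296
"""

DEV_RE = re.compile(r"critical medium error, dev (\w+), sector (\d+)")
MD_RE = re.compile(r"md/raid1:\w+: (\w+): rescheduling sector (\d+)")


def sector_offsets(log_text):
    """Return {device: device_sector - md_sector} for each error pair."""
    offsets = {}
    pending = None  # (device, device_sector) waiting for its md line
    for line in log_text.splitlines():
        m = DEV_RE.search(line)
        if m:
            pending = (m.group(1), int(m.group(2)))
            continue
        m = MD_RE.search(line)
        # Only pair the md line with an error on the same physical device.
        if m and pending and m.group(1).startswith(pending[0]):
            offsets[pending[0]] = pending[1] - int(m.group(2))
            pending = None
    return offsets


offsets = sector_offsets(LOG)
print(offsets)
```

Both drives give the same difference, 67579904 sectors, so the device-level errors and the md-level messages do seem to refer to the same locations.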

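I turned to smartctl to look for the problem. It shows that errors have occurred in the past, but it also says "SMART overall-health self-assessment test result: PASSED".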

For /dev/nvme0n1:

$ sudo smartctl -a /dev/nvme0n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.4.0-91-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 2TB
Serial Number:                      S464NB0M406242D
Firmware Version:                   2B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization:            1,017,558,851,584 [1.01 TB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Fri Dec 31 14:01:33 2021 EST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:                   0x00
Temperature:                        46 Celsius
Available Spare:                    73%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    232,548,547 [119 TB]
Data Units Written:                 58,761,625 [30.0 TB]
Host Read Commands:                 1,144,416,417
Host Write Commands:                1,551,430,546
Controller Busy Time:               7,250
Power Cycles:                       114
Power On Hours:                     6,365
Unsafe Shutdowns:                   73
Media and Data Integrity Errors:    694
Error Information Log Entries:      926
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               46 Celsius
Temperature Sensor 2:               50 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        926    28  0x0370  0xc502  0x000   3738332404     1     -
  1        925     6  0x015b  0xc502  0x000   2503721366     1     -
  2        924    22  0x0000  0xc502  0x000   1963251598     1     -
  3        923    11  0x038a  0xc502  0x000   1862557082     1     -
  4        922    16  0x00d1  0xc502  0x000   1862557082     1     -
  5        921     6  0x0141  0xc502  0x000   1826459600     1     -
  6        920    20  0x03b5  0xc502  0x000   1815443442     1     -
  7        919     8  0x034d  0xc502  0x000   2588273810     1     -
  8        918    11  0x0315  0xc502  0x000   2583041964     1     -
  9        917     9  0x02e3  0xc502  0x000   2583041964     1     -
 10        916    11  0x030e  0xc502  0x000   2583023500     1     -
 11        915    11  0x0308  0xc502  0x000   2583023468     1     -
 12        914    11  0x033a  0xc502  0x000   2583023500     1     -
 13        913     9  0x02ec  0xc502  0x000   2583023468     1     -
 14        912    14  0x03d2  0xc502  0x000   2472005420     1     -
 15        911    23  0x00cd  0xc502  0x000   2444721868     1     -
... (32 entries not shown)
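
Every entry in the error log above carries the same Status value, 0xc502. The sketch below splits that value into its fields. The bit layout is my reading of the NVMe base specification (bit 0 phase tag, bits 8:1 status code, bits 11:9 status code type, bit 14 "More", bit 15 "Do Not Retry") and should be treated as an assumption; the function name decode_nvme_status is hypothetical.

```python
def decode_nvme_status(raw):
    """Split the 16-bit Status value from the NVMe error log into fields.

    Assumed layout: bit 0 = phase tag, bits 8:1 = Status Code,
    bits 11:9 = Status Code Type, bit 14 = More, bit 15 = Do Not Retry.
    """
    return {
        "phase_tag": raw & 0x1,
        "status_code": (raw >> 1) & 0xFF,
        "status_code_type": (raw >> 9) & 0x7,
        "more": (raw >> 14) & 0x1,
        "do_not_retry": (raw >> 15) & 0x1,
    }


fields = decode_nvme_status(0xC502)
print({name: hex(value) for name, value in fields.items()})
```

For 0xc502 this gives status code 0x81 with status code type 0x2. If I read the specification correctly, that is "Unrecovered Read Error" in the "Media and Data Integrity Errors" group, which would match the kernel's "critical medium error".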

For /dev/nvme1n1:

$ sudo smartctl -a /dev/nvme1n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.4.0-91-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO 2TB
Serial Number:                      S464NB0M403333H
Firmware Version:                   2B2QEXE7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization:            1,044,938,612,736 [1.04 TB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Fri Dec 31 14:03:07 2021 EST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning:                   0x00
Temperature:                        45 Celsius
Available Spare:                    81%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    180,057,901 [92.1 TB]
Data Units Written:                 77,700,415 [39.7 TB]
Host Read Commands:                 801,630,346
Host Write Commands:                1,566,190,001
Controller Busy Time:               6,925
Power Cycles:                       156
Power On Hours:                     6,260
Unsafe Shutdowns:                   86
Media and Data Integrity Errors:    721
Error Information Log Entries:      1,015
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               45 Celsius
Temperature Sensor 2:               52 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0       1015    22  0x0178  0xc502  0x000   2395920012     1     -
  1       1014    31  0x02d6  0xc502  0x000   2065018576     1     -
  2       1013    10  0x004e  0xc502  0x000   1928508102     1     -
  3       1012     6  0x02aa  0xc502  0x000   2769949126     1     -
  4       1011    27  0x0204  0xc502  0x000   2180665946     1     -
  5       1010    27  0x023b  0xc502  0x000   2180598396     1     -
  6       1009    14  0x00ee  0xc502  0x000   2562333810     1     -
  7       1008    13  0x0075  0xc502  0x000   2423243572     1     -
  8       1007    30  0x03bb  0xc502  0x000   2326927278     1     -
  9       1006    24  0x03e6  0xc502  0x000   1775468746     1     -
 10       1005    16  0x0066  0xc502  0x000   1775468746     1     -
 11       1004    23  0x0148  0xc502  0x000   2813092280     1     -
 12       1003    26  0x02fa  0xc502  0x000   2452856518     1     -
 13       1002     5  0x03b1  0xc502  0x000   2119789206     1     -
 14       1001    27  0x009b  0xc502  0x000   3047371772     1     -
 15       1000     5  0x036c  0xc502  0x000   3047371772     1     -
... (5 entries not shown)
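
To put the "PASSED" line next to the counters that worry me, here is a small Python sketch that pulls a few fields out of the smartctl output for both drives. The excerpts are copied from the output above; the helper names parse_fields and summarize are my own. My understanding, which I have not confirmed in the smartmontools source, is that for NVMe drives the PASSED line reflects only the Critical Warning byte, so it can read PASSED while the media error counter is non-zero.

```python
# Lines copied from the two `smartctl -a` outputs above.
SMART_EXCERPTS = {
    "nvme0n1": """\
SMART overall-health self-assessment test result: PASSED
Critical Warning:                   0x00
Available Spare:                    73%
Available Spare Threshold:          10%
Media and Data Integrity Errors:    694
""",
    "nvme1n1": """\
SMART overall-health self-assessment test result: PASSED
Critical Warning:                   0x00
Available Spare:                    81%
Available Spare Threshold:          10%
Media and Data Integrity Errors:    721
""",
}


def parse_fields(text):
    """Turn 'Key:   value' lines into a dict of stripped strings."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def summarize(text):
    """Extract the handful of health fields discussed in this question."""
    f = parse_fields(text)
    return {
        "overall": f["SMART overall-health self-assessment test result"],
        "critical_warning": int(f["Critical Warning"], 16),
        "spare_pct": int(f["Available Spare"].rstrip("%")),
        "spare_threshold_pct": int(f["Available Spare Threshold"].rstrip("%")),
        "media_errors": int(f["Media and Data Integrity Errors"].replace(",", "")),
    }


summaries = {dev: summarize(text) for dev, text in SMART_EXCERPTS.items()}
for dev, summary in summaries.items():
    print(dev, summary)
```

On both drives the overall result is PASSED and Critical Warning is 0, the available spare is still well above its 10% threshold, and yet several hundred media and data integrity errors have been counted.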

Neither drive appears to support self-tests (smartctl -c does not list any self-tests at all).

$ sudo smartctl -c /dev/nvme0n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.4.0-91-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL *Other*
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     82 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.20W       -        -    0  0  0  0        0       0
 1 +     4.30W       -        -    1  1  1  1        0       0
 2 +     2.10W       -        -    2  2  2  2        0       0
 3 -   0.0400W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

Update to my question:

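Some of the errors appear to be caused by the checkarray script, which runs once a month, because those errors start on "the first Sunday of the month, at 01:06 AM". "man md" adds: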

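[On] RAID1, it is possible for software issues to cause a mismatch [between the two disks] to be reported. This does not necessarily mean that the data on the array is corrupted. It could simply be that the system does not care what is stored on that part of the array - it is unused space. The most likely cause for an unexpected mismatch on RAID1 or RAID10 occurs if a swap partition or swap file is stored on the array.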
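
The mismatches that passage talks about are counted by the kernel in sysfs, so after a check I can at least read the counter back. A minimal sketch, assuming the array is md1 and that the usual md sysfs files sync_action and mismatch_cnt are present; the function name read_md_check_state and its parameters are my own:

```python
from pathlib import Path


def read_md_check_state(md_name="md1", sysfs_root="/sys/block"):
    """Read the current sync action and mismatch counter of an md array.

    Assumes the standard md sysfs layout: <sysfs_root>/<md_name>/md/.
    """
    md_dir = Path(sysfs_root) / md_name / "md"
    return {
        "sync_action": (md_dir / "sync_action").read_text().strip(),
        "mismatch_cnt": int((md_dir / "mismatch_cnt").read_text().strip()),
    }


# Example use on the affected machine:
#     print(read_md_check_state("md1"))
```

A non-zero mismatch_cnt after a check would be a separate question from the read errors above: the quoted passage is about mirrors that disagree, whereas the syslog lines are about sectors a drive could not read at all.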

What should I do next? Many thanks.
