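I have a RAID-1 array of two SSDs (Samsung 970 EVO Plus) that shows errors in /var/log/syslog, but smartctl reports that the drives are healthy. I have done a fair amount of diagnostics (below) and would like to know what else I can do. Is there actually a problem and, if so, what is the best way to handle it? (This is on Kubuntu 18.04.6 LTS.)
The array looks like this: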
$ cat /proc/mdstat
md1 : active raid1 nvme0n1p3[0] nvme1n1p3[2]
1919724608 blocks super 1.2 [2/2] [UU]
bitmap: 5/15 pages [20KB], 65536KB chunk
According to mdadm, it looks healthy:
$ sudo mdadm --detail /dev/md1
/dev/md1:
Version : 1.2
Creation Time : Sat Feb 29 12:33:09 2020
Raid Level : raid1
Array Size : 1919724608 (1830.79 GiB 1965.80 GB)
Used Dev Size : 1919724608 (1830.79 GiB 1965.80 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Fri Dec 31 14:04:55 2021
State : clean
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Consistency Policy : bitmap
Name : kubuntu:1
UUID : 7c84adca:31e96bad:b1be03ae:d7d0349d
Events : 41087
Number Major Minor RaidDevice State
0 259 3 0 active sync /dev/nvme0n1p3
2 259 7 1 active sync /dev/nvme1n1p3
However, read errors have started to appear in /var/log/syslog, in groups of three lines:
Dec 31 12:32:56 kernel: [662973.969218] blk_update_request: critical medium error, dev nvme1n1, sector 2769948928 op 0x0:(READ) flags 0x0 phys_seg 9 prio class 0
Dec 31 12:32:56 kernel: [662973.969222] md/raid1:md1: nvme1n1p3: rescheduling sector 2702369024
Dec 31 12:32:56 kernel: [662973.978792] md/raid1:md1: redirecting sector 2702369024 to other mirror: nvme0n1p3
Dec 31 12:43:11 kernel: [663588.474940] blk_update_request: critical medium error, dev nvme0n1, sector 1815443200 op 0x0:(READ) flags 0x0 phys_seg 33 prio class 0
Dec 31 12:43:11 kernel: [663588.474943] md/raid1:md1: nvme0n1p3: rescheduling sector 1747863296
Dec 31 12:43:11 kernel: [663588.499466] md/raid1:md1: redirecting sector 1747863296 to other mirror: nvme0n1p3
These are sometimes followed by:
kernel: [313519.337578] md/raid1:md1: read error corrected (8 sectors at 1367197592 on nvme1n1p3)
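To relate the whole-disk sector in each blk_update_request line to the array sector in the md/raid1 line that follows it, here is a minimal sketch that pairs the two and prints the difference. The log text is copied from the excerpt above with timestamps trimmed; the function name pair_errors is only illustrative. My assumption is that the difference corresponds to the start of partition 3 plus the md data offset, which I have not verified.

```python
import re

# Kernel messages copied from the syslog excerpt above (timestamps trimmed).
LOG = """\
blk_update_request: critical medium error, dev nvme1n1, sector 2769948928 op 0x0:(READ) flags 0x0 phys_seg 9 prio class 0
md/raid1:md1: nvme1n1p3: rescheduling sector 2702369024
blk_update_request: critical medium error, dev nvme0n1, sector 1815443200 op 0x0:(READ) flags 0x0 phys_seg 33 prio class 0
md/raid1:md1: nvme0n1p3: rescheduling sector 1747863296
"""

MEDIUM_ERR = re.compile(r"critical medium error, dev (\w+), sector (\d+)")
RESCHEDULE = re.compile(r"md/raid1:(\w+): (\w+): rescheduling sector (\d+)")


def pair_errors(log):
    """Pair each medium error with the md 'rescheduling' line that follows it."""
    pairs = []
    pending = None  # (disk, disk_sector) waiting for its md line
    for line in log.splitlines():
        m = MEDIUM_ERR.search(line)
        if m:
            pending = (m.group(1), int(m.group(2)))
            continue
        m = RESCHEDULE.search(line)
        if m and pending:
            disk, disk_sector = pending
            md_sector = int(m.group(3))
            pairs.append({
                "disk": disk,
                "member": m.group(2),
                "disk_sector": disk_sector,
                "md_sector": md_sector,
                # Assumed to be partition start + md data offset (unverified).
                "offset": disk_sector - md_sector,
            })
            pending = None
    return pairs


pairs = pair_errors(LOG)
for p in pairs:
    print(p["disk"], p["disk_sector"], "->", p["md_sector"], "offset", p["offset"])
```

Both pairs give the same offset of 67579904 sectors, so the two drives failed at different places in the array rather than at the same mirrored location. The assumption about the offset could be checked against /sys/block/nvme0n1/nvme0n1p3/start and the "Data Offset" line of mdadm --examine /dev/nvme0n1p3.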
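I turned to smartctl to look for a problem. It shows that errors have occurred in the past, but it also says "SMART overall-health self-assessment test result: PASSED".
For /dev/nvme0n1: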
$ sudo smartctl -a /dev/nvme0n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.4.0-91-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO 2TB
Serial Number: S464NB0M406242D
Firmware Version: 2B2QEXE7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization: 1,017,558,851,584 [1.01 TB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Fri Dec 31 14:01:33 2021 EST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL *Other*
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 82 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 6.20W - - 0 0 0 0 0 0
1 + 4.30W - - 1 1 1 1 0 0
2 + 2.10W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning: 0x00
Temperature: 46 Celsius
Available Spare: 73%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 232,548,547 [119 TB]
Data Units Written: 58,761,625 [30.0 TB]
Host Read Commands: 1,144,416,417
Host Write Commands: 1,551,430,546
Controller Busy Time: 7,250
Power Cycles: 114
Power On Hours: 6,365
Unsafe Shutdowns: 73
Media and Data Integrity Errors: 694
Error Information Log Entries: 926
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 46 Celsius
Temperature Sensor 2: 50 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 926 28 0x0370 0xc502 0x000 3738332404 1 -
1 925 6 0x015b 0xc502 0x000 2503721366 1 -
2 924 22 0x0000 0xc502 0x000 1963251598 1 -
3 923 11 0x038a 0xc502 0x000 1862557082 1 -
4 922 16 0x00d1 0xc502 0x000 1862557082 1 -
5 921 6 0x0141 0xc502 0x000 1826459600 1 -
6 920 20 0x03b5 0xc502 0x000 1815443442 1 -
7 919 8 0x034d 0xc502 0x000 2588273810 1 -
8 918 11 0x0315 0xc502 0x000 2583041964 1 -
9 917 9 0x02e3 0xc502 0x000 2583041964 1 -
10 916 11 0x030e 0xc502 0x000 2583023500 1 -
11 915 11 0x0308 0xc502 0x000 2583023468 1 -
12 914 11 0x033a 0xc502 0x000 2583023500 1 -
13 913 9 0x02ec 0xc502 0x000 2583023468 1 -
14 912 14 0x03d2 0xc502 0x000 2472005420 1 -
15 911 23 0x00cd 0xc502 0x000 2444721868 1 -
... (32 entries not shown)
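To make sense of a PASSED verdict sitting next to a non-zero "Media and Data Integrity Errors" counter, here is a rough sketch that pulls a few fields out of the output above and decodes the Critical Warning byte. The bit meanings follow my reading of the NVMe SMART / Health Information log; that smartctl bases its overall verdict on this byte rather than on the error counters is my assumption, not something I have confirmed. The helper names are illustrative.

```python
# A few lines copied from the /dev/nvme0n1 SMART section above.
SMART_EXCERPT = """\
Critical Warning: 0x00
Available Spare: 73%
Available Spare Threshold: 10%
Percentage Used: 0%
Media and Data Integrity Errors: 694
Error Information Log Entries: 926
"""

# Critical Warning bits as I understand them from the NVMe specification.
CRITICAL_WARNING_BITS = {
    0: "available spare below threshold",
    1: "temperature outside threshold",
    2: "reliability degraded (media or internal errors)",
    3: "media placed in read-only mode",
    4: "volatile memory backup failed",
}


def parse_fields(text):
    """Split 'Name: value' lines into a dict."""
    fields = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if value:
            fields[key.strip()] = value.strip()
    return fields


def summarize(text):
    """Extract the values that matter for the health verdict."""
    f = parse_fields(text)
    warning = int(f["Critical Warning"], 16)
    return {
        "warnings": [name for bit, name in CRITICAL_WARNING_BITS.items()
                     if warning & (1 << bit)],
        "spare_pct": int(f["Available Spare"].rstrip("%")),
        "spare_threshold_pct": int(f["Available Spare Threshold"].rstrip("%")),
        "media_errors": int(f["Media and Data Integrity Errors"].replace(",", "")),
    }


summary = summarize(SMART_EXCERPT)
print(summary)
```

With these values no warning bit is set and the spare (73%) is well above its threshold (10%), which would explain the PASSED result even though 694 media errors have been counted.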
For /dev/nvme1n1:
$ sudo smartctl -a /dev/nvme1n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.4.0-91-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO 2TB
Serial Number: S464NB0M403333H
Firmware Version: 2B2QEXE7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization: 1,044,938,612,736 [1.04 TB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Fri Dec 31 14:03:07 2021 EST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL *Other*
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 82 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 6.20W - - 0 0 0 0 0 0
1 + 4.30W - - 1 1 1 1 0 0
2 + 2.10W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02, NSID 0x1)
Critical Warning: 0x00
Temperature: 45 Celsius
Available Spare: 81%
Available Spare Threshold: 10%
Percentage Used: 1%
Data Units Read: 180,057,901 [92.1 TB]
Data Units Written: 77,700,415 [39.7 TB]
Host Read Commands: 801,630,346
Host Write Commands: 1,566,190,001
Controller Busy Time: 6,925
Power Cycles: 156
Power On Hours: 6,260
Unsafe Shutdowns: 86
Media and Data Integrity Errors: 721
Error Information Log Entries: 1,015
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 45 Celsius
Temperature Sensor 2: 52 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
Num ErrCount SQId CmdId Status PELoc LBA NSID VS
0 1015 22 0x0178 0xc502 0x000 2395920012 1 -
1 1014 31 0x02d6 0xc502 0x000 2065018576 1 -
2 1013 10 0x004e 0xc502 0x000 1928508102 1 -
3 1012 6 0x02aa 0xc502 0x000 2769949126 1 -
4 1011 27 0x0204 0xc502 0x000 2180665946 1 -
5 1010 27 0x023b 0xc502 0x000 2180598396 1 -
6 1009 14 0x00ee 0xc502 0x000 2562333810 1 -
7 1008 13 0x0075 0xc502 0x000 2423243572 1 -
8 1007 30 0x03bb 0xc502 0x000 2326927278 1 -
9 1006 24 0x03e6 0xc502 0x000 1775468746 1 -
10 1005 16 0x0066 0xc502 0x000 1775468746 1 -
11 1004 23 0x0148 0xc502 0x000 2813092280 1 -
12 1003 26 0x02fa 0xc502 0x000 2452856518 1 -
13 1002 5 0x03b1 0xc502 0x000 2119789206 1 -
14 1001 27 0x009b 0xc502 0x000 3047371772 1 -
15 1000 5 0x036c 0xc502 0x000 3047371772 1 -
... (5 entries not shown)
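Every entry in both error logs carries the same Status value, 0xc502. Below is a tentative decode of that value using the NVMe completion status layout. It assumes that smartctl 6.6 prints the raw 16-bit field with the phase tag in bit 0; if the value shown is already shifted, this decode does not apply. The lookup tables cover only the codes relevant here, and the function name is illustrative.

```python
# Only the codes needed for this example; not a complete table.
STATUS_CODE_TYPES = {0x0: "Generic Command Status",
                     0x1: "Command Specific Status",
                     0x2: "Media and Data Integrity Errors"}
MEDIA_STATUS_CODES = {0x80: "Write Fault",
                      0x81: "Unrecovered Read Error"}


def decode_nvme_status(raw):
    """Decode a raw 16-bit status field (assumed: bit 0 is the phase tag)."""
    status = raw >> 1                      # drop the phase tag
    sct = (status >> 8) & 0x7              # Status Code Type
    sc = status & 0xFF                     # Status Code
    return {
        "phase_tag": raw & 0x1,
        "sct": sct,
        "sc": sc,
        "more": bool((status >> 13) & 1),  # more info in the error log
        "dnr": bool((status >> 14) & 1),   # Do Not Retry
        "sct_name": STATUS_CODE_TYPES.get(sct, "unknown"),
        "sc_name": MEDIA_STATUS_CODES.get(sc, "unknown") if sct == 0x2 else "unknown",
    }


decoded = decode_nvme_status(0xC502)
print(decoded)
```

Under that assumption the value decodes to status code type 2 with status code 0x81, an unrecovered read error with the Do Not Retry bit set, which matches the "critical medium error" reads in syslog.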
Neither drive seems to support self-tests (smartctl -c does not list any self-test capability at all):
$ sudo smartctl -c /dev/nvme0n1
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-5.4.0-91-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL *Other*
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat *Other*
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 82 Celsius
Critical Comp. Temp. Threshold: 82 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 6.20W - - 0 0 0 0 0 0
1 + 4.30W - - 1 1 1 1 0 0
2 + 2.10W - - 2 2 2 2 0 0
3 - 0.0400W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
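Update to my question:
Some of the errors appear to be triggered by the checkarray script, which runs once a month: the errors start on "the first Sunday of each month, at 01:06 am". "man md" adds:
[On] RAID1 it is possible for software issues to cause a mismatch [between the two disks] to be reported. This does not necessarily mean that the data on the array is corrupted. It could simply be that the system does not care what is stored on that part of the array - it is unused space. The most likely cause for an unexpected mismatch on RAID1 or RAID10 occurs if a swap partition or swap file is stored on the array.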
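Since the man page excerpt is about mismatches reported by a check, a small sketch for reading what the last check recorded may be useful. It reads the sync_action and mismatch_cnt attributes that md exposes under sysfs; the array name md1 comes from the output above, while the function name and the sysfs_root parameter (there only so the function can be pointed at a test directory) are illustrative.

```python
from pathlib import Path


def read_md_check_state(md_name="md1", sysfs_root="/sys/block"):
    """Return the current sync action and the mismatch count of an md array.

    Attributes that cannot be found are reported as None.
    """
    md_dir = Path(sysfs_root) / md_name / "md"
    state = {}
    for attr in ("sync_action", "mismatch_cnt"):
        attr_file = md_dir / attr
        state[attr] = attr_file.read_text().strip() if attr_file.exists() else None
    if state["mismatch_cnt"] is not None:
        # mismatch_cnt is a count of sectors found to differ during a check.
        state["mismatch_cnt"] = int(state["mismatch_cnt"])
    return state


print(read_md_check_state())
```

A non-zero mismatch_cnt after the monthly check would fit the man page explanation. Note that it is a separate question from the medium errors, which the drives themselves report.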
What should I do next? Many thanks.