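I have a RAID 6 array that keeps automatically marking one disk (/dev/sdi) as faulty and removing it from the array, even though a long self-test reports no errors:
Test results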
sudo smartctl -l selftest /dev/sdi
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 20842 -
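The disk is a 10 TB SATA Seagate IronWolf (device model ST10000VN0008) installed in a rack-mounted server. (There are 23 other disks in the chassis, 24 in total.)
One day the disk was marked as faulty and then removed from the array. I tried to re-add it to the array, but in less than a minute it went back to the faulty state.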
/dev/md0:
Version : 1.2
Creation Time : Sun Nov 22 15:36:59 2020
Raid Level : raid6
Array Size : 214858671104 (200.10 TiB 220.02 TB)
Used Dev Size : 9766303232 (9.10 TiB 10.00 TB)
Raid Devices : 24
Total Devices : 24
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Nov 30 10:57:55 2022
State : active, degraded
Active Devices : 23
Working Devices : 23
Failed Devices : 1
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Consistency Policy : bitmap
Name : fileserver1:0 (local to host fileserver1)
UUID : 3d6:bb27:dfc55a:30118
Events : 3055912
Number Major Minor RaidDevice State
25 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
2 8 33 2 active sync /dev/sdc1
24 8 49 3 active sync /dev/sdd1
29 8 65 4 active sync /dev/sde1
5 8 81 5 active sync /dev/sdf1
6 8 97 6 active sync /dev/sdg1
7 8 113 7 active sync /dev/sdh1
- 0 0 8 removed
9 8 145 9 active sync /dev/sdj1
10 8 161 10 active sync /dev/sdk1
30 8 177 11 active sync /dev/sdl1
12 8 193 12 active sync /dev/sdm1
13 65 1 13 active sync /dev/sdq1
27 65 17 14 active sync /dev/sdr1
15 65 49 15 active sync /dev/sdt1
16 65 33 16 active sync /dev/sds1
17 65 81 17 active sync /dev/sdv1
18 65 65 18 active sync /dev/sdu1
19 65 97 19 active sync /dev/sdw1
20 65 113 20 active sync /dev/sdx1
26 8 209 21 active sync /dev/sdn1
22 8 241 22 active sync /dev/sdp1
28 8 225 23 active sync /dev/sdo1
8 8 129 - faulty /dev/sdi1
Looking at the output of smartctl -a /dev/sdi, everything seems fine; as far as I can tell there are no errors:
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 072 064 044 Pre-fail Always - 16090475
3 Spin_Up_Time 0x0003 096 088 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 51
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 899806167
9 Power_On_Hours 0x0032 077 077 000 Old_age Always - 21020
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 51
18 Head_Health 0x000b 100 100 050 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 095 000 Old_age Always - 49
190 Airflow_Temperature_Cel 0x0022 062 045 040 Old_age Always - 38 (Min/Max 38/38)
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 37
193 Load_Cycle_Count 0x0032 097 097 000 Old_age Always - 6022
194 Temperature_Celsius 0x0022 038 044 000 Old_age Always - 38 (0 23 0 0 0)
195 Hardware_ECC_Recovered 0x001a 072 064 000 Old_age Always - 16090475
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
200 Pressure_Limit 0x0023 100 100 001 Pre-fail Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 18295 (122 185 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 127160839683
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 936449414976
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 20842 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
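One way to double-check the SMART attribute table above is to compare each normalized VALUE against its THRESH programmatically. The following is a minimal sketch in Python and is not part of the original post: the function names and the list of "watched" counters are hypothetical choices, and the sample rows are copied from the table above. It applies only the generic rule (an attribute has failed when VALUE is at or below THRESH) and does not try to interpret vendor-specific raw values.

```python
import re

# A few rows copied from the attribute table above.
SAMPLE = """\
1 Raw_Read_Error_Rate 0x000f 072 064 044 Pre-fail Always - 16090475
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
188 Command_Timeout 0x0032 100 095 000 Old_age Always - 49
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
"""

# ID, name, flag, VALUE, WORST, THRESH, three text columns, first raw number.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)

# Hypothetical selection of counters that are usually expected to stay at 0.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Command_Timeout",
    "UDMA_CRC_Error_Count",
)


def parse_attributes(text):
    """Return one dict per attribute row found in the text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "name": m["name"],
                "value": int(m["value"]),
                "thresh": int(m["thresh"]),
                "raw": int(m["raw"]),
            })
    return rows


def failing_attributes(rows):
    """Names of attributes whose normalized value is at or below threshold."""
    return [r["name"] for r in rows if r["value"] <= r["thresh"]]


def nonzero_counters(rows, watched=WATCHED):
    """Watched counters whose raw value is not zero."""
    return {r["name"]: r["raw"] for r in rows
            if r["name"] in watched and r["raw"] != 0}


rows = parse_attributes(SAMPLE)
print("at or below threshold:", failing_attributes(rows))
print("non-zero watched counters:", nonzero_counters(rows))
```

With these rows nothing crosses its threshold, and the only non-zero counter in the watched set is Command_Timeout (raw value 49). How that raw value should be read is vendor-specific, so it is at most a hint to look at the path between controller and drive, not proof of anything.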
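System log
This is the system log from when I tried to add the disk back to the array. I don't fully understand it, but clearly something is going wrong: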
Nov 30 10:49:41 fileserver1 kernel: [1272353.035760] sd 0:0:24:0: [sdi] tag#2490 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.035769] sd 0:0:24:0: [sdi] tag#2490 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.035772] sd 0:0:24:0: [sdi] tag#2490 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.035775] sd 0:0:24:0: [sdi] tag#2490 CDB: Write(16) 8a 00 00 00 00 00 44 bb 55 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.035777] print_req_error: 5 callbacks suppressed
Nov 30 10:49:41 fileserver1 kernel: [1272353.035778] blk_update_request: I/O error, dev sdi, sector 1153127680 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.035980] sd 0:0:24:0: [sdi] tag#2493 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.035984] sd 0:0:24:0: [sdi] tag#2493 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.035986] sd 0:0:24:0: [sdi] tag#2493 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.035988] sd 0:0:24:0: [sdi] tag#2493 CDB: Write(16) 8a 00 00 00 00 00 44 bb 59 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.035989] blk_update_request: I/O error, dev sdi, sector 1153128704 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.036121] sd 0:0:24:0: [sdi] tag#2494 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.036123] sd 0:0:24:0: [sdi] tag#2494 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.036124] sd 0:0:24:0: [sdi] tag#2494 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.036126] sd 0:0:24:0: [sdi] tag#2494 CDB: Write(16) 8a 00 00 00 00 00 44 bb 5d 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.036127] blk_update_request: I/O error, dev sdi, sector 1153129728 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.036295] sd 0:0:24:0: [sdi] tag#2432 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.036297] sd 0:0:24:0: [sdi] tag#2432 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.036299] sd 0:0:24:0: [sdi] tag#2432 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.036300] sd 0:0:24:0: [sdi] tag#2432 CDB: Write(16) 8a 00 00 00 00 00 44 bb 61 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.036301] blk_update_request: I/O error, dev sdi, sector 1153130752 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.036432] sd 0:0:24:0: [sdi] tag#2433 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.036434] sd 0:0:24:0: [sdi] tag#2433 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.036436] sd 0:0:24:0: [sdi] tag#2433 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.036437] sd 0:0:24:0: [sdi] tag#2433 CDB: Write(16) 8a 00 00 00 00 00 44 bb 65 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.036438] blk_update_request: I/O error, dev sdi, sector 1153131776 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.036582] sd 0:0:24:0: [sdi] tag#2434 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.036584] sd 0:0:24:0: [sdi] tag#2434 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.036585] sd 0:0:24:0: [sdi] tag#2434 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.036587] sd 0:0:24:0: [sdi] tag#2434 CDB: Write(16) 8a 00 00 00 00 00 44 bb 69 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.036588] blk_update_request: I/O error, dev sdi, sector 1153132800 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.036740] sd 0:0:24:0: [sdi] tag#2435 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.036742] sd 0:0:24:0: [sdi] tag#2435 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.036743] sd 0:0:24:0: [sdi] tag#2435 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.036745] sd 0:0:24:0: [sdi] tag#2435 CDB: Write(16) 8a 00 00 00 00 00 44 bb 6d 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.036746] blk_update_request: I/O error, dev sdi, sector 1153133824 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.036882] sd 0:0:24:0: [sdi] tag#2436 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.036884] sd 0:0:24:0: [sdi] tag#2436 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.036886] sd 0:0:24:0: [sdi] tag#2436 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.036887] sd 0:0:24:0: [sdi] tag#2436 CDB: Write(16) 8a 00 00 00 00 00 44 bb 71 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.036888] blk_update_request: I/O error, dev sdi, sector 1153134848 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.037023] sd 0:0:24:0: [sdi] tag#2437 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.037025] sd 0:0:24:0: [sdi] tag#2437 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.037027] sd 0:0:24:0: [sdi] tag#2437 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.037028] sd 0:0:24:0: [sdi] tag#2437 CDB: Write(16) 8a 00 00 00 00 00 44 bb 75 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.037029] blk_update_request: I/O error, dev sdi, sector 1153135872 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.037194] sd 0:0:24:0: [sdi] tag#2438 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.037196] sd 0:0:24:0: [sdi] tag#2438 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.037197] sd 0:0:24:0: [sdi] tag#2438 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.037199] sd 0:0:24:0: [sdi] tag#2438 CDB: Write(16) 8a 00 00 00 00 00 44 bb 49 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.037199] blk_update_request: I/O error, dev sdi, sector 1153124608 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.285669] md: super_written gets error=-5
Nov 30 10:49:41 fileserver1 kernel: [1272353.285676] md/raid:md0: Disk failure on sdi1, disabling device.
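The CDB lines in the log above can be decoded by hand to see what each failed command was asking the drive to do. Here is a minimal sketch in Python, assuming the standard SCSI WRITE(16) layout (opcode 0x8A in byte 0, an 8-byte big-endian LBA in bytes 2 to 9, a 4-byte big-endian transfer length in bytes 10 to 13). The function name decode_write16 is hypothetical and not taken from any tool.

```python
def decode_write16(cdb_hex):
    """Return (lba, block_count) for a WRITE(16) CDB given as hex text."""
    cdb = bytes.fromhex(cdb_hex)  # spaces between byte pairs are ignored
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a WRITE(16) CDB")
    lba = int.from_bytes(cdb[2:10], "big")      # logical block address
    blocks = int.from_bytes(cdb[10:14], "big")  # transfer length in blocks
    return lba, blocks


# First CDB from the log above.
lba, blocks = decode_write16("8a 00 00 00 00 00 44 bb 55 00 00 00 04 00 00 00")
print(lba, blocks)  # prints: 1153127680 1024
```

The decoded LBA, 1153127680, is the same number as the sector in the blk_update_request line that follows it in the log, and each command covers 1024 blocks. If the drive exposes 512-byte logical blocks (an assumption; the post does not say), that is 512 KiB per write, the same size as the array's chunk. In other words these look like ordinary resync writes that were answered with "Not Ready", rather than errors tied to one particular bad sector.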
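Does anyone know what is going on?
Answer 1
Thanks to @roaima for the answer (given in a comment). The problem was not actually a failing disk but a faulty backplane between the RAID card and the disk.
After replacing the backplane, the problem went away.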
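A fault on the path to the disk typically stays with the slot rather than with the disk itself. One rough way to check this from the kernel log is to count failed commands per SCSI address and device name, then move the disk to another bay and compare. The following is only a sketch in Python: the function name and the regular expression are illustrative assumptions, they handle just the "sd H:C:T:L: [sdX] tag#N FAILED Result" format shown in the question's log, and whether an address corresponds to a physical bay depends on the controller and should be verified for the hardware in use.

```python
import re
from collections import Counter

# Matches lines such as:
#   sd 0:0:24:0: [sdi] tag#2490 FAILED Result: hostbyte=DID_OK ...
FAILED = re.compile(
    r"sd (?P<addr>\d+:\d+:\d+:\d+): \[(?P<dev>sd[a-z]+)\] tag#\d+ FAILED Result"
)


def count_failures(log_lines):
    """Count failed SCSI commands per (SCSI address, device name)."""
    counts = Counter()
    for line in log_lines:
        m = FAILED.search(line)
        if m:
            counts[(m["addr"], m["dev"])] += 1
    return counts


# Three lines taken from the log in the question.
sample = [
    "Nov 30 10:49:41 fileserver1 kernel: [1272353.035760] sd 0:0:24:0: [sdi] "
    "tag#2490 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s",
    "Nov 30 10:49:41 fileserver1 kernel: [1272353.035769] sd 0:0:24:0: [sdi] "
    "tag#2490 Sense Key : Not Ready [current]",
    "Nov 30 10:49:41 fileserver1 kernel: [1272353.035980] sd 0:0:24:0: [sdi] "
    "tag#2493 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s",
]

for (addr, dev), n in count_failures(sample).items():
    print(addr, dev, n)
```

If the same disk keeps failing after being moved to a different bay, the disk is the more likely culprit; if a different disk starts failing at the same address, the backplane, cable, or controller port is.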