Why does mdadm remove an error-free disk from the array and mark it as faulty?

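I have a RAID6 array that keeps automatically marking one disk (/dev/sdi) as faulty and removing it from the array, even though an extended SMART self-test shows no errors:

Test results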

sudo smartctl -l selftest /dev/sdi

Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     20842         -

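The disk is a SATA 10 TB Seagate IronWolf (device model ST10000VN0008) installed in a rack-mounted server chassis. (There are 23 other disks in the chassis, 24 in total.)

One day the disk was marked as faulty and then removed from the array. I tried re-adding it to the array, but in less than a minute it went back to the faulty state.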

/dev/md0:
           Version : 1.2
     Creation Time : Sun Nov 22 15:36:59 2020
        Raid Level : raid6
        Array Size : 214858671104 (200.10 TiB 220.02 TB)
     Used Dev Size : 9766303232 (9.10 TiB 10.00 TB)
      Raid Devices : 24
     Total Devices : 24
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Wed Nov 30 10:57:55 2022
             State : active, degraded 
    Active Devices : 23
   Working Devices : 23
    Failed Devices : 1
     Spare Devices : 0

            Layout : left-symmetric
        Chunk Size : 512K

Consistency Policy : bitmap

              Name : fileserver1:0  (local to host fileserver1)
              UUID : 3d6:bb27:dfc55a:30118
            Events : 3055912

    Number   Major   Minor   RaidDevice State
      25       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
       2       8       33        2      active sync   /dev/sdc1
      24       8       49        3      active sync   /dev/sdd1
      29       8       65        4      active sync   /dev/sde1
       5       8       81        5      active sync   /dev/sdf1
       6       8       97        6      active sync   /dev/sdg1
       7       8      113        7      active sync   /dev/sdh1
       -       0        0        8      removed
       9       8      145        9      active sync   /dev/sdj1
      10       8      161       10      active sync   /dev/sdk1
      30       8      177       11      active sync   /dev/sdl1
      12       8      193       12      active sync   /dev/sdm1
      13      65        1       13      active sync   /dev/sdq1
      27      65       17       14      active sync   /dev/sdr1
      15      65       49       15      active sync   /dev/sdt1
      16      65       33       16      active sync   /dev/sds1
      17      65       81       17      active sync   /dev/sdv1
      18      65       65       18      active sync   /dev/sdu1
      19      65       97       19      active sync   /dev/sdw1
      20      65      113       20      active sync   /dev/sdx1
      26       8      209       21      active sync   /dev/sdn1
      22       8      241       22      active sync   /dev/sdp1
      28       8      225       23      active sync   /dev/sdo1

       8       8      129        -      faulty   /dev/sdi1

Looking at the output of smartctl -a /dev/sdi, everything looks fine; as far as I can tell there are no errors:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   072   064   044    Pre-fail  Always       -       16090475
  3 Spin_Up_Time            0x0003   096   088   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       51
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   090   060   045    Pre-fail  Always       -       899806167
  9 Power_On_Hours          0x0032   077   077   000    Old_age   Always       -       21020
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       51
 18 Head_Health             0x000b   100   100   050    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   095   000    Old_age   Always       -       49
190 Airflow_Temperature_Cel 0x0022   062   045   040    Old_age   Always       -       38 (Min/Max 38/38)
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0032   097   097   000    Old_age   Always       -       6022
194 Temperature_Celsius     0x0022   038   044   000    Old_age   Always       -       38 (0 23 0 0 0)
195 Hardware_ECC_Recovered  0x001a   072   064   000    Old_age   Always       -       16090475
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Pressure_Limit          0x0023   100   100   001    Pre-fail  Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       18295 (122 185 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       127160839683
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       936449414976

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     20842         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

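System log

This is the system log from when I tried to add the disk back to the array. I don't fully understand it, but there are clearly errors: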

Nov 30 10:49:41 fileserver1 kernel: [1272353.035760] sd 0:0:24:0: [sdi] tag#2490 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.035769] sd 0:0:24:0: [sdi] tag#2490 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.035772] sd 0:0:24:0: [sdi] tag#2490 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.035775] sd 0:0:24:0: [sdi] tag#2490 CDB: Write(16) 8a 00 00 00 00 00 44 bb 55 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.035777] print_req_error: 5 callbacks suppressed
Nov 30 10:49:41 fileserver1 kernel: [1272353.035778] blk_update_request: I/O error, dev sdi, sector 1153127680 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.035980] sd 0:0:24:0: [sdi] tag#2493 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.035984] sd 0:0:24:0: [sdi] tag#2493 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.035986] sd 0:0:24:0: [sdi] tag#2493 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.035988] sd 0:0:24:0: [sdi] tag#2493 CDB: Write(16) 8a 00 00 00 00 00 44 bb 59 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.035989] blk_update_request: I/O error, dev sdi, sector 1153128704 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.036121] sd 0:0:24:0: [sdi] tag#2494 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.036123] sd 0:0:24:0: [sdi] tag#2494 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.036124] sd 0:0:24:0: [sdi] tag#2494 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.036126] sd 0:0:24:0: [sdi] tag#2494 CDB: Write(16) 8a 00 00 00 00 00 44 bb 5d 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.036127] blk_update_request: I/O error, dev sdi, sector 1153129728 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.036295] sd 0:0:24:0: [sdi] tag#2432 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.036297] sd 0:0:24:0: [sdi] tag#2432 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.036299] sd 0:0:24:0: [sdi] tag#2432 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.036300] sd 0:0:24:0: [sdi] tag#2432 CDB: Write(16) 8a 00 00 00 00 00 44 bb 61 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.036301] blk_update_request: I/O error, dev sdi, sector 1153130752 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.036432] sd 0:0:24:0: [sdi] tag#2433 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.036434] sd 0:0:24:0: [sdi] tag#2433 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.036436] sd 0:0:24:0: [sdi] tag#2433 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.036437] sd 0:0:24:0: [sdi] tag#2433 CDB: Write(16) 8a 00 00 00 00 00 44 bb 65 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.036438] blk_update_request: I/O error, dev sdi, sector 1153131776 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.036582] sd 0:0:24:0: [sdi] tag#2434 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.036584] sd 0:0:24:0: [sdi] tag#2434 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.036585] sd 0:0:24:0: [sdi] tag#2434 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.036587] sd 0:0:24:0: [sdi] tag#2434 CDB: Write(16) 8a 00 00 00 00 00 44 bb 69 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.036588] blk_update_request: I/O error, dev sdi, sector 1153132800 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.036740] sd 0:0:24:0: [sdi] tag#2435 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.036742] sd 0:0:24:0: [sdi] tag#2435 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.036743] sd 0:0:24:0: [sdi] tag#2435 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.036745] sd 0:0:24:0: [sdi] tag#2435 CDB: Write(16) 8a 00 00 00 00 00 44 bb 6d 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.036746] blk_update_request: I/O error, dev sdi, sector 1153133824 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.036882] sd 0:0:24:0: [sdi] tag#2436 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.036884] sd 0:0:24:0: [sdi] tag#2436 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.036886] sd 0:0:24:0: [sdi] tag#2436 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.036887] sd 0:0:24:0: [sdi] tag#2436 CDB: Write(16) 8a 00 00 00 00 00 44 bb 71 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.036888] blk_update_request: I/O error, dev sdi, sector 1153134848 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.037023] sd 0:0:24:0: [sdi] tag#2437 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.037025] sd 0:0:24:0: [sdi] tag#2437 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.037027] sd 0:0:24:0: [sdi] tag#2437 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.037028] sd 0:0:24:0: [sdi] tag#2437 CDB: Write(16) 8a 00 00 00 00 00 44 bb 75 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.037029] blk_update_request: I/O error, dev sdi, sector 1153135872 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.037194] sd 0:0:24:0: [sdi] tag#2438 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=4s
Nov 30 10:49:41 fileserver1 kernel: [1272353.037196] sd 0:0:24:0: [sdi] tag#2438 Sense Key : Not Ready [current]
Nov 30 10:49:41 fileserver1 kernel: [1272353.037197] sd 0:0:24:0: [sdi] tag#2438 Add. Sense: Logical unit not ready, cause not reportable
Nov 30 10:49:41 fileserver1 kernel: [1272353.037199] sd 0:0:24:0: [sdi] tag#2438 CDB: Write(16) 8a 00 00 00 00 00 44 bb 49 00 00 00 04 00 00 00
Nov 30 10:49:41 fileserver1 kernel: [1272353.037199] blk_update_request: I/O error, dev sdi, sector 1153124608 op 0x1:(WRITE) flags 0x4000 phys_seg 128 prio class 0
Nov 30 10:49:41 fileserver1 kernel: [1272353.285669] md: super_written gets error=-5
Nov 30 10:49:41 fileserver1 kernel: [1272353.285676] md/raid:md0: Disk failure on sdi1, disabling device.
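
As a reading aid for the CDB lines in the log above, here is a minimal Python sketch that decodes a WRITE(16) command block. It assumes the standard WRITE(16) layout (opcode 0x8a, an 8-byte big-endian LBA in bytes 2 to 9, a 4-byte transfer length in bytes 10 to 13); the function name decode_write16 is made up for this illustration.

```python
def decode_write16(cdb_hex: str) -> dict:
    """Decode a SCSI WRITE(16) CDB given as space-separated hex bytes."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 16 or cdb[0] != 0x8A:
        raise ValueError("not a 16-byte WRITE(16) CDB")
    return {
        # Bytes 2..9: starting logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:10], "big"),
        # Bytes 10..13: number of logical blocks to write, big-endian.
        "blocks": int.from_bytes(cdb[10:14], "big"),
    }


# First failed command from the log (tag#2490).
first = decode_write16("8a 00 00 00 00 00 44 bb 55 00 00 00 04 00 00 00")
print(first)  # {'lba': 1153127680, 'blocks': 1024}
```

The decoded LBA equals the sector number in the matching blk_update_request line (1153127680), so each failed command is an ordinary 1024-block write. If the drive uses 512-byte logical blocks (an assumption, the post does not state the block size), that is 512 KiB per write, the same as the array's chunk size.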

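Does anyone know what is going on?

Answer 1

Thanks to @roaima for the answer (given in a comment). The problem was in fact not caused by a failing disk, but by a faulty backplane between the RAID card and the disks.

After replacing the backplane, the problem went away.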
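
In hindsight the log already hinted at this: every failed write reports the sense key "Not Ready", not "Medium Error", and SMART shows no pending or reallocated sectors. That is consistent with (though not proof of) the drive dropping out behind the backplane and not with bad sectors. Below is a rough Python sketch for tallying sense keys per device; it assumes the kernel log format shown in the question, and the helper name tally_sense_keys is made up for this example.

```python
import re
from collections import Counter

# Matches lines such as: "... [sdi] tag#2490 Sense Key : Not Ready [current]"
SENSE_RE = re.compile(
    r"\[(?P<dev>sd[a-z]+)\] tag#\d+ Sense Key : (?P<key>[^\[]+?)\s*\["
)


def tally_sense_keys(lines):
    """Count (device, sense key) pairs found in kernel log lines."""
    counts = Counter()
    for line in lines:
        m = SENSE_RE.search(line)
        if m:
            counts[(m["dev"], m["key"])] += 1
    return counts


# Three lines copied from the log in the question.
sample = [
    "Nov 30 10:49:41 fileserver1 kernel: [1272353.035769] sd 0:0:24:0: [sdi] tag#2490 Sense Key : Not Ready [current]",
    "Nov 30 10:49:41 fileserver1 kernel: [1272353.035772] sd 0:0:24:0: [sdi] tag#2490 Add. Sense: Logical unit not ready, cause not reportable",
    "Nov 30 10:49:41 fileserver1 kernel: [1272353.035984] sd 0:0:24:0: [sdi] tag#2493 Sense Key : Not Ready [current]",
]
print(tally_sense_keys(sample))  # Counter({('sdi', 'Not Ready'): 2})
```

To run it over a real log, pass an open file object instead of the sample list, for example the saved output of dmesg or journalctl -k.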
