smartctl "Elements in grown defect list" vs. RAID controller "Media Error Count"

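I run a hardware RAID 50 on a PERC810 controller in my server, and I recently ran into a metric I'm not sure about. Until now I have used the smartctl metric "Elements in grown defect list" as a hint that a drive is failing and should be removed, but if I use perccli (or storcli/megacli), the drive also shows a metric called "Media Error Count". My problem is that, from what I have read about these metrics, they are basically the same thing: both indicate reallocated sectors or physical defects on the disk. Yet some of my drives show a number greater than zero for Elements in grown defect list but zero for Media Error Count, and vice versa. For example, this disk: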

perccli /c0/e37/s7 show all
CLI Version = 007.1327.0000.0000 July 27, 2020
Operating system = Linux 4.19.0-0.bpo.9-amd64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.

Drive /c0/e37/s7 :

EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp Type 
37:7     72 Onln   1 3.637 TB SAS  HDD N   N  512B WD4001FYYG-01SL3 U  -    

EID=Enclosure Device ID|Slt=Slot No.|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild

Drive /c0/e37/s7 - Detailed Information :

Drive /c0/e37/s7 State :
Shield Counter = 0
Media Error Count = 38
Other Error Count = 118063
Drive Temperature =  41C (105.80 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No

Drive /c0/e37/s7 Device attributes :
Manufacturer Id = WD      
Model Number = WD4001FYYG-01SL3
NAND Vendor = NA
WWN = 50000C0F01F55DD1
Firmware Revision = VR08
Firmware Release Number = N/A
Raw size = 3.638 TB [0x1d1c0beb0 Sectors]
Coerced size = 3.637 TB [0x1d1b00000 Sectors]
Non Coerced size = 3.637 TB [0x1d1b0beb0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
Write Cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = 01

shows Media Error Count = 38, but when I use smartctl on the same disk:

smartctl -a -d megaraid,72 /dev/sdg
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-0.bpo.9-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke,

Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VR08
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01f55dd0
Serial number:        WMC1F0D41KD5
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Jan 28 14:14:51 2022 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

SMART Health Status: OK

Current Drive Temperature:     41 C
Drive Trip Temperature:        40 C

Accumulated power on time, hours:minutes 60298:10
Manufactured in week 46 of year 2014
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  18
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  118
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    2538437     9298     76289   2547735       9392     215124.761          94
write:   5550372  5405661   5407707  10956033    5405661     571404.363           0
verify:      184        0         0       184          0        352.277           0

Non-medium error count:   202249

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -      11                 - [-   -    -]

Long (extended) Self-test duration: 31120 seconds [518.7 minutes]

it shows Elements in grown defect list: 0

And the opposite case, on another disk:


perccli /c0/e37/s4 show all
CLI Version = 007.1327.0000.0000 July 27, 2020
Operating system = Linux 4.19.0-0.bpo.9-amd64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.

Drive /c0/e37/s4 :

EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp Type 
37:4     63 Onln   1 3.637 TB SAS  HDD N   N  512B WD4001FYYG-01SL3 U  -    

EID=Enclosure Device ID|Slt=Slot No.|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild

Drive /c0/e37/s4 - Detailed Information :

Drive /c0/e37/s4 State :
Shield Counter = 0
Media Error Count = 0
Other Error Count = 118060
Drive Temperature =  35C (95.00 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No

Drive /c0/e37/s4 Device attributes :
Manufacturer Id = WD      
Model Number = WD4001FYYG-01SL3
NAND Vendor = NA
WWN = 50000C0F01352C35
Firmware Revision = VR08
Firmware Release Number = N/A
Raw size = 3.638 TB [0x1d1c0beb0 Sectors]
Coerced size = 3.637 TB [0x1d1b00000 Sectors]
Non Coerced size = 3.637 TB [0x1d1b0beb0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
Write Cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = 01 

Drive /c0/e37/s4 Policies/Settings :
Drive position = DriveGroup:1, Span:1, Row:0
Enclosure position = 0
Connected Port Number = 0(path0) 
Sequence Number = 2
Commissioned Spare = No
Emergency Spare = No
Last Predictive Failure Event Sequence Number = 0
Successful diagnostics completion on = N/A
FDE Type = None
SED Capable = No
SED Enabled = No
Secured = No
Cryptographic Erase Capable = No
Sanitize Support = Not supported
Locked = No
Needs EKM Attention = No
PI Eligible = No
Certified = No
Wide Port Capable = No

Port Information :

Port Status Linkspeed SAS address        
   0 Active 6.0Gb/s   0x50000c0f01352c36 
   1 Active Unknown   0x0                

Inquiry Data = 
00 00 06 12 5b 01 10 02 57 44 20 20 20 20 20 20 
57 44 34 30 30 31 46 59 59 47 2d 30 31 53 4c 33 
56 52 30 38 57 44 2d 57 4d 43 31 46 30 44 32 32 
32 4b 46 20 20 20 20 20 00 00 00 a0 0c 40 20 c0 
04 60 04 c0 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 

where Media Error Count = 0, but smartctl:

smartctl -a -d megaraid,63 /dev/sdg
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-0.bpo.9-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke,

Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VR08
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01352c34
Serial number:        WMC1F0D222KF
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Jan 28 14:39:52 2022 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

SMART Health Status: OK

Current Drive Temperature:     35 C
Drive Trip Temperature:        40 C

Accumulated power on time, hours:minutes 60299:24
Manufactured in week 46 of year 2014
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  18
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  118
Elements in grown defect list: 44

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    4899063        1         1   4899064          1     215489.217           0
write:   6593514      494       496   6594008        499     571584.348           0
verify:      345        0         0       345          0        349.197           0

Non-medium error count:   202287

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -      11                 - [-   -    -]

Long (extended) Self-test duration: 31120 seconds [518.7 minutes]

shows Elements in grown defect list: 44




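Media Error Count measures medium errors as seen from the RAID card.

Elements in grown defect list shows the size of the grown defect list, i.e. the number of remapped sectors, as seen from the drive itself.

The two counters can be out of sync for several reasons, for example:

  • a RAID array can be created from a disk that already accumulated many defects in another array or as a standalone disk;
  • a disk's background surface scan can detect and remap any number of sectors without the upper layer (i.e. the RAID card) noticing;
  • writes to defective sectors are remapped "on the fly" by the disk itself, without any intervention from the RAID card;
  • a RAID patrol read can find an unreadable sector (note the total uncorrected errors count on the first disk), and the subsequent rewrite of that same sector succeeds. The RAID array then logs a medium error, but the disk does not remap the sector (I consider disks that do not remap such sectors defective, but I have seen them in the wild).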
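
If you want to watch both counters side by side instead of trusting only one, something like the following rough sketch may help. It assumes the text layout matches the perccli and smartctl output pasted in the question. The function names (media_error_count, smart_counters, explain) are made up for illustration, and the wording returned by explain is my own heuristic reading of the cases listed above, not something either tool reports.

```python
import re


def media_error_count(perccli_text):
    """Return 'Media Error Count' from `perccli ... show all` text, or None."""
    m = re.search(r"^Media Error Count\s*=\s*(\d+)", perccli_text, re.MULTILINE)
    return int(m.group(1)) if m else None


def smart_counters(smartctl_text):
    """Return (grown_defects, uncorrected_reads) from `smartctl -a` text of a SAS disk."""
    grown = re.search(
        r"^Elements in grown defect list:\s*(\d+)", smartctl_text, re.MULTILINE
    )
    # Last column of the 'read:' row in the error counter log is
    # 'Total uncorrected errors'.
    read_row = re.search(r"^read:\s+(.+)$", smartctl_text, re.MULTILINE)
    return (
        int(grown.group(1)) if grown else None,
        int(read_row.group(1).split()[-1]) if read_row else None,
    )


def explain(media_errors, grown_defects, uncorrected_reads):
    """Heuristic (an assumption, not tool output): map the counters to a hint."""
    if media_errors is None or grown_defects is None:
        return "missing data"
    if media_errors > 0 and grown_defects == 0:
        hint = "controller logged medium errors, drive remapped nothing"
        if uncorrected_reads:
            hint += f" ({uncorrected_reads} uncorrected reads on the drive side)"
        return hint
    if media_errors == 0 and grown_defects > 0:
        return "drive remapped sectors the controller never saw fail"
    if media_errors > 0 and grown_defects > 0:
        return "both layers report defects"
    return "no defects reported by either layer"


# Excerpts copied from the two disks in the question.
SLOT7_PERCCLI = "Shield Counter = 0\nMedia Error Count = 38\nOther Error Count = 118063\n"
SLOT7_SMARTCTL = (
    "Elements in grown defect list: 0\n"
    "read:    2538437     9298     76289   2547735       9392     215124.761          94\n"
)
SLOT4_PERCCLI = "Shield Counter = 0\nMedia Error Count = 0\nOther Error Count = 118060\n"
SLOT4_SMARTCTL = (
    "Elements in grown defect list: 44\n"
    "read:    4899063        1         1   4899064          1     215489.217           0\n"
)

if __name__ == "__main__":
    for slot, perc, smart in (
        ("s7", SLOT7_PERCCLI, SLOT7_SMARTCTL),
        ("s4", SLOT4_PERCCLI, SLOT4_SMARTCTL),
    ):
        grown, uncorrected = smart_counters(smart)
        print(slot, explain(media_error_count(perc), grown, uncorrected))
```

In practice you would feed it the real output of both commands for each slot. Either counter being non-zero (or growing over time) is worth a look, since each one only shows what its own layer saw.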
