smartctl "Elements in grown defect list" vs. RAID controller "Media Error Count"

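I run hardware RAID 50 on a PERC810 controller in my servers, and I recently ran into a metric I am unsure about. Until now I have used the smartctl metric "Elements in grown defect list" as the sign that a drive is failing and should be removed. However, perccli (or storcli/megacli) also reports a per-drive metric called "Media Error Count". From what I have read, the two are basically the same thing: both indicate reallocated sectors or physical defects on the disk. Yet some of my drives show a non-zero "Elements in grown defect list" with a Media Error Count of zero, and vice versa. For example, this disk: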

perccli /c0/e37/s7 show all
CLI Version = 007.1327.0000.0000 July 27, 2020
Operating system = Linux 4.19.0-0.bpo.9-amd64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.


Drive /c0/e37/s7 :
================

----------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp Type 
----------------------------------------------------------------------------
37:7     72 Onln   1 3.637 TB SAS  HDD N   N  512B WD4001FYYG-01SL3 U  -    
----------------------------------------------------------------------------

EID=Enclosure Device ID|Slt=Slot No.|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild


Drive /c0/e37/s7 - Detailed Information :
=======================================

Drive /c0/e37/s7 State :
======================
Shield Counter = 0
Media Error Count = 38
Other Error Count = 118063
Drive Temperature =  41C (105.80 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No


Drive /c0/e37/s7 Device attributes :
==================================
SN = WMC1F0D41KD5
Manufacturer Id = WD      
Model Number = WD4001FYYG-01SL3
NAND Vendor = NA
WWN = 50000C0F01F55DD1
Firmware Revision = VR08
Firmware Release Number = N/A
Raw size = 3.638 TB [0x1d1c0beb0 Sectors]
Coerced size = 3.637 TB [0x1d1b00000 Sectors]
Non Coerced size = 3.637 TB [0x1d1b0beb0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
Write Cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = 01

shows Media Error Count = 38, but when I run smartctl against the same disk:

smartctl -a -d megaraid,72 /dev/sdg
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-0.bpo.9-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VR08
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01f55dd0
Serial number:        WMC1F0D41KD5
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Jan 28 14:14:51 2022 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     41 C
Drive Trip Temperature:        40 C

Accumulated power on time, hours:minutes 60298:10
Manufactured in week 46 of year 2014
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  18
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  118
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    2538437     9298     76289   2547735       9392     215124.761          94
write:   5550372  5405661   5407707  10956033    5405661     571404.363           0
verify:      184        0         0       184          0        352.277           0

Non-medium error count:   202249

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -      11                 - [-   -    -]

Long (extended) Self-test duration: 31120 seconds [518.7 minutes]

it shows Elements in grown defect list: 0

Here is another example from the same server, just a different drive:

perccli /c0/e37/s4 show all
CLI Version = 007.1327.0000.0000 July 27, 2020
Operating system = Linux 4.19.0-0.bpo.9-amd64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.


Drive /c0/e37/s4 :
================

----------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp Type 
----------------------------------------------------------------------------
37:4     63 Onln   1 3.637 TB SAS  HDD N   N  512B WD4001FYYG-01SL3 U  -    
----------------------------------------------------------------------------

EID=Enclosure Device ID|Slt=Slot No.|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild


Drive /c0/e37/s4 - Detailed Information :
=======================================

Drive /c0/e37/s4 State :
======================
Shield Counter = 0
Media Error Count = 0
Other Error Count = 118060
Drive Temperature =  35C (95.00 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No


Drive /c0/e37/s4 Device attributes :
==================================
SN = WMC1F0D222KF
Manufacturer Id = WD      
Model Number = WD4001FYYG-01SL3
NAND Vendor = NA
WWN = 50000C0F01352C35
Firmware Revision = VR08
Firmware Release Number = N/A
Raw size = 3.638 TB [0x1d1c0beb0 Sectors]
Coerced size = 3.637 TB [0x1d1b00000 Sectors]
Non Coerced size = 3.637 TB [0x1d1b0beb0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
Write Cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = 01 


Drive /c0/e37/s4 Policies/Settings :
==================================
Drive position = DriveGroup:1, Span:1, Row:0
Enclosure position = 0
Connected Port Number = 0(path0) 
Sequence Number = 2
Commissioned Spare = No
Emergency Spare = No
Last Predictive Failure Event Sequence Number = 0
Successful diagnostics completion on = N/A
FDE Type = None
SED Capable = No
SED Enabled = No
Secured = No
Cryptographic Erase Capable = No
Sanitize Support = Not supported
Locked = No
Needs EKM Attention = No
PI Eligible = No
Certified = No
Wide Port Capable = No

Port Information :
================

-----------------------------------------
Port Status Linkspeed SAS address        
-----------------------------------------
   0 Active 6.0Gb/s   0x50000c0f01352c36 
   1 Active Unknown   0x0                
-----------------------------------------


Inquiry Data = 
00 00 06 12 5b 01 10 02 57 44 20 20 20 20 20 20 
57 44 34 30 30 31 46 59 59 47 2d 30 31 53 4c 33 
56 52 30 38 57 44 2d 57 4d 43 31 46 30 44 32 32 
32 4b 46 20 20 20 20 20 00 00 00 a0 0c 40 20 c0 
04 60 04 c0 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 

Here Media Error Count = 0, but smartctl:

smartctl -a -d megaraid,63 /dev/sdg
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-0.bpo.9-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VR08
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01352c34
Serial number:        WMC1F0D222KF
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Jan 28 14:39:52 2022 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     35 C
Drive Trip Temperature:        40 C

Accumulated power on time, hours:minutes 60299:24
Manufactured in week 46 of year 2014
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  18
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  118
Elements in grown defect list: 44

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    4899063        1         1   4899064          1     215489.217           0
write:   6593514      494       496   6594008        499     571584.348           0
verify:      345        0         0       345          0        349.197           0

Non-medium error count:   202287

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -      11                 - [-   -    -]

Long (extended) Self-test duration: 31120 seconds [518.7 minutes]

shows Elements in grown defect list: 44

Could you explain the difference between these two metrics, and which one I should use to decide that a drive is failing? Thank you.

Answer 1

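The discrepancy arises because, although the two metrics measure similar things, they operate at different layers.

Media Error Count counts media errors as seen by the RAID card.

Elements in grown defect list shows the size of the grown defect list, i.e. the number of remapped sectors, as seen by the drive itself.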

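The two values can fail to match for several reasons:

  • the RAID array may have been created after a disk had already accumulated defects in another array or as a standalone disk;
  • the disk's background surface scan can detect and remap any number of sectors without the upper layer (i.e. the RAID card) noticing;
  • a write to a defective sector is remapped "on the fly" by the disk itself, with no intervention from the RAID card;
  • a RAID patrol scan may find unreadable sectors (note how many total uncorrected errors the first disk reports) and the subsequent rewrite of the same sectors succeeds, so the RAID array records a media error but the disk does not remap the sector (I consider a disk that does not remap such sectors to be defective, but I have seen them in the wild).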
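
Because the two counters live at different layers, it can help to read them side by side for each drive. The following is a minimal sketch, not a definitive tool: it assumes the text layout of the perccli and smartctl output pasted in the question, and the function names (`media_error_count`, `drive_counters`, `describe`) are hypothetical helpers invented for illustration. The sample strings are excerpts of the two drives shown above.

```python
import re

def media_error_count(perccli_text):
    """Controller-side counter from `perccli ... show all` output (None if absent)."""
    m = re.search(r"^Media Error Count\s*=\s*(\d+)", perccli_text, re.MULTILINE)
    return int(m.group(1)) if m else None

def drive_counters(smartctl_text):
    """Drive-side counters from `smartctl -a` output of a SAS disk.

    Returns (grown_defects, uncorrected), where uncorrected maps
    "read"/"write"/"verify" to the last column of the error counter log.
    """
    m = re.search(r"^Elements in grown defect list:\s*(\d+)",
                  smartctl_text, re.MULTILINE)
    grown = int(m.group(1)) if m else None
    uncorrected = {}
    for op in ("read", "write", "verify"):
        row = re.search(rf"^{op}:\s+(.+)$", smartctl_text, re.MULTILINE)
        if row:
            # Last column of each row is "Total uncorrected errors".
            uncorrected[op] = int(row.group(1).split()[-1])
    return grown, uncorrected

def describe(media_errors, grown):
    """Label which layer has recorded problems (assumed wording, not a verdict)."""
    if media_errors is None or grown is None:
        return "incomplete data"
    if media_errors > 0 and grown == 0:
        return "controller saw media errors, drive has remapped nothing"
    if media_errors == 0 and grown > 0:
        return "drive remapped sectors, controller saw no media errors"
    if media_errors > 0 and grown > 0:
        return "both layers report defects"
    return "no defects reported at either layer"

# Excerpts of the output shown in the question (slot 7 and slot 4).
PERCCLI_S7 = "Shield Counter = 0\nMedia Error Count = 38\nOther Error Count = 118063\n"
SMARTCTL_S7 = (
    "Elements in grown defect list: 0\n"
    "read:    2538437     9298     76289   2547735       9392     215124.761          94\n"
    "write:   5550372  5405661   5407707  10956033    5405661     571404.363           0\n"
    "verify:      184        0         0       184          0        352.277           0\n"
)
PERCCLI_S4 = "Shield Counter = 0\nMedia Error Count = 0\nOther Error Count = 118060\n"
SMARTCTL_S4 = (
    "Elements in grown defect list: 44\n"
    "read:    4899063        1         1   4899064          1     215489.217           0\n"
    "write:   6593514      494       496   6594008        499     571584.348           0\n"
    "verify:      345        0         0       345          0        349.197           0\n"
)

for slot, perc, smart in (("s7", PERCCLI_S7, SMARTCTL_S7),
                          ("s4", PERCCLI_S4, SMARTCTL_S4)):
    media = media_error_count(perc)
    grown, uncorrected = drive_counters(smart)
    print(slot, media, grown, uncorrected, "->", describe(media, grown))
```

On a live system the two text inputs would come from running `perccli /c0/e37/sN show all` and `smartctl -a -d megaraid,DID /dev/sdX` as in the question. The sketch only reports which layer has seen problems; it sets no replacement threshold.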
