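We ran some tests on our data-node physical servers using the smartctl command.
The results (this example is from one disk, sdd) show SMART Health Status: OK,
but under Total uncorrected errors we can see a value of 4 in the read row: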
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   179459994        2         0  179459994          3     121159.886           4
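What should we conclude from these errors about the health of the disk? Do they indicate a disk failure?
Note: we found no errors for the sdd disk in the kernel messages.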
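One way to double-check the kernel log for a given disk is to filter it for lines that mention the device together with an error phrase. The following is a minimal Python sketch under stated assumptions: the function name `disk_error_lines` and the phrase list `SUSPICIOUS` are made up for illustration, and the phrases are only a few that commonly show up in Linux disk error messages, not a complete list.

```python
import re

# Phrases that often appear in kernel messages about failing disks.
# Assumption: this short list is illustrative, not exhaustive.
SUSPICIOUS = ("i/o error", "medium error", "unrecovered read error", "hardware error")


def disk_error_lines(log_text, device="sdd"):
    """Return log lines that mention `device` and one of the SUSPICIOUS phrases."""
    # Match the disk itself or one of its partitions (sdd, sdd1, ...),
    # but not a different device whose name merely starts the same way (sdda).
    dev_re = re.compile(r"\b" + re.escape(device) + r"\d*\b")
    hits = []
    for line in log_text.splitlines():
        lowered = line.lower()
        if dev_re.search(line) and any(phrase in lowered for phrase in SUSPICIOUS):
            hits.append(line)
    return hits
```

The function takes the log as a string, so it can be fed the captured output of `dmesg` or the contents of a saved log file. An empty result only means that none of the listed phrases appeared for that device.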
Full smartctl output:
smartctl -a /dev/sdd
smartctl 6.5 2016-05-07 r4318 [x86_64-linux-3.10.0-957.el7.x86_64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Vendor: SEAGATE
Product: ST2000NX0433
Revision: NS02
Compliance: SPC-4
User Capacity: 2,000,398,934,016 bytes [2.00 TB]
Logical block size: 512 bytes
Formatted with type 2 protection
LU is fully provisioned
Rotation Rate: 7200 rpm
Form Factor: 2.5 inches
Logical Unit id: 0x5000c5009ead9b67
Serial number: W4605ZJS
Device type: disk
Transport protocol: SAS (SPL-3)
Local Time is: Sun Apr 10 07:43:13 2022 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Temperature Warning: Disabled or Not Supported
=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK
Current Drive Temperature: 26 C
Drive Trip Temperature: 60 C
Manufactured in week 06 of year 2017
Specified cycle count over device lifetime: 10000
Accumulated start-stop cycles: 67
Specified load-unload count over device lifetime: 300000
Accumulated load-unload cycles: 1814
Elements in grown defect list: 0
Vendor (Seagate) cache information
Blocks sent to initiator = 1875105375
Blocks received from initiator = 187534699
Blocks read from cache and sent to initiator = 190120229
Number of read and write commands whose size <= segment size = 259502723
Number of read and write commands whose size > segment size = 0
Vendor (Seagate/Hitachi) factory information
number of hours powered up = 42308.43
number of minutes until next internal SMART test = 44
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:   179459994        2         0  179459994          3     121159.886           4
write:          0        0         6          6          6     120741.496           0
verify: 2979425514       0         0 2979425514          0      18284.914           0
Non-medium error count: 465
SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   96      13                 - [-   -    -]
# 2  Background short  Completed                   96       6                 - [-   -    -]
Long (extended) Self Test duration: 20400 seconds [340.0 minutes]
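Answer 1
As for how pending and reallocated sectors should be interpreted, I recently wrote the following post:
Is this drive, which shows no signs of failure in SMART, dead?
Please treat "Elements in grown defect list" as the equivalent of what I say there about pending and reallocated sectors.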
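Answer 2
The most important parameters to look at are the read (and write) "Total uncorrected errors" and "Elements in grown defect list".
"Total uncorrected errors" gives the total number of blocks for which an uncorrected data error occurred.
When the disk firmware successfully reallocates the data recovered from a bad sector to a spare physical sector, the original bad sector is unmapped and added to the grown defect list.
In your case, "Total uncorrected errors" is 4 while "Elements in grown defect list" is 0.
This means that four sectors failed badly enough that the firmware could not remap them and replace them with spare sectors (most disks have thousands of such spares).
Four sectors is not many, but the disk may be failing. Whether to replace it is up to you. If you decide to keep it, make sure all of its data is backed up.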
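To read these two values without scanning the whole report by eye, the text output can be parsed. The following is a minimal Python sketch, not part of the original answer. The function names `run_smartctl` and `parse_scsi_smart` are made up for illustration, and the regular expressions assume the SAS/SCSI report layout shown in the question (ATA disks print a different attribute table). smartctl 7.0 and later can also emit JSON, but the version in the question (6.5) predates that, so plain text is parsed here.

```python
import re
import subprocess


def run_smartctl(device):
    """Return the text of `smartctl -a <device>` (normally requires root)."""
    # smartctl's exit status is a bit mask, and a non-zero value does not
    # necessarily mean the command failed, so check=True is deliberately not used.
    proc = subprocess.run(["smartctl", "-a", device], capture_output=True, text=True)
    return proc.stdout


def parse_scsi_smart(text):
    """Extract the grown defect count and per-operation uncorrected error totals."""
    result = {"grown_defects": None, "uncorrected": {}}

    match = re.search(r"Elements in grown defect list:\s*(\d+)", text)
    if match:
        result["grown_defects"] = int(match.group(1))

    # Each row of the error counter log has seven numeric fields after the
    # label; the last one is "Total uncorrected errors".
    row_re = re.compile(r"^[ \t]*(read|write|verify):[ \t]+(.*)$", re.MULTILINE)
    for name, rest in row_re.findall(text):
        fields = rest.split()
        if len(fields) == 7:
            result["uncorrected"][name] = int(fields[-1])
    return result


# Lines copied from the report in the question.
SAMPLE = """\
Elements in grown defect list: 0
Error counter log:
read: 179459994 2 0 179459994 3 121159.886 4
write: 0 0 6 6 6 120741.496 0
verify: 2979425514 0 0 2979425514 0 18284.914 0
"""

print(parse_scsi_smart(SAMPLE))
# {'grown_defects': 0, 'uncorrected': {'read': 4, 'write': 0, 'verify': 0}}
```

On a live system, `parse_scsi_smart(run_smartctl("/dev/sdd"))` would return the same structure for the current state of the disk.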
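In other words, as long as "Total uncorrected errors" stays at 4 and "Elements in grown defect list" stays at 0, the state of the disk is stable and you can keep using it. If either of these numbers starts to increase, that is a big red flag. There is no need to check these parameters every day, but do check them from time to time.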
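The occasional check can be automated by saving the counters once and comparing later readings against that baseline. The following is a rough Python sketch under stated assumptions: the function names, the counter keys (`read_uncorrected`, `grown_defects`) and the JSON baseline file are all hypothetical choices for illustration, and the counter values are expected to be collected separately from smartctl.

```python
import json
from pathlib import Path


def grown_counters(baseline, current):
    """Return {name: (old, new)} for every counter that increased."""
    grown = {}
    for name, old in baseline.items():
        # A counter missing from the current reading is treated as unchanged.
        new = current.get(name, old)
        if new > old:
            grown[name] = (old, new)
    return grown


def check_against_baseline(path, current):
    """Compare `current` with the baseline stored at `path`.

    On the first run the baseline file does not exist yet, so it is
    created from `current` and nothing is reported.
    """
    baseline_file = Path(path)
    if not baseline_file.exists():
        baseline_file.write_text(json.dumps(current))
        return {}
    baseline = json.loads(baseline_file.read_text())
    return grown_counters(baseline, current)
```

For the disk in the question, the first reading would be `{"read_uncorrected": 4, "grown_defects": 0}`. Any later run that returns a non-empty dictionary means one of the counters has grown, which is the red flag described above. Running the check from a weekly cron job would be one way to do it "from time to time".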