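I have a software RAID 1 array on RHEL. Every morning I receive this error email: Device: /dev/sda [SAT], 1 Currently unreadable (pending) sectors
When I run tests on sda (or sdb), everything appears to pass. Am I missing something?
mdstat shows that the RAID arrays are healthy: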
# cat /proc/mdstat
Personalities : [raid1]
md5 : active raid1 sdb5[1] sda5[0]
108026816 blocks [2/2] [UU]
md1 : active raid1 sdb1[1] sda1[0]
511936 blocks [2/2] [UU]
md2 : active raid1 sda2[0] sdb2[1]
805306304 blocks [2/2] [UU]
md3 : active raid1 sda3[0] sdb3[1]
62914496 blocks [2/2] [UU]
unused devices: <none>
I have also tried running: # smartctl -t long /dev/sda
Here is the output of: # smartctl -q noserial -a /dev/sda
smartctl 5.43 2012-06-30 r3573 [x86_64-linux-2.6.32-279.9.1.el6.x86_64] (local build)
Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Hitachi Ultrastar A7K1000
Device Model: Hitachi HUA721010KLA330
Firmware Version: GKAOAB0A
User Capacity: 1,000,204,886,016 bytes [1.00 TB]
Sector Size: 512 bytes logical/physical
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 1
Local Time is: Sun May 21 17:51:42 2017 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x84) Offline data collection activity
was suspended by an interrupting command from host.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (15354) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 255) minutes.
SCT capabilities: (0x003f) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 098 098 016 Pre-fail Always - 4
2 Throughput_Performance 0x0005 100 100 054 Pre-fail Offline - 0
3 Spin_Up_Time 0x0007 122 122 024 Pre-fail Always - 550 (Average 591)
4 Start_Stop_Count 0x0012 100 100 000 Old_age Always - 68
5 Reallocated_Sector_Ct 0x0033 100 100 005 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 067 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 020 Pre-fail Offline - 0
9 Power_On_Hours 0x0012 094 094 000 Old_age Always - 43202
10 Spin_Retry_Count 0x0013 100 100 060 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 68
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 751
193 Load_Cycle_Count 0x0012 100 100 000 Old_age Always - 751
194 Temperature_Celsius 0x0002 090 090 000 Old_age Always - 66 (Min/Max 17/72)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0022 100 100 000 Old_age Always - 1
198 Offline_Uncorrectable 0x0008 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x000a 200 200 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 43186 -
# 2 Extended offline Completed without error 00% 43170 -
# 3 Short offline Completed without error 00% 43162 -
# 4 Short offline Completed without error 00% 43138 -
# 5 Short offline Completed without error 00% 43114 -
# 6 Short offline Completed without error 00% 43090 -
# 7 Short offline Completed without error 00% 43066 -
# 8 Short offline Completed without error 00% 43042 -
# 9 Extended offline Completed without error 00% 43031 -
#10 Short offline Completed without error 00% 43024 -
#11 Short offline Completed without error 00% 43018 -
#12 Extended offline Completed without error 00% 43002 -
#13 Short offline Completed without error 00% 42994 -
#14 Short offline Completed without error 00% 42970 -
#15 Short offline Completed without error 00% 42946 -
#16 Short offline Completed without error 00% 42922 -
#17 Short offline Completed without error 00% 42898 -
#18 Short offline Completed without error 00% 42874 -
#19 Short offline Completed without error 00% 42850 -
#20 Extended offline Completed without error 00% 42833 -
#21 Short offline Completed without error 00% 42826 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
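Answer 1
Your SMART Current_Pending_Sector value is 1. This means the disk has one bad sector that the drive firmware has not yet been able to reallocate. However, your Reallocated_Sector_Ct is still zero, so the drive is probably recoverable, even though it has been running for about 5 years in a less than healthy environment, with temperatures as high as 72 °C.
You can try to locate the bad sector by reading the whole disk: dd if=/dev/sda of=/dev/null. Without conv=noerror, dd stops with an I/O error when it reaches an unreadable sector.
The sector then needs to be remapped by overwriting it.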
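The reasoning above compares two counters, attribute 197 (pending) and attribute 5 (reallocated). As a rough sketch, the raw values can be pulled out of the smartctl attribute table with a small awk filter. The helper name smart_raw is made up for illustration, and it assumes the ten-column table layout shown in the question.

```shell
#!/bin/sh
# smart_raw ID
# Reads `smartctl -A` (or `-a`) text on stdin and prints the RAW_VALUE
# column of the attribute whose ID# matches. Hypothetical helper name;
# assumes RAW_VALUE starts in the 10th whitespace-separated column.
smart_raw() {
    awk -v id="$1" '$1 == id { print $10; exit }'
}

# Usage on a live system (needs root):
#   smartctl -A /dev/sda | smart_raw 197    # pending sectors
#   smartctl -A /dev/sda | smart_raw 5      # reallocated sectors
```

A pending count above zero together with a reallocated count of zero is the situation described in this answer.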
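Answer 2
As I understand it, the "197 Current_Pending_Sector" count is the number of sectors that have failed to read. It does hint that the drive is starting to fail, but it does not necessarily mean the drive is bad. If those sectors are rewritten, the drive firmware can remap them and the drive can carry on working.
A search also turns up discussions suggesting that some SSD models periodically flip this value to 1 and back, which may be a nearly harmless firmware bug in their SMART reporting.
So you can ignore it, as long as your filesystem can cope with an occasional bad block read (that is, there is some kind of robust RAID/redundancy underneath), and it may slowly clear as the filesystem overwrites those blocks. If your filesystem cannot cope with bad block reads, you may get an I/O error every time you try to read anything in that block. You can still recover by finding and deleting the affected file, and the filesystem will eventually rewrite the sector.
You can also clear the Current_Pending_Sector count by explicitly overwriting those sectors. This destroys data on the disk! It may corrupt the filesystem and so lose all the data on the disk, not just the data in the bad sectors. Only do this if you can afford to lose everything on the disk.
You can find the bad sectors by running a SMART long test;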
# smartctl -t long /dev/sda
You can then check the status of the long test to see whether it has finished, and the LBA of the first error it hit;
# smartctl -l selftest /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.7.0-1-amd64] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 17711 12345
# 2 Short offline Completed without error 00% 17709 -
# 3 Short captive Interrupted (host reset) 10% 450 -
# 4 Short captive Interrupted (host reset) 10% 228 -
Note: if you do not see a number under LBA_of_first_error, you can try -l xselftest instead.
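For scripting this step, here is a minimal sketch that picks the first numeric LBA_of_first_error out of the self-test log. The function name first_error_lba is made up for illustration, and it assumes the LBA is the last column of each log row, as in the output above.

```shell
#!/bin/sh
# first_error_lba
# Reads `smartctl -l selftest` text on stdin and prints the first
# LBA_of_first_error that is a number (rows without an error show "-").
# Hypothetical helper; prints nothing if no row reports an LBA.
first_error_lba() {
    awk '/^# *[0-9]+ / && $NF ~ /^[0-9]+$/ { print $NF; exit }'
}

# Usage on a live system (needs root):
#   smartctl -l selftest /dev/sda | first_error_lba
```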
You can then overwrite the bad sector with dd;
# dd if=/dev/zero of=/dev/sda seek=12345 count=1
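Note, however, that the default bs=512 block size may be smaller than the physical sector size of the drive, so you may need to write count=8 to wipe the sector completely. You can then rinse and repeat the test/overwrite cycle until all the bad sectors have been overwritten. Finally, you can check that the count has dropped to zero;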
$ sudo smartctl -A /dev/sdc | grep Current_Pending_Sector
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
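The block-size caveat above comes down to rounding the LBA down to a physical-sector boundary. The sketch below shows that arithmetic. aligned_range is a hypothetical helper, and the 4096-byte default is only an assumption about common drives. The drive in the question reports 512 bytes logical/physical, so there a single 512-byte block is enough.

```shell
#!/bin/sh
# aligned_range LBA [PHYSICAL_SECTOR_BYTES]
# LBA is in 512-byte units, as reported by smartctl. Prints the seek= and
# count= values (for dd with bs=512) that cover the whole physical sector
# containing that LBA. Hypothetical helper; defaults to 4096-byte sectors.
aligned_range() {
    lba=$1
    phys=${2:-4096}
    per=$((phys / 512))           # logical blocks per physical sector
    start=$((lba - lba % per))    # round down to the sector boundary
    echo "seek=$start count=$per"
}

# The physical sector size can be read from sysfs, for example:
#   aligned_range 12345 "$(cat /sys/block/sda/queue/physical_block_size)"
```

The printed values would replace seek=12345 count=1 in the dd command above.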
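I ran into this on an old WD 200G HDD that had some kind of glitch, possibly caused by a bad hot-plug connection, which left the Current_Pending_Sector count at 26. When I ran a long test it passed without finding any bad sectors, but the count stayed at 26. I was able to reset the counter to zero by zeroing the whole drive with dd;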
dd if=/dev/zero of=/dev/sda bs=1M status=progress
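Note that bs=1M makes this run much faster, but the drive size is unlikely to be an exact multiple of 1 MiB, so the last block written will be a partial one. Subsequent smartctl long and short tests both reported the drive as healthy.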
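Answer 3
This only means that one of the drives in your RAID array has a bad sector. There is nothing to worry about for now, unless more bad sectors start to appear on that disk. You should not try to fix the error by hand. It will be handled automatically by the raid-check command, which runs from /etc/cron.d/raid-check (every Sunday at 01:00 in the schedule shown here). You can inspect that cron entry and run the command manually to have the bad sector reallocated right away: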
[root@server]# more /etc/cron.d/raid-check
0 1 * * Sun root /usr/sbin/raid-check
[root@server]# /usr/sbin/raid-check
This forces mdraid to copy the bad block from the other disk in the RAID array, so that the drive can remap the bad sector and stop using it.
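To scrub a single array instead of all of them, the md driver's sysfs interface can be used directly. As far as I know, raid-check does much the same thing by writing "check" to each array's sync_action file. The sketch below assumes that interface. start_md_check is a made-up helper name, and the optional second argument exists only so the function can be tried against a scratch directory instead of the real /sys.

```shell
#!/bin/sh
# start_md_check MDNAME [SYSFS_ROOT]
# Requests a read-only consistency check ("check") on one md array by
# writing to its sync_action file. Hypothetical helper; SYSFS_ROOT
# defaults to /sys and is a parameter purely for dry runs.
start_md_check() {
    md=$1
    root=${2:-/sys}
    action="$root/block/$md/md/sync_action"
    [ -w "$action" ] || { echo "cannot write $action" >&2; return 1; }
    echo check > "$action"
}

# Usage (as root), then follow progress in /proc/mdstat:
#   start_md_check md2
#   cat /proc/mdstat
```

During a check, md reads every block from each member. When a read fails, it fetches the data from the other mirror and writes it back to the failing device, which is what gives the drive the chance to remap the sector.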
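Answer 4
You now have a Current Pending Sector on the sda hard disk. This disk is used in a software RAID. It is starting to die. You need to replace this disk.
This disk is designed for desktop computers. It tries to recover by itself (rather than leaving recovery to the RAID). It re-reads the bad sector every few seconds until the sector's checksum comes out correct (disk performance drops sharply while this happens), and then writes the data it read to a new spare sector. But data read from a recovered sector can often be wrong. A CRC32 checksum can only indicate that an error occurred; it is not suitable for recovering data (unlike, for example, XOR parity on RAID 5, which can be used for recovery). When the RAID reads this data from the bad drive, it gets data that differs from what was written. If the bad block holds important system data, this can lead to a kernel crash. That is why you need to replace the bad disk.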