My RAID1 goes into a degraded state from time to time, and my applications then crash because the filesystem on the array ends up in read-only mode. After a reboot the RAID works fine again. I would now like to find the root cause of this, and maybe someone can give me some hints on where to start looking.
This is the state after a reboot; it then works fine for several days:
root@node:~# sudo mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Tue May 17 21:43:06 2022
Raid Level : raid1
Array Size : 1953382464 (1862.89 GiB 2000.26 GB)
Used Dev Size : 1953382464 (1862.89 GiB 2000.26 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Thu Jun 30 11:05:30 2022
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Consistency Policy : bitmap
Name : node:0 (local to host node)
UUID : 449cfe85:fb2d3888:83ff4d80:3b4b007d
Events : 26471
Number Major Minor RaidDevice State
0 8 0 0 active sync /dev/sda
1 8 16 1 active sync /dev/sdb
And this is the state after the "unknown" event has happened:
root@node:/var/log# sudo mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Tue May 17 21:43:06 2022
Raid Level : raid1
Array Size : 1953382464 (1862.89 GiB 2000.26 GB)
Used Dev Size : 1953382464 (1862.89 GiB 2000.26 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Thu Jun 30 06:15:29 2022
State : clean, degraded
Active Devices : 1
Working Devices : 1
Failed Devices : 1
Spare Devices : 0
Consistency Policy : bitmap
Number Major Minor RaidDevice State
- 0 0 0 removed
1 8 16 1 active sync /dev/sdb
0 8 0 - faulty /dev/sda
Sometimes it is sdb that fails, sometimes sda. There is no pattern to when it happens, nor is one of the two drives predominantly the failing one. The SSDs are brand new and I have seen this behaviour from the very beginning. As I wrote, after a reboot the RAID is back to normal.
/etc/mdadm/mdadm.conf
# automatically tag new arrays as belonging to the local system
HOMEHOST <system>
# instruct the monitoring daemon where to send mail alerts
MAILADDR [email protected]
MAILFROM [email protected]
# definitions of existing MD arrays
# This configuration was auto-generated on Thu, 21 Apr 2022 01:01:03 +0000 by mkconf
ARRAY /dev/md0 level=raid1 num-devices=2 metadata=1.2 spares=0 name=node:0 UUID=449cfe85:fb2d3888:83ff4d80:3b4b007d
devices=/dev/sda,/dev/sdb
If the cause can't be found, is there a setting that prevents the RAID from switching to read-only mode? I thought RAID was a high-availability solution, but if one of the two devices has a problem my applications crash because they can no longer write their files to disk.
System: Ubuntu 22.04 LTS, RAID1 --> 2x Samsung 870 EVO 2.5" SSD, 2 TB
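Side note on that question: judging by the "Remounting filesystem read-only" line in the log below, the read-only switch comes from ext4's errors=remount-ro mount policy, not from md itself. As far as I can tell the current policy can be checked and changed with tune2fs, though changing it would only hide the symptom, so I have left it alone:
sudo tune2fs -l /dev/md0 | grep -i "errors behavior"
sudo tune2fs -e continue /dev/md0    # accepted values: continue / remount-ro / panic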
cat /var/log/kern.log | grep md0
Jun 30 04:03:04 node kernel: [ 4.970441] md/raid1:md0: not clean -- starting background reconstruction
Jun 30 04:03:04 node kernel: [ 4.970446] md/raid1:md0: active with 2 out of 2 mirrors
Jun 30 04:03:04 node kernel: [ 4.974972] md0: detected capacity change from 0 to 3906764928
Jun 30 04:03:04 node kernel: [ 4.975043] md: resync of RAID array md0
Jun 30 04:03:04 node kernel: [ 9.763722] EXT4-fs (md0): recovery complete
Jun 30 04:03:04 node kernel: [ 9.768258] EXT4-fs (md0): mounted filesystem with ordered data mode. Opts: (null). Quota mode: none.
Jun 30 04:03:04 node kernel: [ 12.678657] md: md0: resync done.
Jun 30 06:14:53 node kernel: [ 7927.757074] md/raid1:md0: Disk failure on sda, disabling device.
Jun 30 06:14:53 node kernel: [ 7927.757074] md/raid1:md0: Operation continuing on 1 devices.
Jun 30 06:15:28 node kernel: [ 7962.903309] EXT4-fs warning (device md0): ext4_end_bio:342: I/O error 10 writing to inode 80478214 starting block 154449626)
Jun 30 06:15:28 node kernel: [ 7962.903312] Buffer I/O error on device md0, logical block 154449626
Jun 30 06:15:28 node kernel: [ 7962.903319] Buffer I/O error on dev md0, logical block 471859204, lost async page write
Jun 30 06:15:28 node kernel: [ 7962.903323] Buffer I/O error on dev md0, logical block 450888194, lost async page write
Jun 30 06:15:28 node kernel: [ 7962.903327] Buffer I/O error on dev md0, logical block 284164106, lost async page write
Jun 30 06:15:28 node kernel: [ 7962.903329] Buffer I/O error on dev md0, logical block 284164105, lost async page write
Jun 30 06:15:28 node kernel: [ 7962.903331] Buffer I/O error on dev md0, logical block 284164104, lost async page write
Jun 30 06:15:28 node kernel: [ 7962.903333] Buffer I/O error on dev md0, logical block 284164103, lost async page write
Jun 30 06:15:28 node kernel: [ 7962.903335] Buffer I/O error on dev md0, logical block 284164102, lost async page write
Jun 30 06:15:28 node kernel: [ 7962.903336] Buffer I/O error on dev md0, logical block 284164101, lost async page write
Jun 30 06:15:28 node kernel: [ 7962.903338] Buffer I/O error on dev md0, logical block 284164100, lost async page write
Jun 30 06:15:28 node kernel: [ 7962.903340] Buffer I/O error on dev md0, logical block 284164099, lost async page write
Jun 30 06:15:28 node kernel: [ 7962.903351] EXT4-fs warning (device md0): ext4_end_bio:342: I/O error 10 writing to inode 112728289 starting block 470803456)
Jun 30 06:15:28 node kernel: [ 7962.903352] Buffer I/O error on device md0, logical block 470803456
Jun 30 06:15:28 node kernel: [ 7962.903356] EXT4-fs warning (device md0): ext4_end_bio:342: I/O error 10 writing to inode 112728306 starting block 283967488)
Jun 30 06:15:28 node kernel: [ 7962.903357] EXT4-fs error (device md0): ext4_check_bdev_write_error:217: comm kworker/u64:2: Error while async write back metadata
Jun 30 06:15:28 node kernel: [ 7962.903372] Buffer I/O error on device md0, logical block 283967488
Jun 30 06:15:28 node kernel: [ 7962.903376] EXT4-fs warning (device md0): ext4_end_bio:342: I/O error 10 writing to inode 112728732 starting block 154806925)
Jun 30 06:15:28 node kernel: [ 7962.903378] Buffer I/O error on device md0, logical block 154806925
Jun 30 06:15:28 node kernel: [ 7962.903378] Buffer I/O error on device md0, logical block 283967489
Jun 30 06:15:28 node kernel: [ 7962.903379] Buffer I/O error on device md0, logical block 283967490
Jun 30 06:15:28 node kernel: [ 7962.903382] Aborting journal on device md0-8.
Jun 30 06:15:28 node kernel: [ 7962.903382] Buffer I/O error on device md0, logical block 283967491
Jun 30 06:15:28 node kernel: [ 7962.903385] Buffer I/O error on device md0, logical block 283967492
Jun 30 06:15:28 node kernel: [ 7962.903386] Buffer I/O error on device md0, logical block 283967493
Jun 30 06:15:28 node kernel: [ 7962.903387] Buffer I/O error on device md0, logical block 283967494
Jun 30 06:15:28 node kernel: [ 7962.903390] EXT4-fs error (device md0) in ext4_reserve_inode_write:5726: Journal has aborted
Jun 30 06:15:28 node kernel: [ 7962.903395] EXT4-fs error (device md0) in ext4_reserve_inode_write:5726: Journal has aborted
Jun 30 06:15:28 node kernel: [ 7962.903395] EXT4-fs error (device md0): ext4_dirty_inode:5922: inode #80478237: comm lnd: mark_inode_dirty error
Jun 30 06:15:28 node kernel: [ 7962.903397] EXT4-fs error (device md0): ext4_journal_check_start:83: comm tor: Detected aborted journal
Jun 30 06:15:28 node kernel: [ 7962.903398] EXT4-fs error (device md0) in ext4_dirty_inode:5923: Journal has aborted
Jun 30 06:15:28 node kernel: [ 7962.903399] EXT4-fs error (device md0): ext4_dirty_inode:5922: inode #80478214: comm lnd: mark_inode_dirty error
Jun 30 06:15:28 node kernel: [ 7962.903403] EXT4-fs error (device md0) in ext4_reserve_inode_write:5726: Journal has aborted
Jun 30 06:15:28 node kernel: [ 7962.903406] EXT4-fs error (device md0) in ext4_dirty_inode:5923: Journal has aborted
Jun 30 06:15:28 node kernel: [ 7962.903407] EXT4-fs error (device md0): mpage_map_and_submit_extent:2497: inode #80478214: comm kworker/u64:2: mark_inode_dirty error
Jun 30 06:15:28 node kernel: [ 7962.908521] EXT4-fs warning (device md0): ext4_end_bio:342: I/O error 10 writing to inode 80478214 starting block 154449627)
Jun 30 06:15:28 node kernel: [ 7962.908525] EXT4-fs (md0): I/O error while writing superblock
Jun 30 06:15:28 node kernel: [ 7962.908531] JBD2: Error -5 detected when updating journal superblock for md0-8.
Jun 30 06:15:28 node kernel: [ 7962.908542] EXT4-fs (md0): I/O error while writing superblock
Jun 30 06:15:28 node kernel: [ 7962.908544] EXT4-fs (md0): Remounting filesystem read-only
Jun 30 06:15:28 node kernel: [ 7962.908545] EXT4-fs (md0): failed to convert unwritten extents to written extents -- potential data loss! (inode 80478214, error -30)
Jun 30 06:15:28 node kernel: [ 7962.908550] EXT4-fs (md0): I/O error while writing superblock
Jun 30 06:15:28 node kernel: [ 7962.908560] EXT4-fs (md0): I/O error while writing superblock
Jun 30 06:32:13 node kernel: [ 5.076652] md/raid1:md0: not clean -- starting background reconstruction
Jun 30 06:32:13 node kernel: [ 5.076658] md/raid1:md0: active with 2 out of 2 mirrors
Jun 30 06:32:13 node kernel: [ 5.081202] md0: detected capacity change from 0 to 3906764928
Jun 30 06:32:13 node kernel: [ 5.081262] md: resync of RAID array md0
Jun 30 06:32:13 node kernel: [ 8.971854] EXT4-fs (md0): recovery complete
After some SMART and badblocks scans I found that one of the devices has block errors:
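The SMART output below was collected with smartmontools, along the lines of the following command; the device letter is an assumption, since the failing role alternates between sda and sdb:
sudo smartctl -a /dev/sdb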
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 099 099 010 Pre-fail Always - 6
9 Power_On_Hours 0x0032 099 099 000 Old_age Always - 1123
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 13
177 Wear_Leveling_Count 0x0013 099 099 000 Pre-fail Always - 8
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 099 099 010 Pre-fail Always - 6
181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0013 099 099 010 Pre-fail Always - 6
187 Reported_Uncorrect 0x0032 099 099 000 Old_age Always - 278
190 Airflow_Temperature_Cel 0x0032 054 035 000 Old_age Always - 46
195 Hardware_ECC_Recovered 0x001a 199 199 000 Old_age Always - 278
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
235 Unknown_Attribute 0x0012 099 099 000 Old_age Always - 6
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 31580059046
SMART Error Log Version: 1
ATA Error Count: 278 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
Error 278 occurred at disk power-on lifetime: 1122 hours (46 days + 18 hours)
When the command that caused the error occurred, the device was active or idle.
After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 38 40 19 04 40 Error: UNC at LBA = 0x00041940 = 268608
Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC Powered_Up_Time Command/Feature_Name
-- -- -- -- -- -- -- -- ---------------- --------------------
60 08 38 40 19 04 40 07 43d+18:19:46.250 READ FPDMA QUEUED
61 08 28 10 00 00 40 05 43d+18:19:46.250 WRITE FPDMA QUEUED
47 00 01 30 06 00 40 04 43d+18:19:46.250 READ LOG DMA EXT
47 00 01 30 00 00 40 04 43d+18:19:46.250 READ LOG DMA EXT
47 00 01 00 00 00 40 04 43d+18:19:46.250 READ LOG DMA EXT
And the badblocks scan:
root@node:/var/log# sudo badblocks -sv /dev/sda
Checking blocks 0 to 1953514583
Checking for bad blocks (read-only test): done
Pass completed, 0 bad blocks found. (0/0/0 errors)
root@node:/var/log# sudo badblocks -sv /dev/sdb
Checking blocks 0 to 1953514583
Checking for bad blocks (read-only test): 1063390992ne, 44:44 elapsed. (0/0/0 errors)
1063390993
1063390994ne, 44:45 elapsed. (2/0/0 errors)
1063390995
1063391056ne, 44:47 elapsed. (4/0/0 errors)
1063391057
1063391058
1063391059
1063395472ne, 44:48 elapsed. (8/0/0 errors)
1063397200ne, 44:49 elapsed. (9/0/0 errors)
1063397201ne, 44:50 elapsed. (10/0/0 errors)
...
What is the best procedure for replacing a disk in a RAID1? I could reduce the array to a single disk with
mdadm --grow /dev/md0 --raid-devices=1 --force
then replace the faulty disk and put it back into the RAID with
mdadm --grow /dev/md0 --raid-devices=2 --add /dev/sdb
but is that the correct way to do it?
Answer 1
A single broken drive is not enough to cause I/O errors on an MD RAID device. It should never show I/O errors, even when some of its component devices fail; that is the whole reason we use it. So check SMART on both devices, and check the RAM.
That is not the correct approach. You don't need to use grow mode at all.
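A quick way to do that (a sketch; adjust the device names to your setup):
sudo smartctl -a /dev/sda
sudo smartctl -a /dev/sdb
sudo smartctl -t long /dev/sda     # extended self-test; read the result later with smartctl -a
For the RAM, run memtester from userspace (e.g. sudo memtester 1024M 1) or, more thoroughly, boot a MemTest86+ image.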
If the bad disk hasn't already been flagged as faulty, you need to fail it (mdadm -f /dev/mdX /dev/sdYZ) and then remove it (mdadm -r /dev/mdX /dev/sdYZ).
When you have the new disk, partition it as needed and add it to the array (mdadm --add /dev/mdX /dev/sdYZ). The resync will start automatically; you can watch its progress with cat /proc/mdstat. By default the resync speed is capped at 200 MB/s; you can lift that limit by writing the desired value in KB/s to /sys/block/mdX/md/sync_speed_max.
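Put together, a replacement could look like this (a sketch assuming the array is /dev/md0 and the faulty member is /dev/sda on whole disks, as in your configuration; double-check the device names first):
mdadm /dev/md0 --fail /dev/sda                    # skip if md has already marked it faulty
mdadm /dev/md0 --remove /dev/sda
# swap the physical disk, then add the replacement:
mdadm /dev/md0 --add /dev/sda
cat /proc/mdstat                                  # watch the resync progress
echo 500000 > /sys/block/md0/md/sync_speed_max    # optional: raise the cap to ~500 MB/s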
Don't forget to install the boot loader on the new drive.
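That only applies if the machine actually boots from the RAID members, which doesn't appear to be the case here since md0 is built on whole, unpartitioned disks; if it did, it would be something like:
sudo grub-install /dev/sdY    # the new disk, not a partition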
Monitor SMART with an automated tool. Monitor the RAID. Scrub your RAID monthly (echo check > /sys/block/mdX/md/sync_action); Debian does this automatically.
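A minimal setup could look like this (a sketch; the schedule is a placeholder and the service names are as shipped on Debian/Ubuntu):
# /etc/smartd.conf: watch all drives, short self-test nightly at 02:00, long test Saturdays at 03:00, mail root on problems
DEVICESCAN -a -o on -S on -s (S/../.././02|L/../../6/03) -m root
# run smartd and the mdadm monitor (the latter mails the MAILADDR from mdadm.conf):
sudo systemctl enable --now smartmontools mdmonitor
The monthly check mentioned above is the /usr/share/mdadm/checkarray script that the Debian/Ubuntu mdadm package schedules from cron.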
You should absolutely be doing all of this. Don't assume RAID will save you if you don't monitor it.