Both of our servers are reporting:
mdstat mismatch cnt unsynchronized blocks
We get this error at the start of every month, and each time we have to repair the RAID with:
echo 'repair' >/sys/block/<md id>/md/sync_action
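If I remember correctly, the check is triggered by mdcheck_start.timer / mdcheck_start.service.
The repair takes about 5 hours, after which the array is back in sync, at least as far as I can tell.
The question is: is this the right way to fix unsynchronized blocks on a RAID? What causes them? How can I tell whether this is a hardware/disk error? Thanks!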
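For reference, the counter behind this alert can be read directly from sysfs. Below is a minimal sketch, assuming the arrays appear as /sys/block/md*/md and expose the kernel's mismatch_cnt and sync_action attributes. The function name report_md_mismatch and its optional root-directory argument are made up for illustration and are not part of mdadm.

```shell
#!/bin/sh
# Print mismatch_cnt and the current sync_action for every md array.
# The optional first argument (hypothetical, for illustration) overrides
# the sysfs root, which defaults to /sys/block.
report_md_mismatch() {
    root="${1:-/sys/block}"
    for dir in "$root"/md*/md; do
        # Skip if the glob matched nothing or the attribute is unreadable.
        [ -r "$dir/mismatch_cnt" ] || continue
        name=$(basename "$(dirname "$dir")")
        count=$(cat "$dir/mismatch_cnt")
        action=$(cat "$dir/sync_action" 2>/dev/null)
        printf '%s mismatch_cnt=%s sync_action=%s\n' "$name" "$count" "$action"
    done
}

report_md_mismatch
```

As far as I understand it, mismatch_cnt is only refreshed by a check or repair pass, so the number shown reflects the last completed scan rather than the live state of the array.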
Edit: /etc/fstab contains:
# /etc/fstab: static file system information.
# / was on /dev/md2p1 during curtin installation
/dev/disk/by-id/md-uuid-b0b68adb:353b70e8:fa806910:a78761e9-part1 / ext4 defaults 0 0
# /vol/data was on /dev/md3p1 during curtin installation
/dev/disk/by-id/md-uuid-2360fc63:991922f4:33aae17f:12f23590-part1 /vol/data ext4 defaults 0 0
# /boot was on /dev/md0p1 during curtin installation
/dev/disk/by-id/md-uuid-a76428ff:270597e7:70ed6c91:026d2441-part1 /boot ext4 defaults 0 0
UUID="5c389b41-007d-4893-b81c-5560cb2d6ff9" /vol/backup ext4 defaults 0 0
172.30.0.199:/vol/shared /vol/shared nfs defaults 0 0
Output of lsblk --discard:
NAME         DISC-ALN DISC-GRAN DISC-MAX DISC-ZERO
loop0               0        4K       4G         0
loop1               0        4K       4G         0
loop2               0        4K       4G         0
loop3               0        4K       4G         0
loop4               0        4K       4G         0
loop5               0        4K       4G         0
loop6               0        4K       4G         0
loop7               0        4K       4G         0
loop8               0        4K       4G         0
sda                 0        4K       2G         0
├─sda1              0        4K       2G         0
├─sda2              0        4K       2G         0
│ └─md0             0        4K       2G         0
│   └─md0p1         0        4K       2G         0
├─sda3              0        4K       2G         0
│ └─md1             0        4K       2G         0
│   └─md1p1         0        4K       2G         0
└─sda4              0        4K       2G         0
  └─md2             0        4K       2G         0
    └─md2p1         0        4K       2G         0
sdb                 0        4K       2G         0
├─sdb1              0        4K       2G         0
├─sdb2              0        4K       2G         0
│ └─md0             0        4K       2G         0
│   └─md0p1         0        4K       2G         0
├─sdb3              0        4K       2G         0
│ └─md1             0        4K       2G         0
│   └─md1p1         0        4K       2G         0
└─sdb4              0        4K       2G         0
  └─md2             0        4K       2G         0
    └─md2p1         0        4K       2G         0
sdc                 0        0B       0B         0
└─sdc1              0        0B       0B         0
nvme1n1             0      512B       2T         0
└─md3               0      512B       2T         0
  └─md3p1           0      512B       2T         0
nvme0n1             0      512B       2T         0
└─md3               0      512B       2T         0
  └─md3p1           0      512B       2T         0
Output of smartctl -i /dev/sd[ab]:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-92-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Intel S4510/S4610/S4500/S4600 Series SSDs
Device Model: INTEL SSDSC2KG960G8
Serial Number: BTYG024601ZC960CGN
LU WWN Device Id: 5 5cd2e4 152b3fddf
Firmware Version: XCV10120
User Capacity: 960,197,124,096 bytes [960 GB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.2, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Feb 2 07:43:15 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
Output of mdadm --detail /dev/md2:
/dev/md2:
Version : 1.2
Creation Time : Tue Nov 24 21:02:34 2020
Raid Level : raid1
Array Size : 919731200 (877.12 GiB 941.80 GB)
Used Dev Size : 919731200 (877.12 GiB 941.80 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Wed Feb 2 07:43:33 2022
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Consistency Policy : bitmap
Name : ubuntu-server:2
UUID : b0b68adb:353b70e8:fa806910:a78761e9
Events : 24281
Number Major Minor RaidDevice State
0 8 4 0 active sync /dev/sda4
1 8 20 1 active sync /dev/sdb4
Output of smartctl -A -l error /dev/sda:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-92-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 10469
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 8
170 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
174 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 7
175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 2591 (8 65535)
183 SATA_Downshift_Count 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error_Count 0x0033 100 100 090 Pre-fail Always - 0
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
190 Drive_Temperature 0x0022 079 075 000 Old_age Always - 21 (Min/Max 12/27)
192 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 7
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 21
197 Pending_Sector_Count 0x0012 100 100 000 Old_age Always - 0
199 CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1006057
226 Workld_Media_Wear_Indic 0x0032 100 100 000 Old_age Always - 419
227 Workld_Host_Reads_Perc 0x0032 100 100 000 Old_age Always - 52
228 Workload_Minutes 0x0032 100 100 000 Old_age Always - 628023
232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0
234 Thermal_Throttle_Status 0x0032 100 100 000 Old_age Always - 0/0
235 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 2591 (8 65535)
241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1006057
242 Host_Reads_32MiB 0x0032 100 100 000 Old_age Always - 1112548
243 NAND_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1730576
SMART Error Log Version: 1
No Errors Logged
Output of smartctl -A -l error /dev/sdb:
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-92-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0032 100 100 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 10469
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 8
170 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
174 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 7
175 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 2479 (8 65535)
183 SATA_Downshift_Count 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error_Count 0x0033 100 100 090 Pre-fail Always - 0
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
190 Drive_Temperature 0x0022 078 073 000 Old_age Always - 22 (Min/Max 12/29)
192 Unsafe_Shutdown_Count 0x0032 100 100 000 Old_age Always - 7
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 22
197 Pending_Sector_Count 0x0012 100 100 000 Old_age Always - 0
199 CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
225 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1064411
226 Workld_Media_Wear_Indic 0x0032 100 100 000 Old_age Always - 440
227 Workld_Host_Reads_Perc 0x0032 100 100 000 Old_age Always - 45
228 Workload_Minutes 0x0032 100 100 000 Old_age Always - 628005
232 Available_Reservd_Space 0x0033 100 100 010 Pre-fail Always - 0
233 Media_Wearout_Indicator 0x0032 100 100 000 Old_age Always - 0
234 Thermal_Throttle_Status 0x0032 100 100 000 Old_age Always - 0/0
235 Power_Loss_Cap_Test 0x0033 100 100 010 Pre-fail Always - 2479 (8 65535)
241 Host_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1064411
242 Host_Reads_32MiB 0x0032 100 100 000 Old_age Always - 876800
243 NAND_Writes_32MiB 0x0032 100 100 000 Old_age Always - 1801020
SMART Error Log Version: 1
No Errors Logged