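I have an LVM setup spanning multiple hard drives, and one of the drives seems to be failing, or at least something odd is going on with it. Whenever there is heavy write activity on the logical volume series, the running program (most of the time rTorrent) crashes and dmesg reports: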
ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x1810000 action 0xe frozen
ata6.00: irq_stat 0x00400000, PHY RDY changed
ata6: SError: { PHYRdyChg LinkSeq TrStaTrns }
ata6.00: failed command: FLUSH CACHE EXT
ata6.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
res 40/00:2c:ff:e3:e3/00:00:39:00:00/40 Emask 0x10 (ATA bus error)
ata6.00: status: { DRDY }
ata6: hard resetting link
ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata6.00: configured for UDMA/133
end_request: I/O error, dev sdf, sector 0
ata6: EH complete
I/O error in filesystem ("dm-3") meta-data dev dm-3 block 0x640092a ("xlog_iodone") error 5 buf count 32768
xfs_force_shutdown(dm-3,0x2) called from line 1043 of file fs/xfs/xfs_log.c. Return address = 0xffffffff8119b919
Filesystem "dm-3": Log I/O Error Detected. Shutting down filesystem: dm-3
Please umount the filesystem, and rectify the problem(s)
xfs_force_shutdown(dm-3,0x2) called from line 811 of file fs/xfs/xfs_log.c. Return address = 0xffffffff8119ccfb
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
... and so on
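The SError value in the log above can be decoded by hand to check it against the names the kernel printed. This is a minimal sketch, assuming the diagnostic bits (16 to 26) of the SATA SError register and the labels libata uses in dmesg; decode_serror is a hypothetical helper name, not an existing tool.

```shell
#!/usr/bin/env bash
# Decode the diagnostic half (bits 16-26) of a SATA SError value.
# The names follow the labels libata prints in dmesg; decode_serror is a
# hypothetical helper written for this post.
decode_serror() {
    local value=$(( $1 ))   # accepts hex such as 0x1810000
    local names=(PHYRdyChg PHYInt CommWake 10B8B Dispar BadCRC
                 Handshk LinkSeq TrStaTrns UnrecFIS DevExch)
    local out=() i
    for i in "${!names[@]}"; do
        # Bit 16 corresponds to names[0], bit 17 to names[1], and so on.
        if (( value & (1 << (16 + i)) )); then
            out+=("${names[i]}")
        fi
    done
    echo "${out[*]}"
}

decode_serror 0x1810000
```

For 0x1810000 this yields the same three flags as the log line: a PHY ready change, a link sequence error and a transport state transition error. All three are link-level conditions rather than media errors.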
The volume itself:
--- Logical volume ---
LV Name /dev/storage/series
VG Name storage
LV UUID sF6I3A-Ttt5-PEml-BY5i-edOV-43ha-5P75Z3
LV Write Access read/write
LV Status available
# open 1
LV Size 2.86 TiB
Current LE 748800
Segments 29
Allocation inherit
Read ahead sectors auto
- currently set to 256
Block device 253:3
I then umounted all of the LVM volumes and tried to run xfs_check on one of them (all of the logical volumes use XFS). It said:
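ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_check. If you are unable to mount the filesystem, then use the xfs_repair -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.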
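So I went ahead and ran mount, which worked fine, then unmounted the volume and ran the check again. It runs for a while until it gets killed for using too much memory.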
# xfs_check /dev/storage/series
/usr/sbin/xfs_check: line 31: 14350 Killed
xfs_db$DBOPTS -F -i -p xfs_check -c "check$OPTS" $1
dmesg then reports:
xfs_db invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
xfs_db cpuset=/ mems_allowed=0
Pid: 14350, comm: xfs_db Tainted: P 2.6.32-gentoo-r7 #1
Call Trace:
[<ffffffff81067aec>] ? 0xffffffff81067aec
[<ffffffff8107a848>] 0xffffffff8107a848
[<ffffffff8104ee2c>] ? 0xffffffff8104ee2c
[<ffffffff8107ac83>] 0xffffffff8107ac83
[<ffffffff8107adf1>] 0xffffffff8107adf1
[<ffffffff8107d460>] 0xffffffff8107d460
[<ffffffff8129d69e>] ? 0xffffffff8129d69e
[<ffffffff8108a40d>] 0xffffffff8108a40d
[<ffffffff8108bd67>] 0xffffffff8108bd67
[<ffffffff810258ff>] 0xffffffff810258ff
[<ffffffff8140290f>] 0xffffffff8140290f
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
CPU 1: hi: 0, btch: 1 usd: 0
DMA32 per-cpu:
CPU 0: hi: 186, btch: 31 usd: 103
CPU 1: hi: 186, btch: 31 usd: 177
Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 35
CPU 1: hi: 186, btch: 31 usd: 155
active_anon:717606 inactive_anon:271926 isolated_anon:0
active_file:155 inactive_file:217 isolated_file:0
unevictable:0 dirty:0 writeback:48 unstable:0
free:6959 slab_reclaimable:1102 slab_unreclaimable:4133
mapped:156 shmem:0 pagetables:3644 bounce:0
DMA free:15888kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15272kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2999 4009 4009
DMA32 free:10020kB min:6052kB low:7564kB high:9076kB active_anon:2377112kB inactive_anon:594248kB active_file:252kB inactive_file:268kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3071904kB mlocked:0kB dirty:0kB writeback:16kB mapped:196kB shmem:0kB slab_reclaimable:1620kB slab_unreclaimable:3980kB kernel_stack:56kB pagetables:3636kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:800 all_unreclaimable? yes
lowmem_reserve[]: 0 0 1010 1010
Normal free:1928kB min:2036kB low:2544kB high:3052kB active_anon:493312kB inactive_anon:493456kB active_file:368kB inactive_file:600kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1034240kB mlocked:0kB dirty:0kB writeback:176kB mapped:428kB shmem:0kB slab_reclaimable:2788kB slab_unreclaimable:12552kB kernel_stack:1008kB pagetables:10940kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2872 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
DMA: 0*4kB 0*8kB 3*16kB 3*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15888kB
DMA32: 459*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 10020kB
Normal: 482*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1928kB
2990 total pagecache pages
2626 pages in swap cache
Swap cache stats: add 129611, delete 126985, find 334/869
Free swap = 0kB
Total swap = 498004kB
1048560 pages RAM
34218 pages reserved
1846 pages shared
1006066 pages non-shared
Out of memory: kill process 14350 (xfs_db) score 105765 or a child
Killed process 14350 (xfs_db)
The memory problem is quite possibly unrelated, but I don't know why xfs_check would need that much.
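To put the OOM report into more familiar units, here is a rough sketch of the arithmetic. It assumes 4 KiB pages, which is consistent with the report itself (the per-zone active_anon figures in kB add up to four times the active_anon page count). The variable and function names are mine.

```shell
#!/usr/bin/env bash
# Convert page counts from the OOM report into GiB.
# Assumption: 4 KiB pages (matches the kB figures in the same report).
page_kib=4
ram_pages=1048560                    # "1048560 pages RAM"
anon_pages=$(( 717606 + 271926 ))    # active_anon + inactive_anon, system-wide
swap_total_kib=498004                # "Total swap = 498004kB"

# to_gib: print a KiB amount as GiB with two decimals.
to_gib() { awk -v k="$1" 'BEGIN { printf "%.2f", k / 1024 / 1024 }'; }

echo "RAM:            $(to_gib $(( ram_pages * page_kib ))) GiB"
echo "anonymous:      $(to_gib $(( anon_pages * page_kib ))) GiB"
echo "swap (0 free):  $(to_gib "$swap_total_kib") GiB"
```

In other words, nearly all of the roughly 4 GiB of RAM was anonymous memory and the half gigabyte of swap was exhausted when xfs_db was killed. The anonymous figure is system-wide, so it is an upper bound on what xfs_db itself was holding.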
smartctl has this to say about the drive:
# smartctl -a /dev/sdf
smartctl 5.39.1 2010-01-28 r3054 [x86_64-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Caviar Blue Serial ATA family
Device Model: WDC WD5000AAKS-00YGA0
Serial Number: WD-WCAS80682099
Firmware Version: 12.01C02
User Capacity: 500,107,862,016 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Tue May 17 23:17:17 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x82) Offline data collection activity
was completed without error.
Auto Offline Data Collection: Enabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (13200) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 154) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x303f) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0003 226 181 021 Pre-fail Always - 3675
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 33
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x000e 200 200 051 Old_age Always - 0
9 Power_On_Hours 0x0032 061 061 000 Old_age Always - 28688
10 Spin_Retry_Count 0x0012 100 253 051 Old_age Always - 0
11 Calibration_Retry_Count 0x0012 100 253 051 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 32
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 19
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 35
194 Temperature_Celsius 0x0022 112 095 000 Old_age Always - 38
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1
200 Multi_Zone_Error_Rate 0x0008 200 200 051 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 28541 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
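SMART seems to think nothing is seriously wrong, but clearly something is going on. Unfortunately, I have no idea what to try next. I would like to avoid swapping cables or replacing the drive until I am sure it is necessary, but any suggestions are welcome.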
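To make the attribute table easier to scan, the following sketch prints only the attributes with a non-zero raw value among a handful of IDs. The choice of IDs (5, 196, 197, 198, 199) is my assumption about which ones are usually watched for media or link trouble, and flag_attrs is a hypothetical helper name. The sample rows are copied from the output above.

```shell
#!/usr/bin/env bash
# Print ID, name and raw value for selected SMART attributes whose raw
# value is non-zero. The ID list is an assumption, not an official set.
flag_attrs() {
    awk '$1 ~ /^(5|196|197|198|199)$/ && $NF + 0 > 0 { print $1, $2, $NF }'
}

# Sample rows taken from the smartctl output in this post.
flag_attrs <<'EOF'
  5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0012 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 200 200 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1
EOF
```

On a live system the same filter could be fed from smartctl -A /dev/sdf. With the values above, the only row it prints is UDMA_CRC_Error_Count with a raw value of 1; the reallocation and pending-sector counters are all zero.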
Update
As suggested by @Zoredache, I ran badblocks on the drive.
# badblocks -s /dev/sdf
Checking for bad blocks (read-only test): done
As far as I know, this should print a list of bad blocks, so the empty output means it did not find any...