Problem with a hard drive in LVM - is it broken?

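I have an LVM setup spanning several hard drives, and one of them seems to be failing, or at least something odd is going on. Whenever there is heavy write activity on the "series" logical volume, the running program (rTorrent in most cases) crashes and dmesg reports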

ata6.00: exception Emask 0x10 SAct 0x0 SErr 0x1810000 action 0xe frozen
ata6.00: irq_stat 0x00400000, PHY RDY changed
ata6: SError: { PHYRdyChg LinkSeq TrStaTrns }
ata6.00: failed command: FLUSH CACHE EXT
ata6.00: cmd ea/00:00:00:00:00/00:00:00:00:00/a0 tag 0
         res 40/00:2c:ff:e3:e3/00:00:39:00:00/40 Emask 0x10 (ATA bus error)
ata6.00: status: { DRDY }
ata6: hard resetting link
ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata6.00: configured for UDMA/133
end_request: I/O error, dev sdf, sector 0
ata6: EH complete
I/O error in filesystem ("dm-3") meta-data dev dm-3 block 0x640092a       ("xlog_iodone") error 5 buf count 32768
xfs_force_shutdown(dm-3,0x2) called from line 1043 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8119b919
Filesystem "dm-3": Log I/O Error Detected.  Shutting down filesystem: dm-3
Please umount the filesystem, and rectify the problem(s)
xfs_force_shutdown(dm-3,0x2) called from line 811 of file fs/xfs/xfs_log.c.  Return address = 0xffffffff8119ccfb
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
Filesystem "dm-3": xfs_log_force: error 5 returned.
... and so on
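
To gauge how often the link drops, the kernel log can be summarised per ATA port. The following is a minimal sketch, not a definitive tool: count_link_resets is a hypothetical helper name, and it assumes the log lines look like the ones above, optionally prefixed with a bracketed timestamp.

```shell
#!/bin/sh
# count_link_resets (hypothetical helper): reads kernel log text on stdin
# and prints "<port> <count>" for every ATA port that logged
# "hard resetting link". An optional "[timestamp]" prefix is tolerated.
count_link_resets() {
    grep -E '^(\[[^]]*\] *)?ata[0-9]+: hard resetting link' |
        sed -E 's/^(\[[^]]*\] *)?(ata[0-9]+):.*/\2/' |
        sort | uniq -c | awk '{ print $2, $1 }'
}

# Typical use (reading the kernel log may require root):
#   dmesg | count_link_resets
```

A count that climbs only during heavy writes would be consistent with what is described here, though it does not by itself distinguish a bad drive from a bad cable or port.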

The volume itself:

--- Logical volume ---
  LV Name                /dev/storage/series
  VG Name                storage
  LV UUID                sF6I3A-Ttt5-PEml-BY5i-edOV-43ha-5P75Z3
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                2.86 TiB
  Current LE             748800
  Segments               29
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:3

I then unmounted all the LVM volumes with umount and tried to run xfs_check on one of them (all logical volumes use XFS). It said

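ERROR: The filesystem has valuable metadata changes in a log which needs to be replayed. Mount the filesystem to replay the log, and unmount it before re-running xfs_check. If you are unable to mount the filesystem, then use the xfs_repair -L option to destroy the log and attempt a repair. Note that destroying the log may cause corruption -- please attempt a mount of the filesystem before doing this.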

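So I went ahead and ran mount, which worked fine, then unmounted the volume and ran the check again. It ran for a while until it was killed for using too much memory.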

# xfs_check /dev/storage/series 
/usr/sbin/xfs_check: line 31: 14350 Killed 
                xfs_db$DBOPTS -F -i -p xfs_check -c "check$OPTS" $1

dmesg then reports

xfs_db invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0
xfs_db cpuset=/ mems_allowed=0
Pid: 14350, comm: xfs_db Tainted: P           2.6.32-gentoo-r7 #1
Call Trace:
 [<ffffffff81067aec>] ? 0xffffffff81067aec
 [<ffffffff8107a848>] 0xffffffff8107a848
 [<ffffffff8104ee2c>] ? 0xffffffff8104ee2c
 [<ffffffff8107ac83>] 0xffffffff8107ac83
 [<ffffffff8107adf1>] 0xffffffff8107adf1
 [<ffffffff8107d460>] 0xffffffff8107d460
 [<ffffffff8129d69e>] ? 0xffffffff8129d69e
 [<ffffffff8108a40d>] 0xffffffff8108a40d
 [<ffffffff8108bd67>] 0xffffffff8108bd67
 [<ffffffff810258ff>] 0xffffffff810258ff
 [<ffffffff8140290f>] 0xffffffff8140290f
Mem-Info:
DMA per-cpu:
CPU    0: hi:    0, btch:   1 usd:   0
CPU    1: hi:    0, btch:   1 usd:   0
DMA32 per-cpu:
CPU    0: hi:  186, btch:  31 usd: 103
CPU    1: hi:  186, btch:  31 usd: 177
Normal per-cpu:
CPU    0: hi:  186, btch:  31 usd:  35
CPU    1: hi:  186, btch:  31 usd: 155
active_anon:717606 inactive_anon:271926 isolated_anon:0
 active_file:155 inactive_file:217 isolated_file:0
 unevictable:0 dirty:0 writeback:48 unstable:0
 free:6959 slab_reclaimable:1102 slab_unreclaimable:4133
 mapped:156 shmem:0 pagetables:3644 bounce:0
DMA free:15888kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15272kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
lowmem_reserve[]: 0 2999 4009 4009
DMA32 free:10020kB min:6052kB low:7564kB high:9076kB active_anon:2377112kB inactive_anon:594248kB active_file:252kB inactive_file:268kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3071904kB mlocked:0kB dirty:0kB writeback:16kB mapped:196kB shmem:0kB slab_reclaimable:1620kB slab_unreclaimable:3980kB kernel_stack:56kB pagetables:3636kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:800 all_unreclaimable? yes
lowmem_reserve[]: 0 0 1010 1010
Normal free:1928kB min:2036kB low:2544kB high:3052kB active_anon:493312kB inactive_anon:493456kB active_file:368kB inactive_file:600kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1034240kB mlocked:0kB dirty:0kB writeback:176kB mapped:428kB shmem:0kB slab_reclaimable:2788kB slab_unreclaimable:12552kB kernel_stack:1008kB pagetables:10940kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2872 all_unreclaimable? yes
lowmem_reserve[]: 0 0 0 0
DMA: 0*4kB 0*8kB 3*16kB 3*32kB 2*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15888kB
DMA32: 459*4kB 1*8kB 1*16kB 1*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 1*4096kB = 10020kB
Normal: 482*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 1928kB
2990 total pagecache pages
2626 pages in swap cache
Swap cache stats: add 129611, delete 126985, find 334/869
Free swap  = 0kB
Total swap = 498004kB
1048560 pages RAM
34218 pages reserved
1846 pages shared
1006066 pages non-shared
Out of memory: kill process 14350 (xfs_db) score 105765 or a child
Killed process 14350 (xfs_db)

The memory problem is most likely unrelated, but I don't know why xfs_check needs that much.

smartctl has this to say about the drive:

# smartctl -a /dev/sdf
smartctl 5.39.1 2010-01-28 r3054 [x86_64-pc-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Blue Serial ATA family
Device Model:     WDC WD5000AAKS-00YGA0
Serial Number:    WD-WCAS80682099
Firmware Version: 12.01C02
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Tue May 17 23:17:17 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (13200) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 154) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0003   226   181   021    Pre-fail  Always       -       3675
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       33
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   061   061   000    Old_age   Always       -       28688
 10 Spin_Retry_Count        0x0012   100   253   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       32
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       19
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       35
194 Temperature_Celsius     0x0022   112   095   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     28541         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

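SMART doesn't seem to think anything is seriously wrong, but something is clearly going on. Unfortunately, I don't know what to try next. I'd like to avoid swapping the cable or replacing the drive until I'm sure it's necessary, but any suggestions are welcome.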

Update

Following @Zoredache's suggestion, I ran badblocks on the drive.

# badblocks -s /dev/sdf
Checking for bad blocks (read-only test): done

As far as I know, this should print a list of bad blocks, so the empty output means it didn't find any...

Answer 1

Try turning off NCQ on the problematic drive (references: this page, this page):

echo 1 > /sys/block/sdX/device/queue_depth
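
If you want to see the previous value before changing it, a small wrapper can help. This is only a sketch: set_queue_depth is a hypothetical helper, and the optional third argument (an alternative sysfs root) exists purely so the function can be tried on a scratch directory first.

```shell
#!/bin/sh
# set_queue_depth DEV DEPTH [SYSFS_ROOT]  (hypothetical helper)
# Prints the current queue depth, writes the new one, then prints it back.
# A depth of 1 effectively disables NCQ for that device.
set_queue_depth() {
    dev=$1
    depth=$2
    root=${3:-/sys}   # overridable so the function can be tried on a scratch tree
    f="$root/block/$dev/device/queue_depth"
    [ -w "$f" ] || { echo "cannot write $f" >&2; return 1; }
    echo "old queue_depth: $(cat "$f")"
    echo "$depth" > "$f"
    echo "new queue_depth: $(cat "$f")"
}

# Example (as root; sdf is the drive from the question):
#   set_queue_depth sdf 1
```

A value written through sysfs does not survive a reboot, so it would have to be reapplied at boot if it turns out to help.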

You could also try replacing the drive's SATA cable, since a weak or marginal electrical connection can cause this kind of error as well.
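
One way to judge whether a cable change helps is to watch SMART attribute 199 (UDMA_CRC_Error_Count), which has a raw value of 1 in the smartctl output above and is commonly associated with transfer errors on the link rather than on the platters. The sketch below assumes the attribute table layout shown in the question; crc_error_count is a hypothetical helper name.

```shell
#!/bin/sh
# crc_error_count (hypothetical helper): reads smartctl attribute rows on
# stdin and prints the RAW_VALUE (last column) of attribute ID 199.
crc_error_count() {
    awk '$1 == 199 { print $NF }'
}

# Example (as root; -A prints only the attribute table):
#   smartctl -A /dev/sdf | crc_error_count
```

If the number keeps rising under heavy writes with the old cable and stops rising with a new one, that would point at the connection rather than the drive.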

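As for the memory problem you hit when running xfs_check: you simply need more RAM and/or swap space. This is a fairly large filesystem, so it doesn't surprise me that xfs_check needs a lot of memory.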
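
For a sense of scale, the figures in the OOM report give the total memory xfs_check had to work with. This is a rough calculation that assumes 4 kB pages, which matches the "4kB" entries in the same report; the commented commands are standard tools, but the swap file path and size are made-up examples.

```shell
#!/bin/sh
# Figures copied from the OOM report in the question.
ram_pages=1048560   # "1048560 pages RAM"
page_kb=4           # assumed page size in kB
swap_kb=498004      # "Total swap = 498004kB"

total_kb=$(( ram_pages * page_kb + swap_kb ))
echo "RAM + swap: $total_kb kB (about $(( total_kb / 1024 )) MiB)"

# Possible ways forward (run as root; path and size are examples only):
#   dd if=/dev/zero of=/swapfile bs=1M count=8192   # 8 GiB swap file
#   chmod 600 /swapfile && mkswap /swapfile && swapon /swapfile
#   xfs_repair -n /dev/storage/series               # check-only mode, modifies nothing
```

The xfs_repair -n run is offered as an alternative checker to try, on the assumption that it copes better with a filesystem of this size; it has not been measured here.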
