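My system has an SSD, plus an external hard drive (connected via eSATA) that serves as the backup medium. I use "rsync" to sync the internal drive to the external one. My backup drive used to be a Seagate ST1000LM024. The problem was that the file system on that drive kept getting corrupted: it would turn read-only, or fill up with empty directories, or even end up with a damaged superblock. After reformatting the drive I could sometimes use it again, but it soon failed again. Eventually I could not even format the drive anymore.
I assumed the drive was faulty (even though it showed no SMART errors) and replaced it with a more expensive Western Digital WD10JPVX.
After the replacement I could use the new drive for a few months, and then it started showing corruption as well. I formatted it again and got it back into working order. Today it failed again. I formatted it, but the very first write of new data failed. So I thought this might be an "rsync"-related problem: maybe it writes a lot of data in parallel and the drive "breaks down" as a result. So I copied the data over with plain "cp" instead. That ran fine for quite a while, then failed again.
To me, the SMART data does not look too suspicious.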
# smartctl -a /dev/sdc
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.6-300.fc23.x86_64] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Western Digital Blue Mobile
Device Model: WDC WD10JPVX-22JC3T0
Serial Number: [removed]
LU WWN Device Id: [removed]
Firmware Version: 01.01A01
User Capacity: 1.000.204.886.016 bytes [1,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2 (minor revision not indicated)
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is: Tue Apr 12 18:42:47 2016 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: (17640) seconds.
Offline data collection
capabilities: (0x7b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 198) minutes.
Conveyance self-test routine
recommended polling time: ( 5) minutes.
SCT capabilities: (0x7035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 180 174 021 Pre-fail Always - 1991
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 57
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 14
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 33
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 30
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 44
194 Temperature_Celsius 0x0022 109 094 000 Old_age Always - 38
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 1
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
#
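As a rough aid for reading a dump like the one above, the following Python sketch pulls out two things that describe the link rather than the platters: the negotiated SATA speed and the raw value of UDMA_CRC_Error_Count. The function names, the WATCHED set and the shortened SAMPLE text are illustrative choices, not part of smartctl. A link negotiated below the drive's maximum can be perfectly normal on an older port, and a single CRC error proves nothing by itself, so treat the output as a hint only.

```python
import re

# Hand-picked, non-exhaustive set of attributes whose raw value should stay at 0.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

LINK_RE = re.compile(
    r"SATA Version is:\s*SATA [\d.]+,\s*([\d.]+) Gb/s \(current:\s*([\d.]+) Gb/s\)"
)


def link_speeds(text):
    """Return (max_gbps, current_gbps) from the 'SATA Version is' line, or None."""
    m = LINK_RE.search(text)
    return (float(m.group(1)), float(m.group(2))) if m else None


def parse_attributes(text):
    """Return {attribute_name: raw_value} for attribute rows with an integer raw value."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Expected row: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            try:
                attrs[fields[1]] = int(fields[9])
            except ValueError:
                pass  # raw values in other formats are skipped in this sketch
    return attrs


def findings(text):
    """Collect link-speed and watched-attribute observations as short notes."""
    notes = []
    speeds = link_speeds(text)
    if speeds and speeds[1] < speeds[0]:
        notes.append(
            f"link negotiated at {speeds[1]} Gb/s, drive supports {speeds[0]} Gb/s"
        )
    for name, raw in parse_attributes(text).items():
        if name in WATCHED and raw > 0:
            notes.append(f"{name} raw value is {raw}")
    return notes


# Shortened excerpt of the smartctl output shown above.
SAMPLE = """\
SATA Version is: SATA 3.0, 6.0 Gb/s (current: 1.5 Gb/s)
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 1
"""

for note in findings(SAMPLE):
    print(note)
```

On a live system the text would come from saving the output of `smartctl -a /dev/sdc` to a file and passing its contents to `findings`.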
However, dmesg shows a lot of these...
[12609.913805] ata4.00: exception Emask 0x10 SAct 0x700000 SErr 0x400100 action 0x6 frozen
[12609.913817] ata4.00: irq_stat 0x08000000, interface fatal error
[12609.913824] ata4: SError: { UnrecovData Handshk }
[12609.913832] ata4.00: failed command: WRITE FPDMA QUEUED
[12609.913843] ata4.00: cmd 61/00:a0:00:29:ad/08:00:09:00:00/40 tag 20 ncq 1048576 out
res 40/00:b4:00:60:ad/00:00:09:00:00/40 Emask 0x10 (ATA bus error)
[12609.913849] ata4.00: status: { DRDY }
[12609.913853] ata4.00: failed command: WRITE FPDMA QUEUED
[12609.913863] ata4.00: cmd 61/00:a8:00:20:ad/09:00:09:00:00/40 tag 21 ncq 1179648 out
res 40/00:b4:00:60:ad/00:00:09:00:00/40 Emask 0x10 (ATA bus error)
[12609.913867] ata4.00: status: { DRDY }
[12609.913871] ata4.00: failed command: WRITE FPDMA QUEUED
[12609.913880] ata4.00: cmd 61/00:b0:00:60:ad/09:00:09:00:00/40 tag 22 ncq 1179648 out
res 40/00:b4:00:60:ad/00:00:09:00:00/40 Emask 0x10 (ATA bus error)
[12609.913885] ata4.00: status: { DRDY }
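The messages above already name the failing layer ("interface fatal error", "ATA bus error"), which points at the path between controller and drive rather than at the disk surface. As a small sketch of how the hex fields relate to the rest of the log, the Python below decodes SErr into the flag names the kernel prints and SAct into the NCQ tags that were in flight. The bit table is written from memory of the kernel's libata sources and should be checked against your kernel version; only the two bits seen here (UnrecovData and Handshk) are confirmed by the log itself. The function names are illustrative.

```python
import re

# SError bit positions and the short names libata prints (assumption: recalled
# from the kernel's libata code; verify against your kernel before relying on it).
SERR_BITS = {
    0: "RecovData", 1: "RecovComm", 8: "UnrecovData", 9: "Persist",
    10: "Proto", 11: "HostInt", 16: "PHYRdyChg", 17: "PHYInt",
    18: "CommWake", 19: "10B8B", 20: "Dispar", 21: "BadCRC",
    22: "Handshk", 23: "LinkSeq", 24: "TrStaTrns", 25: "UnrecFIS",
    26: "DevExch",
}

EXC_RE = re.compile(
    r"(ata\d+(?:\.\d+)?): exception Emask (0x[0-9a-fA-F]+) "
    r"SAct (0x[0-9a-fA-F]+) SErr (0x[0-9a-fA-F]+)"
)


def decode_serr(value):
    """Return the SError flag names set in value, lowest bit first."""
    return [name for bit, name in sorted(SERR_BITS.items()) if value & (1 << bit)]


def active_tags(sact):
    """Return the NCQ tag numbers whose bits are set in the SAct mask."""
    return [bit for bit in range(32) if sact & (1 << bit)]


def summarize(log):
    """Return one dict per 'exception' line with the decoded tags and SErr flags."""
    events = []
    for m in EXC_RE.finditer(log):
        events.append({
            "port": m.group(1),
            "tags": active_tags(int(m.group(3), 16)),
            "serr": decode_serr(int(m.group(4), 16)),
        })
    return events


# First lines of the dmesg excerpt shown above.
SAMPLE = """\
[12609.913805] ata4.00: exception Emask 0x10 SAct 0x700000 SErr 0x400100 action 0x6 frozen
[12609.913817] ata4.00: irq_stat 0x08000000, interface fatal error
[12609.913824] ata4: SError: { UnrecovData Handshk }
[12609.913832] ata4.00: failed command: WRITE FPDMA QUEUED
"""

for event in summarize(SAMPLE):
    print(event)
```

For this log, SAct 0x700000 decodes to tags 20, 21 and 22, the same three tags listed in the failed WRITE FPDMA QUEUED commands, and SErr 0x400100 decodes to the two flags the kernel printed.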
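Does anyone know what the problem is here and what I should do about it? I do not feel safe without a working backup drive.
This is on a Fedora 23 system with kernel 4.4.6-300. There may be newer versions available, but I actually wanted to take a backup before applying the latest updates, which is how I ran into these problems in the first place.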
Answer 1
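OK, it looks like this really is a controller-related problem. I connected the drive to an older controller. It now runs noticeably slower, but so far it has completed a full sync (starting from a fresh format) without dropping out.
Of course I will keep a close eye on this drive, but I think it is fine. The old Seagate drive I replaced was probably fine too, and was simply running on a faulty controller.
This is a strange problem, though...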