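I have a NAS running Ubuntu 16.04 LTS Server i386 (kernel 4.4.211-0404211-generic). I use it with samba to share movies and music over my home network.
I tried to copy about 100 GB of data from one HDD to another HDD that I bought recently. Both are formatted as ext4.
Both are mounted at boot by fstab, as follows: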
UUID=... /media/HDD{number_of_hdd} ext4 defaults,errors=remount-ro 0 2
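But everything I tried failed, because everything comes out (partially) corrupted. All my files are still readable, for example by VLC on my Windows PC through my samba share, but there are many small pauses in the music and visual corruption in the videos. In VLC's statistics I also see a lot of decoded blocks flagged as Dropped (discontinued), and the counter goes up by one every time I notice an audio or visual glitch.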
So I checked the checksums of the source and destination files (with 7z h {file} -scrcSHA256 and with md5sum), and they differ every time.
I made the copies with cp and with rsync, and both failed. The full commands I used:
cp -r {source} {destination}
rsync -Pa {source} {destination}
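For reference, the checksum comparison can be run over whole trees in one pass instead of file by file. This is only a sketch, assuming GNU coreutils and findutils are installed; the function names tree_sums and compare_trees are made up for illustration, and SHA-256 stands in for whichever hash is preferred:

```shell
#!/bin/sh
# Hypothetical helpers, not part of any existing tool.

# Print "digest  ./relative/path" for every regular file under $1,
# sorted by path so that two listings can be compared line by line.
tree_sums() {
    (cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum)
}

# Exit status 0 if both trees hold identical file contents; otherwise
# print the differing checksum lines and return diff's non-zero status.
compare_trees() {
    tmp=$(mktemp) || return 2
    tree_sums "$1" > "$tmp"
    tree_sums "$2" | diff "$tmp" -
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

It would be called as compare_trees {source} {destination}, with the same placeholders as in the commands above.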
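I checked the SMART values of both drives and found nothing wrong. I then ran fsck -f -y on both drives as well, and fsck told me everything was fine. I also ran a memory test with memtest86+, and everything was fine. apt update reports no available updates.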
After several hours of trying to find the problem, I noticed a few things (all of them with both cp and rsync):
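- For small files created with truncate (tested with 20 MB), the checksums are correct
- For a folder containing small files (5 files of about 20 MB each), the checksums are correct
- If I copy the whole 100 GB directory, both commands tell me everything went fine, but the checksums do not match.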
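I compared two files, one from the original directory and one from the corrupted directory, and found some differences:
If necessary, I can send an archive containing the original and the corrupted files.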
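To see where two such files differ, rather than only that they differ, the byte-level differences can be grouped by block. A rough sketch, assuming GNU cmp and awk; diff_blocks is a hypothetical name, and the 4096-byte block size is chosen only because both drives report 4096-byte physical sectors:

```shell
#!/bin/sh
# Hypothetical helper: summarize where two files differ.
# cmp -l prints one line per differing byte: the 1-based byte offset,
# then the two byte values in octal. awk maps each offset to a 4096-byte
# block index and counts the differing bytes per block.
# Output: one "block_index differing_byte_count" line per affected block.
diff_blocks() {
    cmp -l "$1" "$2" | awk '
        { blocks[int(($1 - 1) / 4096)]++ }
        END { for (b in blocks) print b, blocks[b] }
    ' | sort -n
}
```

Whether the damage shows up as a few flipped bytes or as whole blocks that differ may help narrow down the layer involved, although that interpretation is a guess and not something the output proves.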
Server specs:
==================================================
system A7N8X-E
/0 bus A7N8X-E
/0/0 memory 64KiB BIOS
/0/4 processor AMD Athlon(tm) XP 2800+
/0/4/9 memory 128KiB L1 cache
/0/4/a memory 512KiB L2 cache
/0/26 memory 3GiB System Memory
/0/26/0 memory 1GiB DIMM DRAM Synchronous
/0/26/1 memory 1GiB DIMM DRAM Synchronous
/0/26/2 memory 1GiB DIMM DRAM Synchronous
/0/100 bridge nForce2 IGP2
/0/100/0.1 memory RAM memory
/0/100/0.2 memory RAM memory
/0/100/0.3 memory RAM memory
/0/100/0.4 memory RAM memory
/0/100/0.5 memory RAM memory
/0/100/1 bridge nForce2 ISA Bridge
/0/100/1.1 bus nForce2 SMBus (MCP)
/0/100/2 bus nForce2 USB Controller
/0/100/2/1 usb2 bus OHCI PCI host controller
/0/100/2.1 bus nForce2 USB Controller
/0/100/2.1/1 usb3 bus OHCI PCI host controller
/0/100/2.2 bus nForce2 USB Controller
/0/100/2.2/1 usb1 bus EHCI Host Controller
/0/100/4 enp0s4 network nForce2 Ethernet Controller
/0/100/8 bridge nForce2 External PCI Bridge
/0/100/8/4 enp1s4 network 88E8001 Gigabit Ethernet Controller
/0/100/8/a storage SiI 3114 [SATALink/SATARaid] Serial ATA Controller
/0/100/9 storage nForce2 IDE
/0/100/1e bridge nForce2 AGP
/0/1 scsi0 storage
/0/1/0.0.0 /dev/sda disk 1TB ST1000DM010-2EP1
/0/1/0.0.0/1 /dev/sda1 volume 928GiB EXT4 volume
/0/1/0.0.0/2 /dev/sda2 volume 3070MiB Extended partition
/0/1/0.0.0/2/5 /dev/sda5 volume 3070MiB Linux swap / Solaris partition
/0/2 scsi1 storage
/0/2/0.0.0 /dev/sdb disk 1TB ST1000LM048-2E71
/0/2/0.0.0/1 /dev/sdb1 volume 465GiB EXT4 volume
/0/2/0.0.0/2 /dev/sdb2 volume 465GiB Linux filesystem partition
/0/3 scsi2 storage
/0/3/0.0.0 /dev/sdc disk 3TB ST3000DM007-1WY1
/0/3/0.0.0/1 /dev/sdc1 volume 2794GiB EXT4 volume
/0/5 scsi3 storage
/0/5/0.0.0 /dev/sdd disk 1TB ST1000LM048-2E71
/0/5/0.0.0/1 /dev/sdd1 volume 931GiB Windows NTFS volume
/1 virbr0-nic network Ethernet interface
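I hope this can be solved soon. Thanks.
Edit 1 (01/26/2020):
- I ran memtest86+ for three and a half hours (exactly 2 passes), and it found nothing wrong with the RAM.
- I also checked dmesg for any messages about corruption (CRC errors). Nothing is reported, even while I am making a copy (and the checksums at the destination are wrong again)...
- I am running 2 long SMART tests on the two HDDs (source and destination) to check that nothing is wrong
- I also noticed that if I interrupt a file copy partway through (with CTRL + C, for example), it seems to corrupt the ext4 filesystem. I do not know how or why...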
Edit 2 (01/26/2020): here are the two SMART reports for the two drives:
Source drive:
smartctl 6.5 2016-01-24 r4214 [i686-linux-4.4.211-0404211-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: ST3000DM007-1WY10G
Serial Number: WFN2CMWR
LU WWN Device Id: 5 000c50 0cc67ff74
Firmware Version: 0001
User Capacity: 3 000 592 982 016 bytes [3,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5425 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is: Sun Jan 26 21:35:04 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x73) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 359) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x30a5) SCT Status supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 079 064 006 Pre-fail Always - 74326255
3 Spin_Up_Time 0x0003 096 096 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 13
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 067 060 045 Pre-fail Always - 5351732
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 66 (132 227 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 13
183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always - 0
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 067 060 040 Old_age Always - 33 (Min/Max 30/38)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 5
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 23
194 Temperature_Celsius 0x0022 033 040 000 Old_age Always - 33 (0 23 0 0 0)
195 Hardware_ECC_Recovered 0x001a 079 064 000 Old_age Always - 74326255
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 0
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 55 (238 164 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 2782921433
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 1050108775
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 66 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Destination drive:
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: ST1000LM048-2E7172
Serial Number: ZDEBV755
LU WWN Device Id: 5 000c50 0b24d84fd
Firmware Version: SDM1
User Capacity: 1 000 204 886 016 bytes [1,00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 3b
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is: Sun Jan 26 18:34:49 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x71) SMART execute Offline immediate.
No Auto Offline data collection support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 1) minutes.
Extended self-test routine
recommended polling time: ( 162) minutes.
Conveyance self-test routine
recommended polling time: ( 2) minutes.
SCT capabilities: (0x3035) SCT Status supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 083 064 006 Pre-fail Always - 193912808
3 Spin_Up_Time 0x0003 099 099 000 Pre-fail Always - 0
4 Start_Stop_Count 0x0032 100 100 020 Old_age Always - 228
5 Reallocated_Sector_Ct 0x0033 100 100 036 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 070 060 045 Pre-fail Always - 9188472
9 Power_On_Hours 0x0032 092 092 000 Old_age Always - 7832 (182 235 0)
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 020 Old_age Always - 197
184 End-to-End_Error 0x0032 100 100 099 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
188 Command_Timeout 0x0032 100 100 000 Old_age Always - 0
189 High_Fly_Writes 0x003a 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0022 074 058 040 Old_age Always - 26 (Min/Max 26/31)
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 0
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 78
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 1655
194 Temperature_Celsius 0x0022 026 042 000 Old_age Always - 26 (0 16 0 0 0)
197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always - 1
240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline - 7545 (117 142 0)
241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline - 5699868670
242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline - 10646554103
254 Free_Fall_Sensor 0x0032 100 100 000 Old_age Always - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Completed without error 00% 7832 -
# 2 Short offline Completed without error 00% 7830 -
# 3 Short offline Completed without error 00% 3141 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Edit 3 (02/15/2020):
I finally solved the problem by setting the libata.force=noncq Linux kernel parameter. It appears to be a bug in the kernel, and it should be fixed soon (https://bugs.launchpad.net/ubuntu/+bug/1861300)
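For anyone wanting to apply the same workaround, here is a minimal sketch of how the parameter might be made persistent and then checked. It assumes a stock GRUB 2 setup as on a default Ubuntu install; the function names are made up for illustration:

```shell
#!/bin/sh
# To make the parameter persistent (assumption: stock GRUB 2 on Ubuntu):
#   1. In /etc/default/grub, add libata.force=noncq inside the quotes of
#      GRUB_CMDLINE_LINUX_DEFAULT="..."
#   2. Run: sudo update-grub
#   3. Reboot

# Hypothetical helper: succeeds if the kernel command line stored in file
# $1 (default: /proc/cmdline of the running kernel) contains exactly
# libata.force=noncq. Comma-separated libata.force lists are not handled.
noncq_requested() {
    tr ' ' '\n' < "${1:-/proc/cmdline}" | grep -qxF 'libata.force=noncq'
}

# Hypothetical helper: print the queue depth in use for each sd* disk.
# With NCQ disabled, each value is expected to read 1.
show_queue_depths() {
    for f in /sys/block/sd*/device/queue_depth; do
        [ -r "$f" ] && printf '%s: %s\n' "$f" "$(cat "$f")"
    done
    return 0
}
```

After the reboot, noncq_requested followed by show_queue_depths should confirm whether the setting took effect.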
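Answer 1
(Note: this assumes you are not getting a bunch of kernel errors (check dmesg or journalctl -b -k) or a large number of CRC errors in the drives' SMART status. If you are, there are some software things to try first, such as turning off NCQ.)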
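Usually this means bad RAM, even if memtest86+ passed (how long did you run it?), unless you have ECC RAM, which I doubt given those specs.
Make sure you have not done anything crazy, such as finding SATA cables over 1 m long and wrapping them around the CPU. That said, SATA data transfers are protected by a CRC, so if the corruption were happening there you should be getting a lot of errors. SATA cables are cheap, so you can always try swapping them.
If you do not want to just replace the RAM, the next step is to narrow down where the corruption occurs.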
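On each drive, run md5sum repeatedly on a large file that shows the problem (it needs to be about 2x the size of RAM, so that it cannot be checked from cache), or on a set of files, or something similar. Do it many times, for a few hours, say. Do you always get the same result? If not, there is corruption on the read path. If you always get the same result, reads are probably not being corrupted, which makes RAM less likely.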
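The repeated-read test could be scripted along these lines. This is a sketch assuming GNU coreutils; distinct_digests is a hypothetical name:

```shell
#!/bin/sh
# Hypothetical helper: hash file $1 $2 times in a row and print how many
# distinct digests were seen. 1 means every pass read back the same bytes.
# For the result to mean anything, the reads must come from the disk and
# not from the page cache: use a file larger than RAM as suggested above,
# or (as root) run "sync; echo 3 > /proc/sys/vm/drop_caches" between passes.
distinct_digests() {
    i=0
    while [ "$i" -lt "$2" ]; do
        md5sum < "$1"
        i=$((i + 1))
    done | sort -u | wc -l
}
```

Any result above 1 would point at the read path, as described above.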
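If read corruption occurs on both disks, start by replacing the RAM. If that does not fix it, you could try the power supply, and finally the SATA controller (probably soldered onto the motherboard, so you would have to replace the board).
If read corruption occurs on one disk but not on both, replace that disk. If that does not fix it and you have a backplane (used for hot-swap in servers), it may be defective. You could also try swapping the cable, or a different SATA port. The assumption here is that one bad disk can happen, but two are quite unlikely. Honestly, I would swap the RAM before assuming two bad disks.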
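If both disks consistently read back the same data, first confirm that you actually checked enough data to be sure it was not coming from cache; I would want at least twice the size of RAM. Then you can repeatedly write some known data to each disk and see whether reading it back gives different values. After that, the remedies are much the same as above.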
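One way to sketch the write-then-read-back test, assuming GNU coreutils. The name write_verify is made up, and random data hashed on its way to the disk is used here in place of a fixed known pattern:

```shell
#!/bin/sh
# Hypothetical helper: write $2 MiB of random data to a probe file in
# directory $1, hashing the stream in memory on its way to the disk, then
# hash the file as read back. Prints "match" or "MISMATCH".
write_verify() {
    probe="$1/write_verify.probe"
    written=$(head -c "$(( $2 * 1048576 ))" /dev/urandom | tee "$probe" | sha256sum)
    sync
    # Caveat: unless the page cache is dropped first (or the probe is
    # larger than RAM), this read may be served from memory, not the disk.
    readback=$(sha256sum < "$probe")
    rm -f "$probe"
    if [ "$written" = "$readback" ]; then
        echo match
    else
        echo MISMATCH
    fi
}
```

Running it in a loop against a directory on each disk gives a rough idea of whether the write path or the read path is at fault.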
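PS: Corruption like this is insidious. In particular, it may have corrupted random bits of your Linux distribution, not just your data. Once the cause is fixed, it is usually best to reinstall. At the very least, you need to check every distribution-provided file against a known-good checksum; some distributions provide a utility for this. That still does not confirm that the distribution's dynamic data files (the list of installed packages, for example) are intact, but at least you can be sure the binaries are fine.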
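Answer 2
I had the same problem a long time ago. It turned out to be a corrupted BIOS. It is unlikely to be the RAM: if it were, you would also get random crashes "for free", and the problem would occur on both drives, not just one (am I right to assume the problem only occurs on the new drive?)
I would focus on this: take the copy out of the process and replace it with a write-only test. Use dd to create a file with a large block size (dd if=/dev/zero of=myfile bs=1M count=100). Find the exact size at which it breaks.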
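A sketch of how that size search might be automated, assuming GNU dd and coreutils. find_breaking_size is a hypothetical name, and doubling the size at each step is just one possible search strategy:

```shell
#!/bin/sh
# Hypothetical helper: write zero-filled files of doubling size (in MiB)
# into directory $1, up to $2 MiB, and report the first size whose
# read-back checksum differs from the checksum of the same number of zero
# bytes hashed straight from /dev/zero (which never touches the disk).
find_breaking_size() {
    size=1
    while [ "$size" -le "$2" ]; do
        dd if=/dev/zero of="$1/ddprobe" bs=1M count="$size" 2>/dev/null
        sync
        expected=$(head -c "$((size * 1048576))" /dev/zero | sha256sum)
        actual=$(sha256sum < "$1/ddprobe")
        rm -f "$1/ddprobe"
        if [ "$expected" != "$actual" ]; then
            echo "mismatch at ${size} MiB"
            return 1
        fi
        size=$((size * 2))
    done
    echo "no mismatch up to $2 MiB"
}
```

Two caveats: the read-back can be served from the page cache unless the file is larger than RAM, and all-zero data can reveal altered bytes but not blocks that were written to the wrong place.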