File corruption between 2 hard drives

I have a NAS running Ubuntu 16.04 LTS Server i386 (kernel 4.4.211-0404211-generic). I use it with Samba to share movies and music over my home network.

I tried to copy about 100 GB of data from one HDD to another, recently purchased HDD; both are formatted as ext4.

Both are mounted at boot through fstab, like this:

UUID=... /media/HDD{number_of_hdd} ext4 defaults,errors=remount-ro 0 2

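But everything I tried failed, for one reason: everything ends up (partially) corrupted. All my files are readable, for example by VLC on my Windows PC through my Samba share, but there are many short pauses in the music and visual corruption in the videos. I noticed a high Dropped (discontinued) count next to the decoded blocks in VLC's statistics tab, and it goes up by one each time I notice an audio or visual glitch.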

So I checked the checksums (using 7z h {file} -scrcSHA256 and md5sum on both the source and destination files), and they differ every time.

I used cp and rsync to make the copies; both failed. The full commands I used:

cp -r {source} {destination}
rsync -Pa {source} {destination}

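I checked the SMART values of both HDDs and found nothing wrong. I then ran fsck -f -y on both HDDs as well, and fsck told me everything was fine. I also tested the RAM with memtest86+, and it passed. apt update shows no available updates.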

After hours of trying to find the problem, I noticed a few things (all of them with both cp and rsync):

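  • For a small file created with truncate (tested with 20 MB), the checksum is correct
  • For a folder of small files (5 files of about 20 MB each), the checksums are correct
  • If I copy the whole 100 GB directory, both commands report that everything went fine, but the checksums do not match.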
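
To see exactly which files in a large copy differ, rather than just that something does, both trees can be hashed and the two lists compared. This is a minimal sketch, assuming bash and GNU coreutils/findutils; verify_copy is a hypothetical helper name and the paths in the usage line are only illustrative.

```bash
#!/usr/bin/env bash
# verify_copy SRC DST: hash every regular file under SRC and DST and print the
# relative paths whose SHA-256 differs or that exist on one side only.
# (verify_copy is a hypothetical helper name, not an existing tool.)
verify_copy() {
    local src=$1 dst=$2
    {
        # diff exits non-zero when the lists differ; that is the expected case here.
        diff \
            <(cd "$src" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum) \
            <(cd "$dst" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum) \
            || true
    } | awk '/^[<>]/ { sub(/^[<>] [0-9a-f]+ [ *]/, ""); print }' | sort -u
}
```

Usage would look like verify_copy /media/HDD1/music /media/HDD2/music; empty output means every file matched. Files that were just written may be served from the page cache rather than the disk, so remounting the destination (or dropping caches) before verifying gives a more honest result.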

I compared two files from the original and the corrupted directories and found some differences: [1] [2]
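
A byte-level comparison like this can also be summarised per block, which helps show whether the damage comes in aligned chunks (for example whole 4 KiB blocks) or as scattered bytes. A rough sketch, assuming bash, GNU cmp and awk; diff_blocks is a hypothetical helper name and the 4096-byte default block size is an assumption.

```bash
#!/usr/bin/env bash
# diff_blocks A B [BLOCKSIZE]: list the block indices (0-based) in which two
# files differ and how many bytes differ inside each block.
# (diff_blocks is a hypothetical helper name.)
diff_blocks() {
    local a=$1 b=$2 bs=${3:-4096}
    # cmp -l prints one line per differing byte: the 1-based offset, then both
    # byte values in octal. It exits non-zero when the files differ.
    { cmp -l "$a" "$b" 2>/dev/null || true; } |
        awk -v bs="$bs" '
            { n[int(($1 - 1) / bs)]++ }
            END { for (k in n) printf "block %d: %d differing bytes\n", k, n[k] }' |
        sort -t' ' -k2,2n
}
```

For example, diff_blocks original.mkv corrupted.mkv (file names illustrative) prints one line per damaged block. On very large files with many differences the cmp -l output gets big, so it may be worth trying on one affected file first.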

If necessary, I can send an archive containing the original and the corrupted files.

Server specs:

==================================================
                            system     A7N8X-E
/0                          bus        A7N8X-E
/0/0                        memory     64KiB BIOS
/0/4                        processor  AMD Athlon(tm) XP 2800+
/0/4/9                      memory     128KiB L1 cache
/0/4/a                      memory     512KiB L2 cache
/0/26                       memory     3GiB System Memory
/0/26/0                     memory     1GiB DIMM DRAM Synchronous
/0/26/1                     memory     1GiB DIMM DRAM Synchronous
/0/26/2                     memory     1GiB DIMM DRAM Synchronous
/0/100                      bridge     nForce2 IGP2
/0/100/0.1                  memory     RAM memory
/0/100/0.2                  memory     RAM memory
/0/100/0.3                  memory     RAM memory
/0/100/0.4                  memory     RAM memory
/0/100/0.5                  memory     RAM memory
/0/100/1                    bridge     nForce2 ISA Bridge
/0/100/1.1                  bus        nForce2 SMBus (MCP)
/0/100/2                    bus        nForce2 USB Controller
/0/100/2/1      usb2        bus        OHCI PCI host controller
/0/100/2.1                  bus        nForce2 USB Controller
/0/100/2.1/1    usb3        bus        OHCI PCI host controller
/0/100/2.2                  bus        nForce2 USB Controller
/0/100/2.2/1    usb1        bus        EHCI Host Controller
/0/100/4        enp0s4      network    nForce2 Ethernet Controller
/0/100/8                    bridge     nForce2 External PCI Bridge
/0/100/8/4      enp1s4      network    88E8001 Gigabit Ethernet Controller
/0/100/8/a                  storage    SiI 3114 [SATALink/SATARaid] Serial ATA Controller
/0/100/9                    storage    nForce2 IDE
/0/100/1e                   bridge     nForce2 AGP
/0/1            scsi0       storage
/0/1/0.0.0      /dev/sda    disk       1TB ST1000DM010-2EP1
/0/1/0.0.0/1    /dev/sda1   volume     928GiB EXT4 volume
/0/1/0.0.0/2    /dev/sda2   volume     3070MiB Extended partition
/0/1/0.0.0/2/5  /dev/sda5   volume     3070MiB Linux swap / Solaris partition
/0/2            scsi1       storage
/0/2/0.0.0      /dev/sdb    disk       1TB ST1000LM048-2E71
/0/2/0.0.0/1    /dev/sdb1   volume     465GiB EXT4 volume
/0/2/0.0.0/2    /dev/sdb2   volume     465GiB Linux filesystem partition
/0/3            scsi2       storage
/0/3/0.0.0      /dev/sdc    disk       3TB ST3000DM007-1WY1
/0/3/0.0.0/1    /dev/sdc1   volume     2794GiB EXT4 volume
/0/5            scsi3       storage
/0/5/0.0.0      /dev/sdd    disk       1TB ST1000LM048-2E71
/0/5/0.0.0/1    /dev/sdd1   volume     931GiB Windows NTFS volume
/1              virbr0-nic  network    Ethernet interface

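I hope this problem can be solved soon. Thanks.

Edit 1 (01/26/2020):

- I ran memtest86+ for three and a half hours (exactly 2 passes); it found nothing wrong with the RAM.

- I also checked dmesg for any messages about corruption (CRC errors); nothing was reported, even while I was making a copy (and the destination checksums were wrong again)...

- I am running 2 long SMART tests on both HDDs (source and destination) to check that nothing is wrong

- I also noticed that if I interrupt a file copy partway through (with CTRL + C, for example), it seems to corrupt the ext4 file system; I don't know how or why...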

Edit 2 (01/26/2020): here are the SMART reports for the two HDDs:

Source HDD:

smartctl 6.5 2016-01-24 r4214 [i686-linux-4.4.211-0404211-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST3000DM007-1WY10G
Serial Number:    WFN2CMWR
LU WWN Device Id: 5 000c50 0cc67ff74
Firmware Version: 0001
User Capacity:    3 000 592 982 016 bytes [3,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5425 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Sun Jan 26 21:35:04 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (    0) seconds.
Offline data collection
capabilities:            (0x73) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    ( 359) minutes.
Conveyance self-test routine
recommended polling time:    (   2) minutes.
SCT capabilities:          (0x30a5) SCT Status supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   079   064   006    Pre-fail  Always       -       74326255
  3 Spin_Up_Time            0x0003   096   096   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       13
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   067   060   045    Pre-fail  Always       -       5351732
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       66 (132 227 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       13
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   067   060   040    Old_age   Always       -       33 (Min/Max 30/38)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       23
194 Temperature_Celsius     0x0022   033   040   000    Old_age   Always       -       33 (0 23 0 0 0)
195 Hardware_ECC_Recovered  0x001a   079   064   000    Old_age   Always       -       74326255
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       55 (238 164 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2782921433
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1050108775

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        66         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Destination HDD:

Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     ST1000LM048-2E7172
Serial Number:    ZDEBV755
LU WWN Device Id: 5 000c50 0b24d84fd
Firmware Version: SDM1
User Capacity:    1 000 204 886 016 bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Sun Jan 26 18:34:49 2020 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (    0) seconds.
Offline data collection
capabilities:            (0x71) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    ( 162) minutes.
Conveyance self-test routine
recommended polling time:    (   2) minutes.
SCT capabilities:          (0x3035) SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   083   064   006    Pre-fail  Always       -       193912808
  3 Spin_Up_Time            0x0003   099   099   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       228
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   070   060   045    Pre-fail  Always       -       9188472
  9 Power_On_Hours          0x0032   092   092   000    Old_age   Always       -       7832 (182 235 0)
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       197
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   074   058   040    Old_age   Always       -       26 (Min/Max 26/31)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       78
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       1655
194 Temperature_Celsius     0x0022   026   042   000    Old_age   Always       -       26 (0 16 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       1
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       7545 (117 142 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       5699868670
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       10646554103
254 Free_Fall_Sensor        0x0032   100   100   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      7832         -
# 2  Short offline       Completed without error       00%      7830         -
# 3  Short offline       Completed without error       00%      3141         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Edit 3 (02/15/2020): I finally solved the problem by setting the libata.force=noncq Linux kernel parameter. This appears to be a kernel bug, and it should be fixed soon (https://bugs.launchpad.net/ubuntu/+bug/1861300).
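
For anyone wanting to try the same workaround, this is roughly how the parameter can be set and checked. It is a sketch assuming the stock Ubuntu GRUB layout (/etc/default/grub and update-grub); show_queue_depth is a hypothetical helper name, and any options already present on the GRUB line should be kept.

```bash
#!/usr/bin/env bash
# Persistent setting (assumes the usual Ubuntu/Debian GRUB setup):
#   1. In /etc/default/grub, add the parameter to the existing line, e.g.
#        GRUB_CMDLINE_LINUX_DEFAULT="libata.force=noncq"
#   2. sudo update-grub && sudo reboot
#   3. grep -o 'libata.force=[^ ]*' /proc/cmdline    # confirm it is active

# show_queue_depth [SYSFS_ROOT]: print the command queue depth of every sd*
# disk. A depth of 1 means NCQ is effectively off for that disk.
# (show_queue_depth is a hypothetical helper; SYSFS_ROOT defaults to /sys.)
show_queue_depth() {
    local root=${1:-/sys} f dev
    for f in "$root"/block/sd*/device/queue_depth; do
        [ -r "$f" ] || continue
        dev=${f#"$root"/block/}
        printf '%s %s\n' "${dev%%/*}" "$(cat "$f")"
    done
}
```

To test the effect without rebooting, the queue depth can also be lowered per disk at runtime with echo 1 | sudo tee /sys/block/sdX/device/queue_depth, which lasts until the next boot.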

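Answer 1

(Note: this assumes you are not getting a bunch of kernel errors (check dmesg or journalctl -b -k) or a large number of CRC errors in the drives' SMART status. If you are... start with some software-side measures, such as turning off NCQ.)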
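
The two checks in that note can be scripted along these lines. This is only a sketch: ata_errors and crc_count are hypothetical helper names, the grep pattern is a starting point rather than a complete list of libata messages, and crc_count assumes the usual ten-column smartctl -A attribute table.

```bash
#!/usr/bin/env bash
# ata_errors: filter kernel log text on stdin for lines that usually accompany
# SATA link or disk trouble. (Hypothetical helper; pattern list not exhaustive.)
ata_errors() {
    grep -iE 'ata[0-9]+(\.[0-9]+)?: .*(error|exception|failed|reset)|I/O error|BadCRC|UnrecovData' || true
}

# crc_count: read `smartctl -A` output on stdin and print the raw value of
# attribute 199 (UDMA_CRC_Error_Count), i.e. the tenth column of that row.
crc_count() {
    awk '$1 == 199 { print $10 }'
}
```

Typical use would be dmesg | ata_errors, journalctl -b -k | ata_errors, and sudo smartctl -A /dev/sdb | crc_count. A CRC count that keeps climbing during copies points at the cable or link rather than the platters.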

Typically, this means bad RAM, even if memtest86+ passes (how long did you run it?), unless you have ECC RAM, which I doubt given those specs.

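Make sure you haven't done anything crazy, like finding SATA cables longer than 1 m and wrapping them around the CPU. That said, SATA data transfers are CRC-protected, so if the corruption happened there you should be seeing a lot of errors. SATA cables are cheap; you can always try replacing them.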

If you'd rather not just replace the RAM, the next step is to narrow down when the corruption occurs.

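On each drive, repeatedly run md5sum (or similar) on a large file that shows the problem (it needs to be something like 2x the size of RAM, to keep it from being checked out of the cache) or on a set of files. Do this many times, for a few hours, say. Do you always get the same result? If not, there is corruption on the read path; if you always get the same result, reads are probably not being corrupted. That makes RAM less likely.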
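
The repeated-read test could be automated roughly as follows. A minimal sketch assuming bash and GNU coreutils; count_distinct_hashes is a hypothetical helper name, and the cache-dropping line is left commented out because it needs root.

```bash
#!/usr/bin/env bash
# count_distinct_hashes FILE [RUNS]: hash FILE RUNS times and print how many
# different MD5 values were seen (1 means every read agreed).
# (count_distinct_hashes is a hypothetical helper name.)
count_distinct_hashes() {
    local file=$1 runs=${2:-10} i
    for ((i = 0; i < runs; i++)); do
        # As root, uncomment to force a real disk read on each pass, which
        # matters when the file is smaller than about twice the RAM:
        # sync; echo 3 > /proc/sys/vm/drop_caches
        md5sum < "$file"
    done | sort -u | wc -l
}
```

A result greater than 1 on a file that nothing is writing to means the same data read back differently at least once, which puts the fault on the read path.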

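If read corruption shows up on both disks, start by replacing the RAM. If that doesn't fix it, try the power supply next, and finally the SATA controller (probably soldered to the motherboard, so you'd have to replace the board).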

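If read corruption shows up on one disk but not both, replace that disk. If that doesn't fix it and you have a backplane (for hot-swap in a server), it may be faulty. You can also try replacing the cable, or a different SATA port. The assumption here is that a bad disk can happen, though it is unlikely. Honestly... I'd swap the RAM before assuming two bad disks.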

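If both disks always read back the same data, first confirm that you actually checked enough data to be sure it wasn't cached; I'd want at least twice the RAM size. Then you can repeatedly write some known data to each disk and see whether reading it back gives different values. From there, the remedies are much the same as above.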
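
The write-then-read-back step might look like this. It is a sketch under assumptions: bash and GNU coreutils, a reference file whose contents you trust, and write_readback as a hypothetical helper name. Copies that fail the check are left in place so they can be inspected.

```bash
#!/usr/bin/env bash
# write_readback REF TARGET_DIR [ROUNDS]: copy REF into TARGET_DIR ROUNDS
# times, flush each copy to disk, and print how many copies have a SHA-256
# different from REF's. (write_readback is a hypothetical helper name.)
write_readback() {
    local ref=$1 dir=$2 rounds=${3:-5} want got bad=0 i
    want=$(sha256sum < "$ref")
    for ((i = 0; i < rounds; i++)); do
        cp "$ref" "$dir/readback.$i"
        sync
        # As root, drop the page cache here so the read really hits the disk:
        # echo 3 > /proc/sys/vm/drop_caches
        got=$(sha256sum < "$dir/readback.$i")
        if [ "$got" = "$want" ]; then
            rm -f "$dir/readback.$i"
        else
            bad=$((bad + 1))   # keep the bad copy for later comparison
        fi
    done
    echo "$bad"
}
```

Running it against a directory on each disk in turn (for example write_readback ref.bin /media/HDD2/tmp 20, paths illustrative) shows whether only one drive or both produce bad copies.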

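PS: corruption like this is insidious. In particular, it may have corrupted random bits of your Linux distribution, not just your data. After fixing the cause, it is usually best to reinstall. At a minimum, you need to check every distribution-provided file against known-good checksums; some distributions provide a utility for this. That still won't confirm that dynamic distribution data files (for example, the list of installed packages) are intact, but at least you can be sure the binaries are fine.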

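Answer 2

I had the same problem a long time ago. The cause was a corrupted BIOS. It is unlikely to be the RAM: if it were, you would also get random crashes "for free", and the problem would occur on both drives rather than one (am I right in assuming that the problem only occurs on the new drive?).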

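I would focus on this: take the copy out of the process and replace it with a write only. Use dd to create a file with a large block size (dd if=/dev/zero of=myfile bs=1M count=100). Find the exact size at which it breaks.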
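
The search for the breaking size can be automated by doubling the file size until a checksum fails. A rough sketch, assuming bash and GNU dd/head/sha256sum; find_break_size is a hypothetical helper name. The expected hash is computed by streaming the same number of zero bytes straight from /dev/zero, so no disk is involved in producing it.

```bash
#!/usr/bin/env bash
# find_break_size DIR [MAX_MIB]: write zero-filled files of 1, 2, 4, ... MiB
# into DIR with dd, read each one back, and print the first size (in MiB)
# whose checksum is wrong, or "none" if every size up to MAX_MIB verifies.
# (find_break_size is a hypothetical helper name.)
find_break_size() {
    local dir=$1 max=${2:-1024} mib want got
    for ((mib = 1; mib <= max; mib *= 2)); do
        # conv=fsync makes dd flush the file to disk before it exits.
        dd if=/dev/zero of="$dir/zeros.bin" bs=1M count="$mib" conv=fsync status=none
        want=$(head -c "${mib}M" /dev/zero | sha256sum)
        got=$(sha256sum < "$dir/zeros.bin")
        if [ "$got" != "$want" ]; then
            echo "$mib"   # the bad file is left in place for inspection
            return 0
        fi
    done
    rm -f "$dir/zeros.bin"
    echo none
}
```

As with the other checks, a file smaller than the free RAM may be read back from the page cache, so dropping caches (or testing sizes well above the 3 GiB of RAM) makes the result more trustworthy. Swapping /dev/zero for a saved file of random data would also catch faults that all-zero data cannot reveal.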
