ZFS keeps repairing the same disk during scrubs - what does this mean?

I have a ZFS pool of 6 disks arranged as 3 striped mirrors. The server is a Supermicro X11SSM-F with a Xeon CPU and 32 GB of ECC RAM, running Ubuntu 17.04. I use 2 Icy Dock MB154SP-B enclosures to physically host the disks, and the motherboard has 8 SATA 3 connectors, so the disks are presented directly to ZFS (no RAID card in between).
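
A pool with this layout is typically created along the following lines, using the same by-id device names that appear in the status output below (the original creation command and its options are not shown, so treat this as a sketch):

$ sudo zpool create cloudpool \
    mirror /dev/disk/by-id/ata-ST8000VN0022-2EL112_ZA17FZXF /dev/disk/by-id/ata-ST8000VN0022-2EL112_ZA17H5D3 \
    mirror /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5NFLRU3 /dev/disk/by-id/ata-ST4000VN000-2AH166_WDH0KMHT \
    mirror /dev/disk/by-id/ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3EHHA2E /dev/disk/by-id/ata-ST3000DM001-1CH166_Z1F1HL4V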

Until recently, this setup worked fine. Then, while running zpool status, I noticed that the last scrub had repaired some data:

$ sudo zpool status
  pool: cloudpool
 state: ONLINE
  scan: scrub repaired 2.98M in 4h56m with 0 errors on Sun Jul  9 05:20:16 2017
config:

    NAME                                          STATE     READ WRITE CKSUM
    cloudpool                                     ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17FZXF          ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17H5D3          ONLINE       0     0     0
      mirror-1                                    ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5NFLRU3  ONLINE       0     0     0
        ata-ST4000VN000-2AH166_WDH0KMHT           ONLINE       0     0     0
      mirror-2                                    ONLINE       0     0     0
        ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3EHHA2E  ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F1HL4V           ONLINE       0     0     0

errors: No known data errors

Out of curiosity, I decided to start a new scrub:

$ sudo zpool scrub cloudpool
... giving it a few minutes to run ...

$ sudo zpool status
  pool: cloudpool
 state: ONLINE
  scan: scrub in progress since Tue Jul 11 22:55:12 2017
    124M scanned out of 4.52T at 4.59M/s, 286h55m to go
    256K repaired, 0.00% done
config:

    NAME                                          STATE     READ WRITE CKSUM
    cloudpool                                     ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17FZXF          ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17H5D3          ONLINE       0     0     0
      mirror-1                                    ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5NFLRU3  ONLINE       0     0     0  (repairing)
        ata-ST4000VN000-2AH166_WDH0KMHT           ONLINE       0     0     0
      mirror-2                                    ONLINE       0     0     0
        ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3EHHA2E  ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F1HL4V           ONLINE       0     0     0

errors: No known data errors

After letting it finish, I got the following:

$ sudo zpool status
  pool: cloudpool
 state: ONLINE
  scan: scrub repaired 624K in 4h35m with 0 errors on Wed Jul 12 03:31:00 2017
config:

    NAME                                          STATE     READ WRITE CKSUM
    cloudpool                                     ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17FZXF          ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17H5D3          ONLINE       0     0     0
      mirror-1                                    ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5NFLRU3  ONLINE       0     0     0
        ata-ST4000VN000-2AH166_WDH0KMHT           ONLINE       0     0     0
      mirror-2                                    ONLINE       0     0     0
        ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3EHHA2E  ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F1HL4V           ONLINE       0     0     0

errors: No known data errors

I then decided to scrub the pool once more. After it had been running for a while, I got this:

$ sudo zpool status
  pool: cloudpool
 state: ONLINE
  scan: scrub in progress since Wed Jul 12 09:55:19 2017
    941G scanned out of 4.52T at 282M/s, 3h42m to go
    112K repaired, 20.34% done
config:

    NAME                                          STATE     READ WRITE CKSUM
    cloudpool                                     ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17FZXF          ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17H5D3          ONLINE       0     0     0
      mirror-1                                    ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5NFLRU3  ONLINE       0     0     0  (repairing)
        ata-ST4000VN000-2AH166_WDH0KMHT           ONLINE       0     0     0
      mirror-2                                    ONLINE       0     0     0
        ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3EHHA2E  ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F1HL4V           ONLINE       0     0     0

errors: No known data errors

Looking at the SMART data for the disk, I don't see anything suspicious (except maybe Raw_Read_Error_Rate?):
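
The report below comes from smartctl; an invocation along these lines produces it (the exact device argument used is my assumption):

$ sudo smartctl -a /dev/disk/by-id/ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5NFLRU3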

smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.10.0-26-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Red
Device Model:     WDC WD40EFRX-68WT0N0
Serial Number:    WD-WCC4E5NFLRU3
LU WWN Device Id: 5 0014ee 262ee543f
Firmware Version: 82.00A82
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Jul 12 10:19:08 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (52020) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 520) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       99
  3 Spin_Up_Time            0x0027   186   176   021    Pre-fail  Always       -       7683
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       33
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6735
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       33
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       7500
194 Temperature_Celsius     0x0022   110   108   000    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%        13         -
# 2  Conveyance offline  Completed without error       00%         1         -
# 3  Short offline       Completed without error       00%         1         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

However, I do see some strange messages in the dmesg output:

[100240.777601] ata2.00: exception Emask 0x0 SAct 0x3000000 SErr 0x0 action 0x0
[100240.777608] ata2.00: irq_stat 0x40000008
[100240.777614] ata2.00: failed command: READ FPDMA QUEUED
[100240.777624] ata2.00: cmd 60/00:c0:c8:bc:01/01:00:00:00:00/40 tag 24 ncq dma 131072 in
                         res 41/40:00:a8:bd:01/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[100240.777628] ata2.00: status: { DRDY ERR }
[100240.777631] ata2.00: error: { UNC }
[100240.779320] ata2.00: configured for UDMA/133
[100240.779342] sd 1:0:0:0: [sdb] tag#24 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[100240.779346] sd 1:0:0:0: [sdb] tag#24 Sense Key : Medium Error [current] 
[100240.779350] sd 1:0:0:0: [sdb] tag#24 Add. Sense: Unrecovered read error - auto reallocate failed
[100240.779354] sd 1:0:0:0: [sdb] tag#24 CDB: Read(16) 88 00 00 00 00 00 00 01 bc c8 00 00 01 00 00 00
[100240.779357] blk_update_request: I/O error, dev sdb, sector 114088
[100240.779384] ata2: EH complete
[100244.165785] ata2.00: exception Emask 0x0 SAct 0x3d SErr 0x0 action 0x0
[100244.165793] ata2.00: irq_stat 0x40000008
[100244.165798] ata2.00: failed command: READ FPDMA QUEUED
[100244.165807] ata2.00: cmd 60/00:00:c8:be:01/01:00:00:00:00/40 tag 0 ncq dma 131072 in
                         res 41/40:00:70:bf:01/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[100244.165811] ata2.00: status: { DRDY ERR }
[100244.165814] ata2.00: error: { UNC }
[100244.167465] ata2.00: configured for UDMA/133
[100244.167488] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[100244.167492] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current] 
[100244.167496] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[100244.167500] sd 1:0:0:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 00 00 00 01 be c8 00 00 01 00 00 00
[100244.167503] blk_update_request: I/O error, dev sdb, sector 114544
[100244.167531] ata2: EH complete
[100248.177949] ata2.00: exception Emask 0x0 SAct 0x41c00002 SErr 0x0 action 0x0
[100248.177957] ata2.00: irq_stat 0x40000008
[100248.177963] ata2.00: failed command: READ FPDMA QUEUED
[100248.177972] ata2.00: cmd 60/00:f0:c8:c0:01/01:00:00:00:00/40 tag 30 ncq dma 131072 in
                         res 41/40:00:b8:c1:01/00:00:00:00:00/40 Emask 0x409 (media error) <F>
[100248.177977] ata2.00: status: { DRDY ERR }
[100248.177980] ata2.00: error: { UNC }
[100248.179638] ata2.00: configured for UDMA/133
[100248.179667] sd 1:0:0:0: [sdb] tag#30 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[100248.179671] sd 1:0:0:0: [sdb] tag#30 Sense Key : Medium Error [current] 
[100248.179675] sd 1:0:0:0: [sdb] tag#30 Add. Sense: Unrecovered read error - auto reallocate failed
[100248.179679] sd 1:0:0:0: [sdb] tag#30 CDB: Read(16) 88 00 00 00 00 00 00 01 c0 c8 00 00 01 00 00 00
[100248.179682] blk_update_request: I/O error, dev sdb, sector 115128
[100248.179705] ata2: EH complete
...

Grepping through dmesg, I see 31 instances like this in the log:

[100240.779357] blk_update_request: I/O error, dev sdb, sector 114088
[100244.167503] blk_update_request: I/O error, dev sdb, sector 114544
[100248.179682] blk_update_request: I/O error, dev sdb, sector 115128
[100251.599649] blk_update_request: I/O error, dev sdb, sector 115272
[100255.812020] blk_update_request: I/O error, dev sdb, sector 115576
[100259.636088] blk_update_request: I/O error, dev sdb, sector 115768
[100263.400169] blk_update_request: I/O error, dev sdb, sector 116000
[100267.912099] blk_update_request: I/O error, dev sdb, sector 116472
[100271.300223] blk_update_request: I/O error, dev sdb, sector 116680
[100274.732989] blk_update_request: I/O error, dev sdb, sector 117000
[100279.665331] blk_update_request: I/O error, dev sdb, sector 118624
[100283.043738] blk_update_request: I/O error, dev sdb, sector 118768
[100286.456260] blk_update_request: I/O error, dev sdb, sector 119072
[100293.472354] blk_update_request: I/O error, dev sdb, sector 7814018576
[100298.443416] blk_update_request: I/O error, dev sdb, sector 119496
[100302.236908] blk_update_request: I/O error, dev sdb, sector 119968
[100305.655675] blk_update_request: I/O error, dev sdb, sector 120032
[100309.450754] blk_update_request: I/O error, dev sdb, sector 120496
[100313.724792] blk_update_request: I/O error, dev sdb, sector 121512
[100324.782008] blk_update_request: I/O error, dev sdb, sector 186032
[100329.002031] blk_update_request: I/O error, dev sdb, sector 189536
[100333.057101] blk_update_request: I/O error, dev sdb, sector 189680
[100336.476953] blk_update_request: I/O error, dev sdb, sector 189888
[100341.133527] blk_update_request: I/O error, dev sdb, sector 190408
[100349.890540] blk_update_request: I/O error, dev sdb, sector 191824
[353944.190625] blk_update_request: I/O error, dev sdb, sector 115480
[353951.660635] blk_update_request: I/O error, dev sdb, sector 116536
[353959.391011] blk_update_request: I/O error, dev sdb, sector 118976
[353966.811863] blk_update_request: I/O error, dev sdb, sector 120176
[353978.447354] blk_update_request: I/O error, dev sdb, sector 189984
[393732.681767] blk_update_request: I/O error, dev sdb, sector 190000
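
For reference, a filter along these lines extracts that list (the exact invocation is an assumption, not from the original session):

$ dmesg | grep 'blk_update_request: I/O error, dev sdb'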

I'm not quite sure what to make of all this:

  • Why does the scrub keep repairing data on the same disk? The amount of repaired data is shrinking, which suggests the repairs are "durable", but why is there still data left to repair when the scrubs are only a few hours apart?
  • Why don't I see any sign of read/write/checksum errors in zpool status, even though ZFS finds some data to correct on every scrub?
  • Why do I see UREs on the disk, yet nothing suspicious in the SMART report?
  • What does auto reallocate failed mean? Has the disk run out of spare blocks? This is a system that had been running smoothly for about 6 months, so I would expect any problem with this disk to be fairly recent.

More practically, what does this mean for this particular disk? Does it need to be replaced?
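
Should replacement turn out to be necessary, swapping the suspect disk out of its mirror is a single command once a new drive is installed (the new device name here is a placeholder):

$ sudo zpool replace cloudpool ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5NFLRU3 /dev/disk/by-id/ata-NEW_DISK_ID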

Edit #1: After the most recent scrub, I now get the following:

$ sudo zpool status
  pool: cloudpool
 state: ONLINE
  scan: scrub repaired 0 in 4h35m with 0 errors on Wed Jul 12 21:44:41 2017
config:

    NAME                                          STATE     READ WRITE CKSUM
    cloudpool                                     ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17FZXF          ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17H5D3          ONLINE       0     0     0
      mirror-1                                    ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5NFLRU3  ONLINE       0     0     0
        ata-ST4000VN000-2AH166_WDH0KMHT           ONLINE       0     0     0
      mirror-2                                    ONLINE       0     0     0
        ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3EHHA2E  ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F1HL4V           ONLINE       0     0     0

errors: No known data errors

smartctl now says:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       122
  3 Spin_Up_Time            0x0027   186   176   021    Pre-fail  Always       -       7683
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       33
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   091   091   000    Old_age   Always       -       6749
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       33
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       5
193 Load_Cycle_Count        0x0032   198   198   000    Old_age   Always       -       7507
194 Temperature_Celsius     0x0022   114   108   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

So... am I fine? What was actually going on there?

Answer 1

For posterity: when you see errors like

[100244.167488] sd 1:0:0:0: [sdb] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[100244.167492] sd 1:0:0:0: [sdb] tag#0 Sense Key : Medium Error [current] 
[100244.167496] sd 1:0:0:0: [sdb] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
[100244.167500] sd 1:0:0:0: [sdb] tag#0 CDB: Read(16) 88 00 00 00 00 00 00 01 be c8 00 00 01 00 00 00
[100244.167503] blk_update_request: I/O error, dev sdb, sector 114544

it means that there is some kind of defect on the physical disk surface.

More specifically, the message Sense: Unrecovered read error - auto reallocate failed means that the disk encountered an unrecoverable read error. But what is an unrecoverable read error?

Disks read data in units of sectors, and each sector carries its own dedicated ECC. When ECC errors exceed a certain threshold, the disk firmware automatically remaps the sector that was just read, transparently to the user. No kernel error is logged in this case, and the only way to observe the behavior is through the SMART attributes.
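
A quick way to watch for this kind of silent remapping is to query the relevant attributes directly, for example (device name assumed):

$ sudo smartctl -A /dev/sdb | grep -E 'Reallocated|Current_Pending|Offline_Uncorrectable'

A rising Reallocated_Sector_Ct with no accompanying kernel errors is the signature of this transparent, firmware-level remapping.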

However, if a sector cannot be read at all (probably because so many errors accumulated that the ECC can no longer recover the original data), the kernel logs the error message Unrecovered read error - auto reallocate failed. If this happens on a single-disk (or RAID0) system, the data is really lost and can only be retrieved from a backup.
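
To check whether a specific sector from such a log is still unreadable, hdparm can issue a raw read of that LBA, bypassing the page cache (a sketch; the sector number is taken from the dmesg output above):

$ sudo hdparm --read-sector 114544 /dev/sdb

An I/O error here means the sector is still failing; a successful 512-byte dump means it has since been recovered or remapped.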

When a RAID level with redundancy is used (RAID1/5/6), the system can repair the bad sector by overwriting it: the disk remaps the failing sector to one of its spare sectors, which is immediately overwritten with a good copy of the data fetched from the other disks. If no spare sectors are available, the kernel logs a failed command: WRITE FPDMA QUEUED and you should replace the disk as soon as possible.
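
On a system without redundancy, the same overwrite-triggered remap can be forced by hand: writing to the failed LBA makes the firmware substitute a spare sector, at the cost of destroying the data at that address (destructive; shown only to illustrate the mechanism):

$ sudo hdparm --write-sector 114544 --yes-i-know-what-i-am-doing /dev/sdb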

The specific case in the post shows a mirror setup, which means ZFS was able to repair/remap the failing sectors. If the disk shows only a small number of bad and remapped sectors, with no significant growth, it can remain in service for a long time. On the other hand, if such errors keep appearing in the kernel log, you should plan to replace the failing disk as soon as possible.
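
If a disk in this state is kept in service, regular scrubs plus periodic snapshots of the SMART counters make the "no significant growth" condition something you can actually verify. A minimal sketch (schedule and paths are my own choices):

# /etc/cron.d/cloudpool-health: scrub monthly, then log the remap counters a week later
0 2 1 * * root /sbin/zpool scrub cloudpool
0 2 8 * * root /usr/sbin/smartctl -A /dev/sdb >> /var/log/sdb-smart.log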
