Why isn't ZFS on Linux reporting a lost drive?

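I noticed that one of my personal file servers had slowed to a crawl. On closer inspection, one of the drives in the ZFS pool had stopped responding, yet I can find no indication of that in any of the ZFS statistics. This is what I see: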

root@grandidier:/var/log# zpool status -v
  pool: tank
 state: ONLINE
  scan: scrub repaired 348K in 26h40m with 0 errors on Mon Mar 12 04:04:43 2018
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz2-0                  ONLINE       0     0     0
            wwn-0x50014ee655857734  ONLINE       0     0     0
            wwn-0x50014ee2052f74a0  ONLINE       0     0     0
            wwn-0x50014ee2056320c0  ONLINE       0     0     0
            wwn-0x50014ee25b714e7c  ONLINE       0     0     0
            wwn-0x50014ee2afc04a72  ONLINE       0     0     0
            wwn-0x50014ee2afdae114  ONLINE       0     0     0

errors: No known data errors
root@grandidier:/var/log#

However, if I try to examine the problem drive, I get:

root@grandidier:/var/log# smartctl -a /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.13.0-37-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

Smartctl open device: /dev/sdb [SAT] failed: No such device or address
root@grandidier:/var/log# ls -l  /dev/sdb
brw-rw---- 1 root disk 8, 16 Mar 26 14:54 /dev/sdb
root@grandidier:/var/log#
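
The mismatch above (pool members all ONLINE, device node present, but the open fails) could be checked automatically with something like the following. This is only a rough sketch: the function names and the `by_id_dir` parameter are made up for illustration, it assumes the pool members are listed by their `wwn-...` names under /dev/disk/by-id as in my output, and it would need to run as root to open the devices.

```python
import os


def parse_vdevs(status_text):
    """Return (name, state) pairs from the config table of `zpool status` output."""
    vdevs = []
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True          # table rows start after this header
            continue
        if in_config:
            if len(fields) < 2:       # a blank line ends the table
                break
            vdevs.append((fields[0], fields[1]))
    return vdevs


def can_open(path):
    """True if the device node can actually be opened for reading."""
    try:
        fd = os.open(path, os.O_RDONLY)
    except OSError:                   # e.g. "No such device or address"
        return False
    os.close(fd)
    return True


def unreachable_members(status_text, by_id_dir="/dev/disk/by-id"):
    """Names of wwn-* members that ZFS calls ONLINE but that cannot be opened."""
    return [
        name
        for name, state in parse_vdevs(status_text)
        if name.startswith("wwn-")
        and state == "ONLINE"
        and not can_open(os.path.join(by_id_dir, name))
    ]
```

Feeding it the text of `zpool status -v` (for example captured with `subprocess.run`) would list any member that ZFS considers ONLINE but that the kernel can no longer open.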

More from /var/log/syslog (these are consecutive entries, not filtered for the entries of interest):

Mar 27 09:24:18 grandidier kernel: [68384.375607] sd 8:0:0:0: [sdb] tag#5 uas_eh_abort_handler 0 uas-tag 6 inflight: CMD OUT 
Mar 27 09:24:18 grandidier kernel: [68384.375618] sd 8:0:0:0: [sdb] tag#5 CDB: Write(10) 2a 00 1f c5 c2 d8 00 02 30 00
Mar 27 09:24:18 grandidier kernel: [68384.375887] sd 8:0:0:0: [sdb] tag#4 uas_eh_abort_handler 0 uas-tag 5 inflight: CMD OUT 
Mar 27 09:24:18 grandidier kernel: [68384.375897] sd 8:0:0:0: [sdb] tag#4 CDB: Write(10) 2a 00 1f c5 bd 68 00 01 b0 00
Mar 27 09:24:18 grandidier kernel: [68384.376082] sd 8:0:0:0: [sdb] tag#2 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT 
Mar 27 09:24:18 grandidier kernel: [68384.376088] sd 8:0:0:0: [sdb] tag#2 CDB: Write(10) 2a 00 1f c5 bf 18 00 03 c0 00
Mar 27 09:24:18 grandidier kernel: [68384.378207] sd 8:0:0:0: [sdb] tag#1 uas_eh_abort_handler 0 uas-tag 2 inflight: CMD OUT 
Mar 27 09:24:18 grandidier kernel: [68384.378215] sd 8:0:0:0: [sdb] tag#1 CDB: Write(10) 2a 00 1f c5 bb c0 00 01 a8 00
Mar 27 09:24:18 grandidier kernel: [68384.378330] sd 8:0:0:0: [sdb] tag#3 uas_eh_abort_handler 0 uas-tag 4 inflight: CMD OUT 
Mar 27 09:24:18 grandidier kernel: [68384.378336] sd 8:0:0:0: [sdb] tag#3 CDB: Write(10) 2a 00 1f c5 ba d0 00 00 e8 00
Mar 27 09:24:18 grandidier kernel: [68384.380190] sd 8:0:0:0: [sdb] tag#0 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD OUT 
Mar 27 09:24:18 grandidier kernel: [68384.380200] sd 8:0:0:0: [sdb] tag#0 CDB: Write(10) 2a 00 1f c5 b9 68 00 01 68 00
Mar 27 09:24:18 grandidier kernel: [68384.382231] scsi host8: uas_eh_bus_reset_handler start
Mar 27 09:24:18 grandidier kernel: [68384.512718] usb 9-2: reset SuperSpeed USB device number 3 using xhci_hcd
Mar 27 09:24:18 grandidier kernel: [68384.537848] scsi host8: uas_eh_bus_reset_handler success
Mar 27 09:25:01 grandidier CRON[23432]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Mar 27 09:25:32 grandidier smartd[2263]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 203 to 196
Mar 27 09:25:33 grandidier smartd[2263]: Device: /dev/sdb [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 67 to 65
Mar 27 09:25:33 grandidier smartd[2263]: Device: /dev/sdd [SAT], FAILED SMART self-check. BACK UP DATA NOW!
Mar 27 09:25:33 grandidier smartd[2263]: Device: /dev/sdd [SAT], 38 Currently unreadable (pending) sectors
Mar 27 09:25:33 grandidier smartd[2263]: Device: /dev/sdd [SAT], 1 Offline uncorrectable sectors
Mar 27 09:25:33 grandidier smartd[2263]: Device: /dev/sdd [SAT], Failed SMART usage Attribute: 5 Reallocated_Sector_Ct.
Mar 27 09:25:33 grandidier smartd[2263]: Device: /dev/sdd [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 117 to 118
Mar 27 09:25:33 grandidier smartd[2263]: Device: /dev/sdg [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 118 to 117
Mar 27 09:25:33 grandidier smartd[2263]: Device: /dev/sdh [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 119 to 118
Mar 27 09:32:48 grandidier kernel: [68894.329217] sd 8:0:0:0: [sdb] tag#2 uas_eh_abort_handler 0 uas-tag 7 inflight: CMD OUT 
Mar 27 09:32:48 grandidier kernel: [68894.329228] sd 8:0:0:0: [sdb] tag#2 CDB: Write(10) 2a 00 1f c7 f2 b8 00 01 c0 00

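It appears that the drive started having problems about 24 hours ago. I'm surprised that /dev/sdd, which looks like it is on its way out, is not the drive that dropped.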
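
To see how often this has been happening and to which device, the syslog could be summarized with a short script. This is a sketch only: the regular expressions are based purely on the message format in the excerpt above, and `summarize_uas_errors` is a name I made up.

```python
import re
from collections import Counter

# Patterns taken from the kernel messages shown in the syslog excerpt.
ABORT_RE = re.compile(r"\[(sd[a-z]+)\] tag#\d+ uas_eh_abort_handler")
RESET_RE = re.compile(r"(scsi host\d+): uas_eh_bus_reset_handler start")


def summarize_uas_errors(lines):
    """Count UAS command aborts per disk and bus resets per SCSI host."""
    aborts = Counter()
    resets = Counter()
    for line in lines:
        match = ABORT_RE.search(line)
        if match:
            aborts[match.group(1)] += 1
            continue
        match = RESET_RE.search(line)
        if match:
            resets[match.group(1)] += 1
    return aborts, resets
```

Passing it an open file object for /var/log/syslog (a file iterates line by line) gives two counters, one keyed by disk name such as `sdb` and one keyed by host such as `scsi host8`.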

I'm also skeptical of the temperatures reported, since the rest of the drives are now between 28 and 32°C.

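At the moment I'm trying to reboot the system and waiting for it to shut down, but it seems to have hung. Looks like it's time for the big red switch.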

The OS is Ubuntu 16.04 and the ZFS version appears to be 6.5.

After the reboot, all drives came back online and ZFS still indicates no problems. Details from the drive that dropped:

root@grandidier:~# smartctl -a /dev/sdb
smartctl 6.5 2016-01-24 r4214 [x86_64-linux-4.13.0-37-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD2003FYPS-27Y2B0
Serial Number:    WD-WCAVY6148882
LU WWN Device Id: 5 0014ee 2afdae114
Firmware Version: 04.05G11
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    5400 rpm
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Wed Mar 28 08:14:25 2018 CDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

...

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       2
  3 Spin_Up_Time            0x0027   233   233   021    Pre-fail  Always       -       10333
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       160
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   034   034   000    Old_age   Always       -       48313
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       149
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       123
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       36
194 Temperature_Celsius     0x0022   122   108   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       1

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     44996         -

hbarta@yggdrasil:~/Documents/Computer/grandidier$ 

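I found a similar post at "Check ZFS pool for faulted drives". In my case there is definite activity involving the drive, including nightly backups to this server and copying files from one filesystem to another (both in the same pool).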

Thanks!