ZFS on a Linux server occasionally stalls and then recovers


I have a problem that is proving very hard to debug. While running ZFS, the system "glitches", dumps some messages into dmesg, and then carries on working.

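My ZFS box hosts virtual machines for ESXi. When the problem occurs, many of the VMs get block I/O errors, and some of them go read-only and need either a restore from backup or an fsck to repair the filesystem. The problem only happens occasionally. I have hammered the system trying to stress-test it, and it does not appear to be performance related. It only occurs once every few months, so tracking it down for good seems like a pipe dream to me.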

First, some information about my system (CentOS 7, kernel 4.5).

[root@zfs-head ~]# uname -a

Linux zfs-head 4.5.0-1.el7.elrepo.x86_64 #1 SMP Mon Mar 14 10:24:58 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux

The dmesg entries:

[4331253.022999] sd 2:0:28:0: [sdaa] tag#2 CDB: Read(10) 28 00 10 a8 3d b5 00 00 20 00
[4331253.023006] mpt3sas_cm0:   sas_address(0x5000c500837f31f2), phy(8)
[4331253.023008] mpt3sas_cm0:   enclosure_logical_id(0x50010c60004d41ff),slot(0)
[4331253.023010] mpt3sas_cm0:   enclosure level(0x0003), connector name(     )
[4331253.023013] mpt3sas_cm0:   handle(0x002d), ioc_status(scsi data underrun)(0x0045), smid(222)
[4331253.023016] mpt3sas_cm0:   request_len(131072), underflow(16384), resid(131072)
[4331253.023018] mpt3sas_cm0:   tag(0), transfer_count(0), sc->result(0x00000000)
[4331253.023020] mpt3sas_cm0:   scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[4331253.023023] mpt3sas_cm0:   [sense_key,asc,ascq]: [0x06,0x2a,0x01], count(96)
[4331253.023030] sd 2:0:28:0: Mode parameters changed
[4331266.475222] sd 2:0:29:0: [sdab] tag#29 CDB: Write(10) 2a 00 09 97 6e c1 00 00 02 00
[4331266.475229] mpt3sas_cm0:   sas_address(0x5000c500837f25c6), phy(9)
[4331266.475232] mpt3sas_cm0:   enclosure_logical_id(0x50010c60004d41ff),slot(1)
[4331266.475234] mpt3sas_cm0:   enclosure level(0x0003), connector name(     )
[4331266.475237] mpt3sas_cm0:   handle(0x002e), ioc_status(scsi data underrun)(0x0045), smid(139)
[4331266.475239] mpt3sas_cm0:   request_len(8192), underflow(1024), resid(8192)
[4331266.475241] mpt3sas_cm0:   tag(0), transfer_count(0), sc->result(0x00000000)
[4331266.475244] mpt3sas_cm0:   scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[4331266.475246] mpt3sas_cm0:   [sense_key,asc,ascq]: [0x06,0x2a,0x01], count(96)
[4331266.475252] sd 2:0:29:0: Mode parameters changed
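
For readers unfamiliar with the `[sense_key,asc,ascq]` triple in the log above, here is a minimal sketch of how it can be decoded. It assumes the standard SCSI sense key and ASC/ASCQ assignments; the lookup tables are deliberately partial (only a few well-known entries), and the function name `decode_sense` is made up for this illustration.

```python
import re

# Partial lookup tables based on the standard SCSI sense key and ASC/ASCQ
# assignments. Only a handful of entries are listed; anything else is
# reported as unknown rather than guessed.
SENSE_KEYS = {
    0x02: "NOT READY",
    0x03: "MEDIUM ERROR",
    0x04: "HARDWARE ERROR",
    0x05: "ILLEGAL REQUEST",
    0x06: "UNIT ATTENTION",
}
ASC_ASCQ = {
    (0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED",
    (0x2A, 0x01): "MODE PARAMETERS CHANGED",
}

# Matches the mpt3sas line format shown in the dmesg excerpt.
SENSE_RE = re.compile(
    r"\[sense_key,asc,ascq\]: "
    r"\[0x([0-9a-fA-F]+),0x([0-9a-fA-F]+),0x([0-9a-fA-F]+)\]"
)


def decode_sense(line):
    """Return (sense key text, ASC/ASCQ text) for an mpt3sas sense line, or None."""
    match = SENSE_RE.search(line)
    if match is None:
        return None
    key, asc, ascq = (int(group, 16) for group in match.groups())
    key_text = SENSE_KEYS.get(key, "unknown sense key 0x%02x" % key)
    asc_text = ASC_ASCQ.get((asc, ascq), "unknown ASC/ASCQ 0x%02x/0x%02x" % (asc, ascq))
    return key_text, asc_text


line = "[4331253.023023] mpt3sas_cm0:   [sense_key,asc,ascq]: [0x06,0x2a,0x01], count(96)"
print(decode_sense(line))
```

The decoded text agrees with the kernel's own follow-up line, "Mode parameters changed", which suggests both events are unit-attention notifications rather than medium or hardware errors.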

Pool status:

[root@zfs-head ~]# zpool status
  pool: storage
 state: ONLINE
  scan: none requested
config:

    NAME                                             STATE     READ WRITE CKSUM
    storage                                          ONLINE       0     0     0
      mirror-0                                       ONLINE       0     0     0
        s1d1                                         ONLINE       0     0     0
        s2d1                                         ONLINE       0     0     0
      mirror-1                                       ONLINE       0     0     0
        s3d1                                         ONLINE       0     0     0
        s4d1                                         ONLINE       0     0     0
      mirror-2                                       ONLINE       0     0     0
        s1d2                                         ONLINE       0     0     0
        s2d2                                         ONLINE       0     0     0
      mirror-3                                       ONLINE       0     0     0
        s3d2                                         ONLINE       0     0     0
        s4d2                                         ONLINE       0     0     0
      mirror-4                                       ONLINE       0     0     0
        s1d3                                         ONLINE       0     0     0
        s2d3                                         ONLINE       0     0     0
      mirror-5                                       ONLINE       0     0     0
        s3d3                                         ONLINE       0     0     0
        s4d3                                         ONLINE       0     0     0
    logs
      ata-Samsung_SSD_850_PRO_128GB_S24ZNXAGA10768M  ONLINE       0     0     0
    cache
      ata-Samsung_SSD_850_EVO_250GB_S21NNXAG918721R  ONLINE       0     0     0
      ata-Samsung_SSD_850_EVO_250GB_S21NNXAGA59337A  ONLINE       0     0     0
      ata-Samsung_SSD_850_EVO_250GB_S21NNXAGA69590F  ONLINE       0     0     0

errors: No known data errors
[root@zfs-head ~]# 

My vdev map:

[root@zfs-head ~]# cat /etc/zfs/vdev_id.conf
#     by-vdev
#     name     fully qualified or base name of device link
alias s1d1       /dev/disk/by-id/scsi-35000c500837ff247
alias s1d2       /dev/disk/by-id/scsi-35000c500837f15c3
alias s1d3       /dev/disk/by-id/scsi-35000c500837f137f
alias s2d1       /dev/disk/by-id/scsi-35000c500837f377b
alias s2d2       /dev/disk/by-id/scsi-35000c500837f5bf7
alias s2d3       /dev/disk/by-id/scsi-35000c500837f75bf
alias s3d1       /dev/disk/by-id/scsi-35000c500837f14d3
alias s3d2       /dev/disk/by-id/scsi-35000c500837f571b
alias s3d3       /dev/disk/by-id/scsi-35000c500837f604f
alias s4d1       /dev/disk/by-id/scsi-35000c500837f31f3
alias s4d2       /dev/disk/by-id/scsi-35000c500837f25c7
alias s4d3       /dev/disk/by-id/scsi-35000c500837f14cf

[root@zfs-head ~]# 
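
To tie the two dmesg events to pool members, the `sas_address` values can be compared against the WWNs in `vdev_id.conf`. The sketch below rests on an assumption that the logs do not confirm: that a drive's SAS port address and the WWN embedded in its by-id name differ only in the last hex digit, so a match on the first 15 hex digits identifies the disk. The helper names `load_aliases` and `alias_for_sas_address` are hypothetical.

```python
import re

# Alias lines copied from the vdev_id.conf shown above.
VDEV_ID_CONF = """\
alias s1d1       /dev/disk/by-id/scsi-35000c500837ff247
alias s1d2       /dev/disk/by-id/scsi-35000c500837f15c3
alias s1d3       /dev/disk/by-id/scsi-35000c500837f137f
alias s2d1       /dev/disk/by-id/scsi-35000c500837f377b
alias s2d2       /dev/disk/by-id/scsi-35000c500837f5bf7
alias s2d3       /dev/disk/by-id/scsi-35000c500837f75bf
alias s3d1       /dev/disk/by-id/scsi-35000c500837f14d3
alias s3d2       /dev/disk/by-id/scsi-35000c500837f571b
alias s3d3       /dev/disk/by-id/scsi-35000c500837f604f
alias s4d1       /dev/disk/by-id/scsi-35000c500837f31f3
alias s4d2       /dev/disk/by-id/scsi-35000c500837f25c7
alias s4d3       /dev/disk/by-id/scsi-35000c500837f14cf
"""

# The by-id name carries a leading "3" in front of the 16-digit WWN.
ALIAS_RE = re.compile(r"alias\s+(\S+)\s+/dev/disk/by-id/scsi-3([0-9a-f]{16})$")


def load_aliases(conf_text):
    """Return {16-digit WWN: alias} for every alias line in the config text."""
    aliases = {}
    for raw_line in conf_text.splitlines():
        match = ALIAS_RE.match(raw_line.strip())
        if match:
            aliases[match.group(2)] = match.group(1)
    return aliases


def alias_for_sas_address(sas_address, aliases):
    """Match on the first 15 hex digits (ASSUMPTION: only the last digit differs)."""
    addr = sas_address.lower()
    if addr.startswith("0x"):
        addr = addr[2:]
    matches = [name for wwn, name in aliases.items() if wwn[:15] == addr[:15]]
    # Refuse to answer when the prefix is ambiguous or absent.
    return matches[0] if len(matches) == 1 else None


aliases = load_aliases(VDEV_ID_CONF)
for sas_address in ("0x5000c500837f31f2", "0x5000c500837f25c6"):
    print(sas_address, "->", alias_for_sas_address(sas_address, aliases))
```

If the prefix assumption holds, the two addresses in the dmesg excerpt resolve to `s4d1` and `s4d2`. Comparing against the output of `ls -l /dev/disk/by-id` on the host would be a more reliable check.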

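The box did not reboot and did not even seem to notice that anything was wrong, apart from the dmesg entries. I have Googled these entries as best I can, but have not found anything relevant.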

Thanks for the help!
