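I'm having an issue that is very hard to debug. While running ZFS, the system has a "glitch", dumps some information into dmesg, and then carries on working.
My ZFS box hosts virtual machines for ESXi. When this issue occurs, many of the VMs see block I/O errors, and some of them go read-only and need a restore from backup or an fsck to repair the filesystem. The issue only happens occasionally. I have hammered the system trying to stress test it, and it does not appear to be related to load. It happens once every few months, so nailing it down for good feels like a pipe dream to me.
First, some information about my system (CentOS 7, kernel 4.5).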
[root@zfs-head ~]# uname -a
Linux zfs-head 4.5.0-1.el7.elrepo.x86_64 #1 SMP Mon Mar 14 10:24:58 EDT 2016 x86_64 x86_64 x86_64 GNU/Linux
dmesg entries:
[4331253.022999] sd 2:0:28:0: [sdaa] tag#2 CDB: Read(10) 28 00 10 a8 3d b5 00 00 20 00
[4331253.023006] mpt3sas_cm0: sas_address(0x5000c500837f31f2), phy(8)
[4331253.023008] mpt3sas_cm0: enclosure_logical_id(0x50010c60004d41ff),slot(0)
[4331253.023010] mpt3sas_cm0: enclosure level(0x0003), connector name( )
[4331253.023013] mpt3sas_cm0: handle(0x002d), ioc_status(scsi data underrun)(0x0045), smid(222)
[4331253.023016] mpt3sas_cm0: request_len(131072), underflow(16384), resid(131072)
[4331253.023018] mpt3sas_cm0: tag(0), transfer_count(0), sc->result(0x00000000)
[4331253.023020] mpt3sas_cm0: scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[4331253.023023] mpt3sas_cm0: [sense_key,asc,ascq]: [0x06,0x2a,0x01], count(96)
[4331253.023030] sd 2:0:28:0: Mode parameters changed
[4331266.475222] sd 2:0:29:0: [sdab] tag#29 CDB: Write(10) 2a 00 09 97 6e c1 00 00 02 00
[4331266.475229] mpt3sas_cm0: sas_address(0x5000c500837f25c6), phy(9)
[4331266.475232] mpt3sas_cm0: enclosure_logical_id(0x50010c60004d41ff),slot(1)
[4331266.475234] mpt3sas_cm0: enclosure level(0x0003), connector name( )
[4331266.475237] mpt3sas_cm0: handle(0x002e), ioc_status(scsi data underrun)(0x0045), smid(139)
[4331266.475239] mpt3sas_cm0: request_len(8192), underflow(1024), resid(8192)
[4331266.475241] mpt3sas_cm0: tag(0), transfer_count(0), sc->result(0x00000000)
[4331266.475244] mpt3sas_cm0: scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
[4331266.475246] mpt3sas_cm0: [sense_key,asc,ascq]: [0x06,0x2a,0x01], count(96)
[4331266.475252] sd 2:0:29:0: Mode parameters changed
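For reference, the sense triple and the CDB bytes in these entries can be decoded with a short script. The following is a minimal Python sketch, not a vetted tool: the function names are made up for illustration, and the lookup tables hold only a handful of codes from the SCSI sense key and ASC/ASCQ assignments. Sense key 0x06 is UNIT ATTENTION and ASC/ASCQ 0x2a/0x01 is MODE PARAMETERS CHANGED, which agrees with the kernel's own "Mode parameters changed" line.

```python
import re

# Partial lookup tables (a few SCSI sense keys and ASC/ASCQ pairs only);
# any code not listed here is reported as "unknown".
SENSE_KEYS = {
    0x03: "MEDIUM ERROR",
    0x04: "HARDWARE ERROR",
    0x06: "UNIT ATTENTION",
    0x0B: "ABORTED COMMAND",
}
ASC_ASCQ = {
    (0x29, 0x00): "POWER ON, RESET, OR BUS DEVICE RESET OCCURRED",
    (0x2A, 0x01): "MODE PARAMETERS CHANGED",
}

def decode_sense(line):
    """Decode the [sense_key,asc,ascq] triple from an mpt3sas dmesg line."""
    m = re.search(
        r"\[sense_key,asc,ascq\]: \[0x([0-9a-f]+),0x([0-9a-f]+),0x([0-9a-f]+)\]",
        line,
    )
    if m is None:
        return None
    key, asc, ascq = (int(g, 16) for g in m.groups())
    return SENSE_KEYS.get(key, "unknown"), ASC_ASCQ.get((asc, ascq), "unknown")

def decode_cdb10(line):
    """Return (opcode name, LBA, block count) from a 10-byte CDB dmesg line."""
    m = re.search(r"CDB: (\w+)\(10\) ((?:[0-9a-f]{2} ){9}[0-9a-f]{2})", line)
    if m is None:
        return None
    cdb = bytes.fromhex(m.group(2))
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7..8: transfer length in blocks
    return m.group(1), lba, blocks

# Two lines copied from the dmesg output above.
read_line = "[4331253.022999] sd 2:0:28:0: [sdaa] tag#2 CDB: Read(10) 28 00 10 a8 3d b5 00 00 20 00"
sense_line = "[4331253.023023] mpt3sas_cm0: [sense_key,asc,ascq]: [0x06,0x2a,0x01], count(96)"

op, lba, blocks = decode_cdb10(read_line)
print(op, hex(lba), blocks)      # Read 0x10a83db5 32
print(decode_sense(sense_line))  # ('UNIT ATTENTION', 'MODE PARAMETERS CHANGED')
```

Dividing request_len by the block count gives 131072 / 32 = 4096 for the read and 8192 / 2 = 4096 for the write, so both commands appear to address 4 KiB logical blocks. In both cases transfer_count is 0 and resid equals request_len, meaning no data was transferred.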
Pool status:
[root@zfs-head ~]# zpool status
  pool: storage
 state: ONLINE
  scan: none requested
config:

    NAME                                             STATE     READ WRITE CKSUM
    storage                                          ONLINE       0     0     0
      mirror-0                                       ONLINE       0     0     0
        s1d1                                         ONLINE       0     0     0
        s2d1                                         ONLINE       0     0     0
      mirror-1                                       ONLINE       0     0     0
        s3d1                                         ONLINE       0     0     0
        s4d1                                         ONLINE       0     0     0
      mirror-2                                       ONLINE       0     0     0
        s1d2                                         ONLINE       0     0     0
        s2d2                                         ONLINE       0     0     0
      mirror-3                                       ONLINE       0     0     0
        s3d2                                         ONLINE       0     0     0
        s4d2                                         ONLINE       0     0     0
      mirror-4                                       ONLINE       0     0     0
        s1d3                                         ONLINE       0     0     0
        s2d3                                         ONLINE       0     0     0
      mirror-5                                       ONLINE       0     0     0
        s3d3                                         ONLINE       0     0     0
        s4d3                                         ONLINE       0     0     0
    logs
      ata-Samsung_SSD_850_PRO_128GB_S24ZNXAGA10768M  ONLINE       0     0     0
    cache
      ata-Samsung_SSD_850_EVO_250GB_S21NNXAG918721R  ONLINE       0     0     0
      ata-Samsung_SSD_850_EVO_250GB_S21NNXAGA59337A  ONLINE       0     0     0
      ata-Samsung_SSD_850_EVO_250GB_S21NNXAGA69590F  ONLINE       0     0     0

errors: No known data errors
[root@zfs-head ~]#
My vdev mapping:
[root@zfs-head ~]# cat /etc/zfs/vdev_id.conf
# by-vdev
# name fully qualified or base name of device link
alias s1d1 /dev/disk/by-id/scsi-35000c500837ff247
alias s1d2 /dev/disk/by-id/scsi-35000c500837f15c3
alias s1d3 /dev/disk/by-id/scsi-35000c500837f137f
alias s2d1 /dev/disk/by-id/scsi-35000c500837f377b
alias s2d2 /dev/disk/by-id/scsi-35000c500837f5bf7
alias s2d3 /dev/disk/by-id/scsi-35000c500837f75bf
alias s3d1 /dev/disk/by-id/scsi-35000c500837f14d3
alias s3d2 /dev/disk/by-id/scsi-35000c500837f571b
alias s3d3 /dev/disk/by-id/scsi-35000c500837f604f
alias s4d1 /dev/disk/by-id/scsi-35000c500837f31f3
alias s4d2 /dev/disk/by-id/scsi-35000c500837f25c7
alias s4d3 /dev/disk/by-id/scsi-35000c500837f14cf
[root@zfs-head ~]#
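To relate the sas_address values in the dmesg entries to the aliases above, something like the following Python sketch may help. It rests on an assumption that should be confirmed against the hardware: that a drive's SAS port address and the WWN in its /dev/disk/by-id/scsi-3... name differ only in the lowest two bits. Some drives follow that convention, but it is not guaranteed. The function names and the `mask_bits` parameter are hypothetical.

```python
import re

def parse_aliases(conf_text):
    """Return {wwn_as_int: alias} for 'alias NAME .../scsi-3<16 hex digits>' lines."""
    aliases = {}
    for line in conf_text.splitlines():
        m = re.match(r"\s*alias\s+(\S+)\s+\S*/scsi-3([0-9a-fA-F]{16})\s*$", line)
        if m:
            aliases[int(m.group(2), 16)] = m.group(1)
    return aliases

def aliases_for_sas_address(sas_address, aliases, mask_bits=2):
    """List aliases whose WWN equals sas_address once the low bits are ignored.

    ASSUMPTION: port address and logical unit WWN share all but the lowest
    mask_bits bits. This is a heuristic, not something the log guarantees.
    """
    mask = ~((1 << mask_bits) - 1)
    target = int(sas_address, 16) & mask
    return sorted(name for wwn, name in aliases.items() if wwn & mask == target)

# A few lines copied from the vdev_id.conf shown above.
CONF = """\
alias s1d1 /dev/disk/by-id/scsi-35000c500837ff247
alias s3d1 /dev/disk/by-id/scsi-35000c500837f14d3
alias s4d1 /dev/disk/by-id/scsi-35000c500837f31f3
alias s4d2 /dev/disk/by-id/scsi-35000c500837f25c7
alias s4d3 /dev/disk/by-id/scsi-35000c500837f14cf
"""

aliases = parse_aliases(CONF)
# sas_address values taken from the two dmesg entries.
print(aliases_for_sas_address("0x5000c500837f31f2", aliases))  # ['s4d1']
print(aliases_for_sas_address("0x5000c500837f25c6", aliases))  # ['s4d2']
```

If the assumption holds, the two dmesg entries would correspond to s4d1 and s4d2, which are enclosure slots 0 and 1 in the log. That is only a suggestion from the address pattern and not a confirmed mapping.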
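The box did not reboot, and apart from the dmesg entries it did not even seem to notice that anything was wrong. I have Googled these entries as best I can but have not found anything relevant.
Thanks for any help!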