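Problem with hot-swapping SAS drives.
Initial data: a Nexenta 4.0.2 server on Supermicro hardware (MB S5520HC, internal RAID controller RMS2LL080/LSI 2008), 12 SAS HDDs 300G, 10 SATA HDDs 1T, 2 SSDs 160G.
The disks are divided into three pools:
- system: 2x300 SAS mirror
- "fast": array of 10 SAS drives (9 + 1 spare)
- "slow": array of 10 SATA drives
At some point, two SAS drives failed: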
NAME STATE READ WRITE CKSUM
sas DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
c1t5000C50007DCF821d0 ONLINE 0 0 0
c1t5000CCA0052FFDD5d0 ONLINE 0 0 0
c1t5000CCA005349D15d0 ONLINE 0 0 0
spare-3 FAULTED 0 0 0
c1t5000CCA00534D625d0 FAULTED 0 0 0 external device fault
c1t5000CCA0053658B5d0 ONLINE 0 0 0
c1t5000CCA00534F2D5d0 ONLINE 0 0 0
c1t5000CCA00534F33Dd0 ONLINE 0 0 0
c1t5000CCA00534F471d0 ONLINE 0 0 0
c1t5000CCA0053571D1d0 ONLINE 0 0 0
c1t5000CCA00535A3A5d0 ONLINE 0 0 0
logs
c0t500151795950C847d0 ONLINE 0 0 0
spares
c1t5000CCA0053658B5d0 INUSE currently in use
pool: syspool
state: DEGRADED
status: One or more devices are faulted in response to persistent errors.
Sufficient replicas exist for the pool to continue functioning in a
degraded state.
action: Replace the faulted device, or use 'zpool clear' to mark the device
repaired.
scan: scrub repaired 0 in 0h3m with 0 errors on Sun Oct 4 03:03:56 2015
config:
NAME STATE READ WRITE CKSUM
syspool DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
c1t5000C500072BB235d0s0 FAULTED 1 20 0 external device fault
c1t5000C500072BE655d0s0 ONLINE 0 0 0
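With this many disks it can help to pull the unhealthy devices out of the `zpool status` text automatically. Below is a minimal Python sketch that assumes the output is laid out as in the listings above. The function name `faulted_disks` and the set of states treated as "bad" are my own choices for illustration and are not part of any Nexenta tool.

```python
import re

# Matches an illumos-style disk name such as c1t5000C500072BB235d0s0.
DISK_RE = re.compile(r"^c\d+t[0-9A-Fa-f]+d\d+(s\d+)?$")

# States treated as unhealthy here (an assumption made for this sketch).
BAD_STATES = {"FAULTED", "DEGRADED", "UNAVAIL", "REMOVED"}


def faulted_disks(status_text):
    """Return (device, state) pairs for disks in an unhealthy state."""
    found = []
    for line in status_text.splitlines():
        fields = line.split()
        # Only leaf disks are reported; pool and vdev rows are skipped.
        if len(fields) >= 2 and DISK_RE.match(fields[0]) and fields[1] in BAD_STATES:
            found.append((fields[0], fields[1]))
    return found


sample = """\
syspool DEGRADED 0 0 0
mirror-0 DEGRADED 0 0 0
c1t5000C500072BB235d0s0 FAULTED 1 20 0 external device fault
c1t5000C500072BE655d0s0 ONLINE 0 0 0
"""

print(faulted_disks(sample))
```

In practice the text could be fed in from `zpool status` saved to a file; the sketch deliberately does not call any system command itself.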
Next, the replacement procedure was carried out for each disk: zpool offline / detach and cfgadm -c unconfigure. After installing one drive (in this case for syspool), the following appears in the log:
Oct 9 18:28:43 nxstor genunix: [ID 408114 kern.info] /pci@0,0/pci8086,3a40@1c/pci8086,350e@0/iport@ff/disk@w5000c50007dcf7ed,0 (sd26) online
Oct 9 18:28:43 nxstor pcplusmp: [ID 805372 kern.info] pcplusmp: pci-ide (pci-ide) instance 0 irq 0x12 vector 0x41 ioapic 0x8 intin 0x12 is bound to cpu 14
Oct 9 18:28:43 nxstor pcplusmp: [ID 805372 kern.info] pcplusmp: pci-ide (pci-ide) instance 1 irq 0x15 vector 0x42 ioapic 0x8 intin 0x15 is bound to cpu 15
Oct 9 18:28:43 nxstor pcplusmp: [ID 805372 kern.info] pcplusmp: pci-ide (pci-ide) instance 0 irq 0x12 vector 0x41 ioapic 0x8 intin 0x12 is bound to cpu 0
Oct 9 18:28:43 nxstor pcplusmp: [ID 805372 kern.info] pcplusmp: pci-ide (pci-ide) instance 1 irq 0x15 vector 0x41 ioapic 0x8 intin 0x15 is bound to cpu 1
Oct 9 18:39:53 nxstor genunix: [ID 888150 kern.warning] WARNING: Device not found in device tree. Skipping device unretire: /pci@0,0/pci8086,3a40@1c/pci8086,350e@0/iport@ff/disk@w5000c500072bb235,0
Oct 9 18:39:53 nxstor genunix: [ID 484473 kern.notice] NOTICE: Not retired: /pci@0,0/pci8086,3a40@1c/pci8086,350e@0/iport@ff/disk@w5000c500072bb235,0
Oct 9 18:39:53 nxstor genunix: [ID 888150 kern.warning] WARNING: Device not found in device tree. Skipping device unretire: /pci@0,0/pci8086,3a40@1c/pci8086,350e@0/iport@ff/disk@w5000cca00534d625,0
disk@w5000cca00534d625,0 and disk@w5000c500072bb235,0 are the two problem disks, which have been removed from the system.
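Matching the device paths in the log against the names in `zpool status` by eye is error-prone, so here is a small Python sketch of the mapping. It assumes the naming seen in this post: the `w...` part of the path is the same WWN that appears, in upper case, in the `c1t...d0` name. The `c1` controller prefix is taken from the listings above, the helper name `path_to_disk` is hypothetical, and the mapping has only been checked against LUN 0.

```python
import re


def path_to_disk(device_path, controller="c1"):
    """Turn '.../disk@w5000c500072bb235,0' into 'c1t5000C500072BB235d0'.

    The controller prefix defaults to c1, as in this post (an assumption).
    Returns None if the path has no disk@w<WWN>,<LUN> component.
    """
    m = re.search(r"disk@w([0-9a-fA-F]{16}),(\d+)", device_path)
    if m is None:
        return None
    wwn = m.group(1).upper()
    lun = m.group(2)
    return f"{controller}t{wwn}d{lun}"


log_path = ("/pci@0,0/pci8086,3a40@1c/pci8086,350e@0/"
            "iport@ff/disk@w5000c500072bb235,0")
print(path_to_disk(log_path))
```

For the syspool disk this gives the name shown as FAULTED in the mirror, without the `s0` slice suffix.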
cfgadm detects the installed HDD:
root@nxstor:/volumes# cfgadm -al
Ap_Id Type Receptacle Occupant Condition
Slot2 sas/hp connected configured ok
c1 scsi-sas connected configured unknown
c1::dsk/c1t5000C50007DCF7EDd0 disk connected configured unknown
<cut>
However, neither format nor the fdisk utility can see this disk:
root@nxstor:/volumes# fdisk /dev/rdsk/c1t5000C50007DCF7EDd0
fdisk: Cannot stat device /dev/rdsk/c1t5000C50007DCF7EDd0.
root@nxstor:/volumes#
root@nxstor:/volumes# ls -la /dev/rdsk/c1t5000C50007DCF7EDd0
/dev/rdsk/c1t5000C50007DCF7EDd0: No such file or directory
In addition, NMS reports the following:
Trigger Name: nms-fmacheck
Fault ID: 5
Error Repeat Count: 5
Error Severity: CRITICAL
Error TimeStamp: Tue Oct 13 14:21:54 2015
Description:
FMA Module: ereport.io.scsi.disk.attach-failure
Details:
List of last errors :
Oct 13 13:48:02.6970 ereport.io.scsi.cmd.disk.tran
<cut>
Oct 13 14:21:05.7075 ereport.io.scsi.cmd.disk.tran
Oct 13 14:21:53.6196 ereport.io.scsi.cmd.disk.dev.rqs.derr
Oct 13 14:21:53.6197 ereport.io.scsi.disk.attach-failure
List of last errors :
=========: Event Details :========
SOURCE: ereport.io.scsi.disk.attach-failure
nvlist version: 0
class = ereport.io.scsi.disk.attach-failure
ena = 0x58420a3d24100401
detector = (embedded nvlist)
nvlist version: 0
version = 0x0
scheme = dev
device-path = /pci@0,0/pci8086,3a40@1c/pci8086,350e@0/iport@ff/disk@w5000c50007dcf7ed,0
devid = id1,sd@n5000c50007dcf7ef
(end detector)
devid = id1,sd@n5000c50007dcf7ef
__ttl = 0x1
__tod = 0x561cdb41 0x24f05942
=========
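If we assume that the new drive is also defective and insert a different one, the picture gets even stranger. Nothing appears in the log at all when the disk is inserted. But when the disk is pulled out, the following is shown: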
Oct 13 14:50:26 nxstor scsi: [ID 107833 kern.notice] /pci@0,0/pci8086,3a40@1c/pci8086,350e@0 (mpt_sas0):
Oct 13 14:50:26 nxstor PhysDiskNum 2 with DevHandle 0x23 in slot 0 for enclosure with handle 0x0 is now offline
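By trial and error, the problem was found to affect only SAS drives. This in turn hints at a problem involving cfgadm, MPxIO, and the mpt_sas driver (it turns out this is a known Solaris issue that is reported as fixed, but it is unclear whether the fix was carried over to OpenSolaris).
What should I do next? Has anyone run into a similar problem?
SAS controller driver and firmware: MPTSAS HBA driver 00.00.00.24, firmware version 5.40.1.0. The SAS drives are of two kinds: Hitachi Ultrastar 15K300 HUS153030VLS300 and Seagate Cheetah 15K.5 ST3300655SS.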