I'm seeing a strange error message in my logs. It starts like this:
:39:35 host1 kernel: [54674279.243416] mpt2sas0: fault_state(0x2651)!
:39:35 host1 kernel: [54674279.243543] mpt2sas0: sending diag reset !!
:39:36 host1 kernel: [54674280.481215] mpt2sas0: diag reset: SUCCESS
:39:36 host1 kernel: [54674280.713443] mpt2sas0: LSISAS2008: FWVersion(07.15.08.00), ChipRevision(0x03), BiosVersion(07.02.03.00)
:39:36 host1 kernel: [54674280.713451] mpt2sas0: Dell 6Gbps SAS HBA: Vendor(0x1000), Device(0x0072), SSVID(0x1028), SSDID(0x1F1C)
:39:36 host1 kernel: [54674280.713455] mpt2sas0: Protocol=(Initiator,Target), Capabilities=(Raid,TLR,EEDP,Snapshot Buffer,Diag Trace Buffer,Task Set Full,NCQ)
:39:36 host1 kernel: [54674280.713518] mpt2sas0: sending port enable !!
:39:43 host1 kernel: [54674287.616666] mpt2sas0: port enable: SUCCESS
:39:43 host1 kernel: [54674287.616814] mpt2sas0: search for end-devices: start
:39:43 host1 kernel: [54674287.617657] scsi target7:0:3: handle(0x0009), sas_addr(0x590b11c410294314), enclosure logical id(0x590b11c007729400), slot(7)
:39:43 host1 kernel: [54674287.617735] scsi target7:0:2: handle(0x000a), sas_addr(0x590b11c41025f914), enclosure logical id(0x590b11c007729400), slot(3)
:39:43 host1 kernel: [54674287.617807] mpt2sas0: search for end-devices: complete
:39:43 host1 kernel: [54674287.617810] mpt2sas0: search for raid volumes: start
:39:43 host1 kernel: [54674287.617813] mpt2sas0: search for responding raid volumes: complete
:39:43 host1 kernel: [54674287.617816] mpt2sas0: search for expanders: start
:39:43 host1 kernel: [54674287.617818] mpt2sas0: search for expanders: complete
:39:43 host1 kernel: [54674287.617833] mpt2sas0: search for end-devices: start
:39:43 host1 kernel: [54674287.618468] scsi target7:0:3: handle(0x0009), sas_addr(0x590b11c410294314), enclosure logical id(0x590b11c007729400), slot(7)
:39:43 host1 kernel: [54674287.618543] scsi target7:0:2: handle(0x000a), sas_addr(0x590b11c41025f914), enclosure logical id(0x590b11c007729400), slot(3)
:39:43 host1 kernel: [54674287.618614] mpt2sas0: search for end-devices: complete
:39:43 host1 kernel: [54674287.618617] mpt2sas0: search for raid volumes: start
:39:43 host1 kernel: [54674287.618619] mpt2sas0: search for responding raid volumes: complete
:39:43 host1 kernel: [54674287.618622] mpt2sas0: search for expanders: start
:39:43 host1 kernel: [54674287.618624] mpt2sas0: search for expanders: complete
:39:43 host1 kernel: [54674287.618632] mpt2sas0: _base_fault_reset_work: hard reset: success
:39:43 host1 kernel: [54674287.618639] mpt2sas0: removing unresponding devices: start
:39:43 host1 kernel: [54674287.618642] mpt2sas0: removing unresponding devices: complete
:39:43 host1 kernel: [54674287.618654] mpt2sas0: scan devices: start
:39:43 host1 kernel: [54674287.619530] mpt2sas0: failure at /build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()!
:39:43 host1 kernel: [54674287.619866] mpt2sas0: failure at /build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()!
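The last message repeats several times per second. Other relevant information:
This is a Dell machine running an old Linux kernel, connected to a Dell disk array over SAS.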
# uname -a
Linux host1 3.2.0-34-generic #53-Ubuntu SMP Thu Nov 15 10:48:16 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux
# modinfo -F version mpt2sas
10.100.00.00
# lspci | grep LSI
01:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 2008 [Falcon] (rev 03)
08:00.0 Serial Attached SCSI controller: LSI Logic / Symbios Logic SAS2008 PCI-Express Fusion-MPT SAS-2 [Falcon] (rev 03)
With more debugging enabled in mpt2sas, the output looks like this:
mpt2sas0: failure at /build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()!
phy-7:4: refresh: parent sas_addr(0x590b11c007729400),
link_rate(0x08), phy(4)
attached_handle(0x0000), sas_addr(0x0000000000000000)
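Other machines connected to different volumes on the same disk array work fine. Neither the disk array nor iDRAC shows anything useful in its logs; everything looks normal. Googling turns up some horror stories in which the RAID eventually drops all of its disks. The problem is not related to unusually high load.
This behavior went on for several hours.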
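To put a number on "several times per second", the failure lines can be bucketed by the kernel timestamp in square brackets. The following is only a minimal sketch: the names `FAILURE_RE` and `failures_per_second` are made up for illustration, and the pattern assumes the log lines look exactly like the ones quoted above.

```python
import re
from collections import Counter

# Assumption: each failure line carries a kernel timestamp such as
# "[54674287.619530]" followed by "mpt2sasN: failure at ... _scsih_add_device()!".
FAILURE_RE = re.compile(
    r"\[\s*(\d+)\.\d+\]\s+mpt2sas\d+: failure at .*_scsih_add_device\(\)!"
)


def failures_per_second(lines):
    """Count failure lines per whole second of the kernel timestamp."""
    counts = Counter()
    for line in lines:
        match = FAILURE_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


# Three lines copied from the log excerpt above; only two are failures.
sample = [
    ":39:43 host1 kernel: [54674287.618654] mpt2sas0: scan devices: start",
    ":39:43 host1 kernel: [54674287.619530] mpt2sas0: failure at /build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()!",
    ":39:43 host1 kernel: [54674287.619866] mpt2sas0: failure at /build/buildd/linux-3.2.0/drivers/scsi/mpt2sas/mpt2sas_scsih.c:5157/_scsih_add_device()!",
]

for second, count in sorted(failures_per_second(sample).items()):
    print(second, count)
```

Against the real log, the same function can be fed an open file handle instead of the `sample` list. Which file the kernel messages end up in depends on the syslog configuration.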
Red Hat appears to document a very similar issue, but it has no solution yet:
https://access.redhat.com/solutions/1990653
Unfortunately, I can't reboot the machine to experiment.