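I have an Apple Xserve RAID connected via Fibre Channel to a Dell PowerEdge R610. The server is mainly used to host Subversion repositories and to store disk images. Over the past six months or so we have had problems with this setup: after a series of errors, the RAID ends up remounted read-only. It seems fine under minimal load, but a few days ago, while some large disk images were being copied to it, it threw a bunch of errors and was remounted read-only.
The actual error messages start with a bunch of task aborts: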
May 17 15:20:09 sub0 kernel: [4661904.506886] mptscsih: ioc1: attempting task abort! (sc=ffff88011d2aea00)
May 17 15:20:09 sub0 kernel: [4661904.506890] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 17 2c ea 00 04 00 00
May 17 15:20:09 sub0 kernel: [4661904.507219] mptscsih: ioc1: task abort: SUCCESS (sc=ffff88011d2aea00)
...
May 17 15:21:42 sub0 kernel: [4661997.476282] mptscsih: ioc1: attempting target reset! (sc=ffff88011e632c00)
May 17 15:21:42 sub0 kernel: [4661997.476284] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 18 14 52 00 04 00 00
May 17 15:21:42 sub0 kernel: [4661997.494532] mptscsih: ioc1: target reset: SUCCESS (sc=ffff88011e632c00)
May 17 15:21:42 sub0 kernel: [4661997.494589] mptscsih: ioc1: attempting bus reset! (sc=ffff88011e632c00)
May 17 15:21:42 sub0 kernel: [4661997.494592] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 18 14 52 00 04 00 00
May 17 15:21:42 sub0 kernel: [4661997.495403] mptscsih: ioc1: bus reset: SUCCESS (sc=ffff88011e632c00)
May 17 15:21:52 sub0 kernel: [4662007.498403] mptscsih: ioc1: attempting host reset! (sc=ffff88011e632c00)
May 17 15:21:52 sub0 kernel: [4662007.498411] mptbase: ioc1: Initiating recovery
May 17 15:22:02 sub0 kernel: [4662016.680666] mptscsih: ioc1: host reset: SUCCESS (sc=ffff88011e632c00)
May 17 15:22:12 sub0 kernel: [4662026.686900] sd 2:0:0:0: Device offlined - not ready after error recovery
...
May 17 15:22:12 sub0 kernel: [4662026.687032] sd 2:0:0:0: [sdb] Unhandled error code
May 17 15:22:12 sub0 kernel: [4662026.687034] sd 2:0:0:0: [sdb] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
May 17 15:22:12 sub0 kernel: [4662026.687037] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 18 14 52 00 04 00 00
May 17 15:22:12 sub0 kernel: [4662026.720494] lost page write due to I/O error on sdb1
...
May 17 15:22:12 sub0 kernel: [4662027.117326] sd 2:0:0:0: [sdb] Unhandled error code
May 17 15:22:12 sub0 kernel: [4662027.117328] sd 2:0:0:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 17 15:22:12 sub0 kernel: [4662027.117331] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 17 2c ea
May 17 15:22:12 sub0 kernel: [4662027.117339] 00 04 00 00
May 17 15:22:12 sub0 kernel: [4662027.122264] sd 2:0:0:0: [sdb] Unhandled error code
May 17 15:22:12 sub0 kernel: [4662027.122266] sd 2:0:0:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 17 15:22:12 sub0 kernel: [4662027.122268] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 17 30 ea 00 04 00 00
May 17 15:22:12 sub0 kernel: [4662027.125053] sd 2:0:0:0: [sdb] Unhandled error code
May 17 15:22:12 sub0 kernel: [4662027.125055] sd 2:0:0:0: [sdb] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
May 17 15:22:12 sub0 kernel: [4662027.125058] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 18 18 52 00 04 00 00
May 17 15:22:12 sub0 kernel: [4662027.127869] sd 2:0:0:0: [sdb] Unhandled error code
May 17 15:22:12 sub0 kernel: [4662027.127871] sd 2:0:0:0: [sdb] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
May 17 15:22:12 sub0 kernel: [4662027.127874] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 18 10 62 00 03 e8 00
...
May 17 15:22:12 sub0 kernel: [4662027.130737] sd 2:0:0:0: [sdb] Unhandled error code
May 17 15:22:12 sub0 kernel: [4662027.405150] sd 2:0:0:0: [sdb] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK
May 17 15:22:12 sub0 kernel: [4662027.405152] sd 2:0:0:0: [sdb] CDB: Write(10): 2a 00 a8 17 34 ea 00 04 00 00
May 17 15:22:12 sub0 kernel: [4662027.410575] JBD: Detected IO errors while flushing file data on sdb1
May 17 15:22:13 sub0 kernel: [4662028.182860] JBD: Detected IO errors while flushing file data on sdb1
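At this point the array is remounted read-only. I have no idea where the problem might lie (I am still fairly new to dealing with this kind of Fibre Channel/RAID array).
System information (let me know if I can provide anything else that might be useful):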
sysadmin@sub0:~$ lspci (snipped to the relevant stuff, I presume)
03:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068E PCI-Express Fusion-MPT SAS (rev 08)
05:00.0 Fibre Channel: LSI Logic / Symbios Logic FC949ES Fibre Channel Adapter (rev 02)
05:00.1 Fibre Channel: LSI Logic / Symbios Logic FC949ES Fibre Channel Adapter (rev 02)
sysadmin@sub0:~$ cat /proc/mpt/summary
ioc0: LSIFC949E, FwRev=01031700h, Ports=1, MaxQ=1023, LanAddr=00:06:2B:1B:89:14, IRQ=40
ioc1: LSISAS1068E B3, FwRev=00192f00h, Ports=1, MaxQ=266, IRQ=16
ioc2: LSIFC949E, FwRev=01031700h, Ports=1, MaxQ=1023, LanAddr=00:06:2B:1B:89:15, IRQ=50
sysadmin@sub0:~$ cat /proc/mpt/version
mptlinux-3.04.12
Fusion MPT base driver
Fusion MPT FC host driver
Fusion MPT SAS host driver
sysadmin@sub0:~$ cat /etc/issue
Ubuntu 10.04.2 LTS \n \l
Full /var/log/messages: https://gist.github.com/96df4b5b9ac7ec46f74c#file_messages
Full /var/log/kern.log: https://gist.github.com/96df4b5b9ac7ec46f74c#file_kern.log
Thank you for taking the time to read this and for any help you can offer.
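Answer 1
It would help to know more about how the RAID is actually configured (volumes, sizes, RAID level, stripe and block sizes, and so on) and whether you are using multipathing.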
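Some of this can be read straight from the log already. As a rough sketch (not something the original poster ran), the Write(10) CDBs printed by the kernel can be decoded using the standard SCSI layout: byte 0 is the opcode (0x2a), bytes 2 to 5 are the big-endian logical block address, and bytes 7 to 8 are the transfer length in blocks. The function name `decode_write10` is made up for illustration, and converting blocks to bytes requires the device's logical block size, which the post does not state.

```python
def decode_write10(cdb_hex):
    """Decode a SCSI WRITE(10) CDB given as space-separated hex bytes.

    Returns (lba, blocks). Illustrative helper, not part of any library.
    """
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")     # bytes 2-5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, blocks


if __name__ == "__main__":
    # CDBs copied from the log excerpt in the question.
    for cdb in ("2a 00 a8 17 2c ea 00 04 00 00",
                "2a 00 a8 18 10 62 00 03 e8 00"):
        lba, blocks = decode_write10(cdb)
        print(f"LBA={lba:#x} blocks={blocks}")
```

Most of the failed writes in the excerpt carry a transfer length of 0x0400 (1024 blocks); once the stripe size is known, that is the figure to compare it against.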
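You are seeing error-handling escalation because the aborted commands are not being handled to the satisfaction of the low-level driver and the SCSI midlayer, which is why the recovery severity keeps climbing. Establishing how it starts would take a good deal of analysis, such as recording a blktrace. With this very limited information, all I can suggest is upgrading your driver by using one of the LTS backport kernels (Oneiric, for example) and trying to reproduce the problem; the mptsas driver you are using is very old. If you look around, you may also be able to update that driver with a DKMS package.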
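To see the escalation in a log like the one in the question, the recovery attempts can be grouped by the command pointer (`sc=...`) that mptscsih prints. The following is a minimal sketch, assuming the message format matches the excerpt exactly; `escalation_by_command` and `SAMPLE` are illustrative names, and the sample lines are copied from the question.

```python
import re

# Matches lines such as:
#   mptscsih: ioc1: attempting target reset! (sc=ffff88011e632c00)
ATTEMPT_RE = re.compile(
    r"mptscsih: (?P<ioc>ioc\d+): attempting "
    r"(?P<step>task abort|target reset|bus reset|host reset)! "
    r"\(sc=(?P<sc>[0-9a-f]+)\)"
)


def escalation_by_command(lines):
    """Return {sc pointer: [recovery steps in log order]}."""
    steps = {}
    for line in lines:
        m = ATTEMPT_RE.search(line)
        if m:
            steps.setdefault(m.group("sc"), []).append(m.group("step"))
    return steps


# Lines copied from the kernel log in the question.
SAMPLE = """\
May 17 15:20:09 sub0 kernel: [4661904.506886] mptscsih: ioc1: attempting task abort! (sc=ffff88011d2aea00)
May 17 15:20:09 sub0 kernel: [4661904.507219] mptscsih: ioc1: task abort: SUCCESS (sc=ffff88011d2aea00)
May 17 15:21:42 sub0 kernel: [4661997.476282] mptscsih: ioc1: attempting target reset! (sc=ffff88011e632c00)
May 17 15:21:42 sub0 kernel: [4661997.494589] mptscsih: ioc1: attempting bus reset! (sc=ffff88011e632c00)
May 17 15:21:52 sub0 kernel: [4662007.498403] mptscsih: ioc1: attempting host reset! (sc=ffff88011e632c00)
May 17 15:22:12 sub0 kernel: [4662026.686900] sd 2:0:0:0: Device offlined - not ready after error recovery
"""

if __name__ == "__main__":
    for sc, steps in escalation_by_command(SAMPLE.splitlines()).items():
        print(sc, " -> ".join(steps))
```

A command that walks through target reset, bus reset, and host reset before the device is offlined is the climbing severity described above; running this over the full kern.log would show how often that happens and for which commands.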
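If the problem persists, you have to consider whether you are in a position to dig into this and solve it yourself rather than seeking additional support from your operating system vendor. Problems like this are exactly what support contracts are for. Whichever route you choose, be prepared to invest weeks rather than days in determining the root cause. Good luck.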