QLE2562 HBA / qla2xxx problems on CentOS 5.3

I have several Linux servers (SunFire X4270) running CentOS 5.3 (kernel-2.6.18-128.1.16.el5) with QLogic FC-8 QLE2562 HBAs. I am having a lot of trouble with these new servers. One of them logs the following messages every second:

qla2xxx 0000:2f:00.0: Passthru CT request failed to login management server
qla2xxx 0000:2f:00.0: Passthru CT failed
qla2xxx 0000:2f:00.1: Passthru CT request failed to login management server
qla2xxx 0000:2f:00.1: Passthru CT failed
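To see which PCI function produces which message, and how often, the log can be tallied per device. The following is only a rough sketch in Python 3, meant to be run on a copy of the log on any machine that has Python 3 (the interpreter shipped with CentOS 5.3 is too old for it). The function name `count_qla_messages` and the regular expression are my own assumptions about the line format shown above, not part of the driver or of any QLogic tool.

```python
import re
from collections import Counter

# Assumed line format, taken from the messages quoted above:
#   qla2xxx <domain:bus:device.function>: <message text>
QLA_LINE = re.compile(
    r"qla2xxx\s+([0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]):\s*(.+)"
)


def count_qla_messages(lines):
    """Count qla2xxx messages, keyed by (PCI address, message text)."""
    counts = Counter()
    for line in lines:
        match = QLA_LINE.search(line)
        if match:
            counts[(match.group(1), match.group(2).strip())] += 1
    return counts


# Sample input copied from the messages above, plus one unrelated line
# that the pattern is expected to ignore.
sample_log = [
    "qla2xxx 0000:2f:00.0: Passthru CT request failed to login management server",
    "qla2xxx 0000:2f:00.0: Passthru CT failed",
    "qla2xxx 0000:2f:00.1: Passthru CT request failed to login management server",
    "qla2xxx 0000:2f:00.1: Passthru CT failed",
    "NMI Watchdog detected LOCKUP on CPU 13",
]

for (device, message), n in sorted(count_qla_messages(sample_log).items()):
    print(n, device, message)
```

If both functions of the card (0000:2f:00.0 and 0000:2f:00.1) report the same failures at the same rate, that points at something common to the whole card or its slot rather than at a single port.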

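In addition, several of my servers hit the problem shown below. I tried several CentOS 5.3 kernels, 2.6.18-128.el5 and 2.6.18-128.1.16.el5 (the latest), and also the latest driver from QLogic (which embeds firmware 4.06 for the QLE2562), all without success. Strangely, I have another server with the same hardware/software configuration that runs fine (stable...). Sun support (available for these servers) has not been able to solve the problem yet... Any ideas?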

qla2xxx_eh_abort(8): aborting sp ffff81037d86ebc0 from RISC. pid=952 sp->state=7 q->q_flag=2
qla2xxx 0000:2f:00.1: Mailbox command timeout occurred. Issuing ISP abort.
NMI Watchdog detected LOCKUP on CPU 13
CPU 13
Modules linked in: autofs4 sunrpc ipv6 xfrm_nalgo crypto_api cpufreq_ondemand acpi_cpufreq freq_table dm_mirror dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev qla2xxx(U) qla2xxx_conf(U) igb i2c_i801 intermodule(U) i2c_core sg pcspkr dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache ahci libata shpchp mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 2982, comm: scsi_eh_8 Tainted: G      2.6.18-128.el5 #1
RIP: 0010:[<ffffffff8000c6f2>]  [<ffffffff8000c6f2>] __delay+0x8/0x10
RSP: 0018:ffff81067dc7db88  EFLAGS: 00000097
RAX: 00000000ecd06b41 RBX: 000000000018c42b RCX: 00000000ecd05808
RDX: 0000000000000324 RSI: 0000000000000046 RDI: 0000000000003689
RBP: ffffc20000034000 R08: 0000000000000002 R09: ffff81067dc7db54
R10: 0000000000000001 R11: ffffffff80213fbd R12: ffff81037e84c4f8
R13: 0000000000000246 R14: 0000000000000001 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff81067fc46140(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000006bb424 CR3: 000000067d035000 CR4: 00000000000006e0
Process scsi_eh_8 (pid: 2982, threadinfo ffff81067dc7c000, task ffff81010c6ec040)
Stack:  ffffffff8827f743 ffff81037e84c4f8 ffff81067dc7dc90 ffff81060000dc20
 ffff81037fa461c8 ffff81037e84c4f8 ffff81067dc7dc90 0000000000000100
 ffffffff88285488 ffff81037fa461c8 ffff81037e84c4f8 ffff81067dc7dc90
Call Trace:
 [<ffffffff8827f743>] :qla2xxx:qla2x00_reset_chip+0x157/0x47e
 [<ffffffff88285488>] :qla2xxx:qla2x00_abort_isp+0x6c/0x70b
 [<ffffffff88286dfd>] :qla2xxx:qla2x00_mailbox_command+0x48e/0x553
 [<ffffffff88286960>] :qla2xxx:qla2x00_mbx_sem_timeout+0x0/0xf
 [<ffffffff882886f5>] :qla2xxx:qla2x00_issue_iocb_timeout+0x5f/0xc0
 [<ffffffff88288fd0>] :qla2xxx:qla24xx_abort_command+0xf9/0x1a5
 [<ffffffff88289099>] :qla2xxx:qla2x00_abort_command+0x1d/0x124
 [<ffffffff80064c08>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff8827f1e6>] :qla2xxx:qla2xxx_eh_abort+0x9f8/0xba0
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8807919f>] :scsi_mod:scsi_error_handler+0x290/0x4ac
 [<ffffffff88078f0f>] :scsi_mod:scsi_error_handler+0x0/0x4ac
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032360>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032262>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


Code: 29 c8 48 39 f8 72 f5 c3 41 54 83 3d ad d8 3c 00 00 49 89 f4
Kernel panic - not syncing: nmi watchdog
 BUG: warning at kernel/panic.c:137/panic() (Tainted: G     )

Call Trace:
 <NMI>  [<ffffffff8008efff>] panic+0x1da/0x1eb
 [<ffffffff8006ba21>] _show_stack+0xdb/0xea
 [<ffffffff8006bb14>] show_registers+0xe4/0x100
 [<ffffffff8006537d>] die_nmi+0x66/0xa3
 [<ffffffff80065ac3>] nmi_watchdog_tick+0x157/0x1d3
 [<ffffffff800656e1>] default_do_nmi+0x81/0x225
 [<ffffffff8006594e>] do_nmi+0x43/0x61
 [<ffffffff80064fa7>] nmi+0x7f/0x88
 [<ffffffff80213fbd>] pci_mmcfg_read+0x0/0x92
 [<ffffffff8000c6f2>] __delay+0x8/0x10
 <<EOE>>  [<ffffffff8827f743>] :qla2xxx:qla2x00_reset_chip+0x157/0x47e
 [<ffffffff88285488>] :qla2xxx:qla2x00_abort_isp+0x6c/0x70b
 [<ffffffff88286dfd>] :qla2xxx:qla2x00_mailbox_command+0x48e/0x553
 [<ffffffff88286960>] :qla2xxx:qla2x00_mbx_sem_timeout+0x0/0xf
 [<ffffffff882886f5>] :qla2xxx:qla2x00_issue_iocb_timeout+0x5f/0xc0
 [<ffffffff88288fd0>] :qla2xxx:qla24xx_abort_command+0xf9/0x1a5
 [<ffffffff88289099>] :qla2xxx:qla2x00_abort_command+0x1d/0x124
 [<ffffffff80064c08>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff8827f1e6>] :qla2xxx:qla2xxx_eh_abort+0x9f8/0xba0
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8807919f>] :scsi_mod:scsi_error_handler+0x290/0x4ac
 [<ffffffff88078f0f>] :scsi_mod:scsi_error_handler+0x0/0x4ac
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032360>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032262>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

BUG: warning at drivers/input/serio/i8042.c:846/i8042_panic_blink() (Tainted: G     )

Call Trace:
 <NMI>  [<ffffffff801fa015>] i8042_panic_blink+0x112/0x2a5
 [<ffffffff8008efa5>] panic+0x180/0x1eb
 [<ffffffff8006ba21>] _show_stack+0xdb/0xea
 [<ffffffff8006bb14>] show_registers+0xe4/0x100
 [<ffffffff8006537d>] die_nmi+0x66/0xa3
 [<ffffffff80065ac3>] nmi_watchdog_tick+0x157/0x1d3
 [<ffffffff800656e1>] default_do_nmi+0x81/0x225
 [<ffffffff8006594e>] do_nmi+0x43/0x61
 [<ffffffff80064fa7>] nmi+0x7f/0x88
 [<ffffffff80213fbd>] pci_mmcfg_read+0x0/0x92
 [<ffffffff8000c6f2>] __delay+0x8/0x10
 <<EOE>>  [<ffffffff8827f743>] :qla2xxx:qla2x00_reset_chip+0x157/0x47e
 [<ffffffff88285488>] :qla2xxx:qla2x00_abort_isp+0x6c/0x70b
 [<ffffffff88286dfd>] :qla2xxx:qla2x00_mailbox_command+0x48e/0x553
 [<ffffffff88286960>] :qla2xxx:qla2x00_mbx_sem_timeout+0x0/0xf
 [<ffffffff882886f5>] :qla2xxx:qla2x00_issue_iocb_timeout+0x5f/0xc0
 [<ffffffff88288fd0>] :qla2xxx:qla24xx_abort_command+0xf9/0x1a5
 [<ffffffff88289099>] :qla2xxx:qla2x00_abort_command+0x1d/0x124
 [<ffffffff80064c08>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff8827f1e6>] :qla2xxx:qla2xxx_eh_abort+0x9f8/0xba0
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8807919f>] :scsi_mod:scsi_error_handler+0x290/0x4ac
 [<ffffffff88078f0f>] :scsi_mod:scsi_error_handler+0x0/0x4ac
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032360>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032262>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

BUG: warning at drivers/input/serio/i8042.c:849/i8042_panic_blink() (Tainted: G     )

Call Trace:
 <NMI>  [<ffffffff801fa0fe>] i8042_panic_blink+0x1fb/0x2a5
 [<ffffffff8008efa5>] panic+0x180/0x1eb
 [<ffffffff8006ba21>] _show_stack+0xdb/0xea
 [<ffffffff8006bb14>] show_registers+0xe4/0x100
 [<ffffffff8006537d>] die_nmi+0x66/0xa3
 [<ffffffff80065ac3>] nmi_watchdog_tick+0x157/0x1d3
 [<ffffffff800656e1>] default_do_nmi+0x81/0x225
 [<ffffffff8006594e>] do_nmi+0x43/0x61
 [<ffffffff80064fa7>] nmi+0x7f/0x88
 [<ffffffff80213fbd>] pci_mmcfg_read+0x0/0x92
 [<ffffffff8000c6f2>] __delay+0x8/0x10
 <<EOE>>  [<ffffffff8827f743>] :qla2xxx:qla2x00_reset_chip+0x157/0x47e
 [<ffffffff88285488>] :qla2xxx:qla2x00_abort_isp+0x6c/0x70b
 [<ffffffff88286dfd>] :qla2xxx:qla2x00_mailbox_command+0x48e/0x553
 [<ffffffff88286960>] :qla2xxx:qla2x00_mbx_sem_timeout+0x0/0xf
 [<ffffffff882886f5>] :qla2xxx:qla2x00_issue_iocb_timeout+0x5f/0xc0
 [<ffffffff88288fd0>] :qla2xxx:qla24xx_abort_command+0xf9/0x1a5
 [<ffffffff88289099>] :qla2xxx:qla2x00_abort_command+0x1d/0x124
 [<ffffffff80064c08>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff8827f1e6>] :qla2xxx:qla2xxx_eh_abort+0x9f8/0xba0
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8807919f>] :scsi_mod:scsi_error_handler+0x290/0x4ac
 [<ffffffff88078f0f>] :scsi_mod:scsi_error_handler+0x0/0x4ac
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032360>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032262>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

BUG: warning at drivers/input/serio/i8042.c:851/i8042_panic_blink() (Tainted: G     )

Call Trace:
 <NMI>  [<ffffffff801fa17b>] i8042_panic_blink+0x278/0x2a5
 [<ffffffff8008efa5>] panic+0x180/0x1eb
 [<ffffffff8006ba21>] _show_stack+0xdb/0xea
 [<ffffffff8006bb14>] show_registers+0xe4/0x100
 [<ffffffff8006537d>] die_nmi+0x66/0xa3
 [<ffffffff80065ac3>] nmi_watchdog_tick+0x157/0x1d3
 [<ffffffff800656e1>] default_do_nmi+0x81/0x225
 [<ffffffff8006594e>] do_nmi+0x43/0x61
 [<ffffffff80064fa7>] nmi+0x7f/0x88
 [<ffffffff80213fbd>] pci_mmcfg_read+0x0/0x92
 [<ffffffff8000c6f2>] __delay+0x8/0x10
 <<EOE>>  [<ffffffff8827f743>] :qla2xxx:qla2x00_reset_chip+0x157/0x47e
 [<ffffffff88285488>] :qla2xxx:qla2x00_abort_isp+0x6c/0x70b
 [<ffffffff88286dfd>] :qla2xxx:qla2x00_mailbox_command+0x48e/0x553
 [<ffffffff88286960>] :qla2xxx:qla2x00_mbx_sem_timeout+0x0/0xf
 [<ffffffff882886f5>] :qla2xxx:qla2x00_issue_iocb_timeout+0x5f/0xc0
 [<ffffffff88288fd0>] :qla2xxx:qla24xx_abort_command+0xf9/0x1a5
 [<ffffffff88289099>] :qla2xxx:qla2x00_abort_command+0x1d/0x124
 [<ffffffff80064c08>] _spin_unlock_irqrestore+0x8/0x9
 [<ffffffff8827f1e6>] :qla2xxx:qla2xxx_eh_abort+0x9f8/0xba0
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff8807919f>] :scsi_mod:scsi_error_handler+0x290/0x4ac
 [<ffffffff88078f0f>] :scsi_mod:scsi_error_handler+0x0/0x4ac
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032360>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff8009d909>] keventd_create_kthread+0x0/0xc4
 [<ffffffff80032262>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11

Answer 1

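If "qla2xxx 0000:2f:00.0: Passthru CT request failed to login management server" appears on only one server, it may be a hardware problem with the card. Have you tried putting this card in another server?
As for the server that runs fine, I would try the same test: move its card from server A to server B and see whether server B becomes stable, or whether server A stays stable.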

Answer 2

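Thanks radius. It seems that "Passthru CT request failed" is a hardware problem (not fully verified yet). The other, bigger problem is related to the PCIe Active Riser cards we have (Sun X4270 hardware): these risers contain PCIe switches that conflict with the QLE2562 (the problem was verified/reproduced by Sun support level 2)... If you hit this problem on Sun hardware, try putting the HBA in a non-switched PCIe slot (slots 0 and 3 on the X4270, since Riser 0 is not an active riser; it sits on the 16x slot). Sun is working on a fix for its machines so that users can put the HBA in any slot.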
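One way to check whether an HBA currently sits behind a PCIe switch is to look at its resolved sysfs path: a switch shows up as extra bridge levels between the root port and the device. The sketch below is a heuristic only, written in Python 3. The helper names (`upstream_bridges`, `looks_switched`, `resolve`) and both example paths are hypothetical illustrations, not output from a real X4270. Other bridge types also add levels, so a positive result should be confirmed against the slot documentation.

```python
import os
import re

# A sysfs path component that is a full PCI address, e.g. 0000:2f:00.0
PCI_ADDR = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


def upstream_bridges(sysfs_path):
    """Return the PCI addresses above the device, root side first."""
    parts = [p for p in sysfs_path.split("/") if PCI_ADDR.match(p)]
    return parts[:-1]  # the last component is the device itself


def looks_switched(sysfs_path):
    """Heuristic: more than one bridge above the device suggests a switch
    (or some other intermediate bridge) between root port and device."""
    return len(upstream_bridges(sysfs_path)) > 1


def resolve(pci_address):
    """Resolve a PCI address to its real sysfs path (on a Linux host)."""
    return os.path.realpath(os.path.join("/sys/bus/pci/devices", pci_address))


# Hypothetical paths for illustration only.
behind_switch = (
    "/sys/devices/pci0000:00/0000:00:03.0/0000:2d:00.0/0000:2e:04.0/0000:2f:00.0"
)
direct_slot = "/sys/devices/pci0000:00/0000:00:07.0/0000:0d:00.0"

for path in (behind_switch, direct_slot):
    print(path.rsplit("/", 1)[-1], upstream_bridges(path), looks_switched(path))
```

On the affected host, `looks_switched(resolve("0000:2f:00.0"))` would apply the same check to the real device, assuming the address from the log messages above is still the one in use.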

Answer 3

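qla2xxx_eh_abort(8): aborting sp. This problem is entirely related to the HBA card installed in the Sun Blade server. We actually ran into it ourselves recently, on 16 December 2012. So replace the HBA card; that will resolve the problem completely.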
