We have 524 RHEL machines in our Hadoop cluster (all DELL hardware). All of them run RHEL 7.2, which ships an old kernel version:
uname -r
3.10.0-327.el7.x86_64
Last week we saw the following kernel messages on 64 of the machines:
[Wed Mar 15 00:45:11 2023] i40e 0000:81:00.0 pap3: VSI_seid 388, Hung TX queue 43, tx_pending_hw: 3, NTC:0x90, HWB: 0x99, NTU: 0x9c, TAIL: 0x9c
[Wed Mar 15 00:45:11 2023] i40e 0000:81:00.0 pap3: VSI_seid 388, Issuing force_wb for TX queue 43, Interrupt Reg: 0x0
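These kernel messages made me consider a kernel upgrade, or a RHEL upgrade from 7.2 to 7.9.
As far as we understand, upgrading every machine to RHEL 7.9 is a big task and will take time.
But since the information in these messages is not very clear to us, I would appreciate other people's opinions.
More details from the dmesg output are shown here: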
[Thu Mar 9 16:27:14 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME ---DIGITS_0058--- SOCKET 0 APIC 0
[Thu Mar 9 16:27:15 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x284b463 offset:0xa80 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:0)
[Thu Mar 9 16:27:37 2023] mce: [Hardware Error]: Machine check events logged
[Thu Mar 9 16:47:01 2023] {12}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Mar 9 16:47:01 2023] {12}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Mar 9 16:47:01 2023] {12}[Hardware Error]: event severity: corrected
[Thu Mar 9 16:47:01 2023] {12}[Hardware Error]: Error 0, type: corrected
[Thu Mar 9 16:47:01 2023] {12}[Hardware Error]: fru_text: A1
[Thu Mar 9 16:47:01 2023] {12}[Hardware Error]: section_type: memory error
[Thu Mar 9 16:47:01 2023] {12}[Hardware Error]: error_status: 0x---DIGITS_0038---
[Thu Mar 9 16:47:01 2023] {12}[Hardware Error]: physical_address: 0x---DIGITS_0057---b467b80
[Thu Mar 9 16:47:01 2023] {12}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 3 row: 15545 column: 1000
[Thu Mar 9 16:47:01 2023] {12}[Hardware Error]: error_type: 2, single-bit ECC
[Thu Mar 9 16:47:01 2023] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Mar 9 16:47:01 2023] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: ---DIGITS_0039---f
[Thu Mar 9 16:47:01 2023] EDAC sbridge MC0: TSC b656932cc18c
[Thu Mar 9 16:47:01 2023] EDAC sbridge MC0: ADDR 284b467b80
[Thu Mar 9 16:47:01 2023] EDAC sbridge MC0: MISC 0
[Thu Mar 9 16:47:01 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME ---DIGITS_0059--- SOCKET 0 APIC 0
[Thu Mar 9 16:47:01 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x284b467 offset:0xb80 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:0)
[Thu Mar 9 16:47:09 2023] {13}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Mar 9 16:47:09 2023] {13}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Mar 9 16:47:09 2023] {13}[Hardware Error]: event severity: corrected
[Thu Mar 9 16:47:09 2023] {13}[Hardware Error]: Error 0, type: corrected
[Thu Mar 9 16:47:09 2023] {13}[Hardware Error]: fru_text: A1
[Thu Mar 9 16:47:09 2023] {13}[Hardware Error]: section_type: memory error
[Thu Mar 9 16:47:09 2023] {13}[Hardware Error]: error_status: 0x---DIGITS_0038---
[Thu Mar 9 16:47:09 2023] {13}[Hardware Error]: physical_address: 0x---DIGITS_0057---b465180
[Thu Mar 9 16:47:09 2023] {13}[Hardware Error]: node: 0 card: 0 module: 0 rank: 0 bank: 3 row: 15545 column: 832
[Thu Mar 9 16:47:09 2023] {13}[Hardware Error]: error_type: 2, single-bit ECC
[Thu Mar 9 16:47:09 2023] mce: [Hardware Error]: Machine check events logged
[Thu Mar 9 16:47:09 2023] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Mar 9 16:47:09 2023] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: ---DIGITS_0039---f
[Thu Mar 9 16:47:09 2023] EDAC sbridge MC0: TSC b65ab992762c
[Thu Mar 9 16:47:09 2023] EDAC sbridge MC0: ADDR 284b465180
[Thu Mar 9 16:47:09 2023] EDAC sbridge MC0: MISC 0
[Thu Mar 9 16:47:09 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME ---DIGITS_0060--- SOCKET 0 APIC 0
[Thu Mar 9 16:47:10 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x284b465 offset:0x180 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:0)
[Thu Mar 9 16:47:37 2023] mce: [Hardware Error]: Machine check events logged
[Thu Mar 9 16:54:47 2023] perf: interrupt took too long (587547 > 458393), lowering kernel.perf_event_max_sample_rate to 1000
[Thu Mar 9 19:04:47 2023] INFO: NMI handler (ghes_notify_nmi) took too long to run: 761611.066 msecs
[Thu Mar 9 19:08:06 2023] INFO: NMI handler (ghes_notify_nmi) took too long to run: 418088.094 msecs
[Thu Mar 9 19:23:55 2023] INFO: NMI handler (ghes_notify_nmi) took too long to run: 377227.104 msecs
[Thu Mar 9 19:59:52 2023] hrtimer: interrupt took ---DIGITS_0061--- ns
[Thu Mar 9 20:32:02 2023] perf: interrupt took too long (998530 > 734433), lowering kernel.perf_event_max_sample_rate to 1000
[Thu Mar 9 20:35:25 2023] {14}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[Thu Mar 9 20:35:25 2023] {14}[Hardware Error]: It has been corrected by h/w and requires no further action
[Thu Mar 9 20:35:25 2023] {14}[Hardware Error]: event severity: corrected
[Thu Mar 9 20:35:25 2023] {14}[Hardware Error]: Error 0, type: corrected
[Thu Mar 9 20:35:25 2023] {14}[Hardware Error]: fru_text: A5
[Thu Mar 9 20:35:25 2023] {14}[Hardware Error]: section_type: memory error
[Thu Mar 9 20:35:25 2023] {14}[Hardware Error]: error_status: 0x---DIGITS_0038---
[Thu Mar 9 20:35:25 2023] {14}[Hardware Error]: physical_address: 0x0000001ce008b940
[Thu Mar 9 20:35:25 2023] {14}[Hardware Error]: node: 0 card: 0 module: 1 rank: 0 bank: 0 row: 58882 column: 224
[Thu Mar 9 20:35:25 2023] {14}[Hardware Error]: error_type: 2, single-bit ECC
[Thu Mar 9 20:35:25 2023] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[Thu Mar 9 20:35:25 2023] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: ---DIGITS_0039---f
[Thu Mar 9 20:35:25 2023] EDAC sbridge MC0: TSC d1c21b47e1a2
[Thu Mar 9 20:35:25 2023] EDAC sbridge MC0: ADDR 1ce008b940
[Thu Mar 9 20:35:25 2023] EDAC sbridge MC0: MISC 0
[Thu Mar 9 20:35:25 2023] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME ---DIGITS_0062--- SOCKET 0 APIC 0
[Thu Mar 9 20:35:26 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x1ce008b offset:0x940 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:4)
[Thu Mar 9 20:37:37 2023] mce: [Hardware Error]: Machine check events logged
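To see at a glance which DIMMs the corrected-error lines in a log like this point at, one option is to tally the `EDAC MC... CE memory read error` lines per DIMM label. The following is only a minimal sketch: the function name `count_corrected_errors` and the regular expression are assumptions based on the line format shown above, not part of any EDAC tooling.

```python
import re
from collections import Counter

# Pattern inferred from the "EDAC MC0: 0 CE memory read error on ..." lines above.
# It deliberately does not match the "EDAC sbridge MC0: ..." detail lines.
CE_LINE = re.compile(r"EDAC MC\d+: \d+ CE memory read error on (?P<dimm>\S+)")


def count_corrected_errors(dmesg_text):
    """Count corrected (CE) memory read errors per DIMM label (hypothetical helper)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = CE_LINE.search(line)
        if match:
            counts[match.group("dimm")] += 1
    return counts


# Three lines copied from the dmesg output above.
sample = """\
[Thu Mar 9 16:27:15 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x284b463 offset:0xa80 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:0)
[Thu Mar 9 16:47:10 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x284b465 offset:0x180 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:0)
[Thu Mar 9 20:35:26 2023] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x1ce008b offset:0x940 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:0 channel_mask:1 rank:4)
"""

for dimm, count in count_corrected_errors(sample).items():
    print(dimm, count)
```

Run against the full `dmesg` text of each host, this kind of tally would show whether the corrected errors cluster on one DIMM or are spread across several.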
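Notes and key points
These i40e TX queue errors probably come from the Intel i40e driver; I found several references to the same error. The current driver version, 2.22.18, was released on February 14, 2023.
It seems safe to identify i40e as the source of the messages, so I searched the web for i40e "Hung TX queue" and i40e "Issuing force_wb for TX queue". The results I found date from 2015 to 2017, and my conclusion is that this is a driver malfunction, possibly a bug. I then checked what Intel provides and am passing my findings on to you. You need to verify whether they apply to your situation and decide on a course of action.
i40e is a kernel driver, so a kernel update may also update the driver. If you want the latest version, Intel also provides installation instructions; the newest release will not be included in any kernel.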
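Before deciding on a course of action across 64 machines, it may help to count how often each interface and queue reports a hang. The sketch below assumes the message format quoted in the question; the helper name `count_hung_tx` and the regular expression are illustrative, not part of the i40e driver or any Intel tool.

```python
import re
from collections import Counter

# Pattern inferred from the "Hung TX queue" line quoted in the question.
HUNG_TX = re.compile(
    r"i40e (?P<pci>\S+) (?P<iface>\S+): VSI_seid \d+, Hung TX queue (?P<queue>\d+)"
)


def count_hung_tx(dmesg_text):
    """Count 'Hung TX queue' events per (interface, queue) pair (hypothetical helper)."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = HUNG_TX.search(line)
        if match:
            counts[(match.group("iface"), int(match.group("queue")))] += 1
    return counts


# The two i40e lines from the question.
sample = """\
[Wed Mar 15 00:45:11 2023] i40e 0000:81:00.0 pap3: VSI_seid 388, Hung TX queue 43, tx_pending_hw: 3, NTC:0x90, HWB: 0x99, NTU: 0x9c, TAIL: 0x9c
[Wed Mar 15 00:45:11 2023] i40e 0000:81:00.0 pap3: VSI_seid 388, Issuing force_wb for TX queue 43, Interrupt Reg: 0x0
"""

for (iface, queue), count in count_hung_tx(sample).items():
    print(iface, queue, count)
```

Only the "Hung TX queue" line is counted; the follow-up "Issuing force_wb" line belongs to the same event and is skipped on purpose.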
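To check which i40e version a machine is actually running before and after a kernel update, one can read the output of `ethtool -i <interface>`, which reports the driver name and version. The sketch below parses that output into a dictionary. The helper names and the sample text are assumptions for illustration (the version string in the sample is made up), and it assumes Python 3 and `ethtool` are available on the host.

```python
import subprocess


def parse_ethtool_info(text):
    """Turn `ethtool -i <iface>` output into a dict of key -> value."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as PCI addresses stay intact.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def driver_info(iface):
    """Run `ethtool -i` for one interface (hypothetical wrapper, needs ethtool installed)."""
    output = subprocess.check_output(["ethtool", "-i", iface], universal_newlines=True)
    return parse_ethtool_info(output)


# Illustrative output only; the version value is a placeholder, not a real release.
sample = """\
driver: i40e
version: 1.0.0-example
bus-info: 0000:81:00.0
"""

info = parse_ethtool_info(sample)
print(info["driver"], info["version"], info["bus-info"])
```

On a real host, `driver_info("pap3")` would return the same kind of dictionary for the interface named in the kernel messages, which makes it easy to compare driver versions across the cluster.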
References