We have a RHEL 7 server, and the dmesg log shows the following details:
[13901018.980859] {9}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[13901018.980868] {9}[Hardware Error]: It has been corrected by h/w and requires no further action
[13901018.980870] {9}[Hardware Error]: event severity: corrected
[13901018.980872] {9}[Hardware Error]: Error 0, type: corrected
[13901018.980873] {9}[Hardware Error]: fru_text: A8
[13901018.980875] {9}[Hardware Error]: section_type: memory error
[13901018.980876] {9}[Hardware Error]: error_status: 0x0000000000000400
[13901018.980878] {9}[Hardware Error]: physical_address: 0x0000000ffd6bb600
[13901018.980880] {9}[Hardware Error]: node: 0 card: 3 module: 1 rank: 1 bank: 2 row: 30682 column: 728
[13901018.980882] {9}[Hardware Error]: error_type: 2, single-bit ECC
[13901018.980899] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[13901018.980901] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: 940000000000009f
[13901018.980903] EDAC sbridge MC0: TSC 89ad682bcacc05
[13901018.980905] EDAC sbridge MC0: ADDR ffd6bb600
[13901018.980906] EDAC sbridge MC0: MISC 0
[13901018.980907] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1575370818 SOCKET 0 APIC 0
[13901019.271775] EDAC MC0: 0 CE memory read error on CPU_SrcID#0_Ha#1_Chan#1_DIMM#1 (channel:5 slot:1 page:0xffd6bb offset:0x600 grain:32 syndrome:0x0 - area:DRAM err_code:0000:009f socket:0 ha:1 channel_mask:2 rank:5)
[13901059.217841] mce: [Hardware Error]: Machine check events logged
[13903720.090431] {10}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 4
[13903720.090435] {10}[Hardware Error]: It has been corrected by h/w and requires no further action
[13903720.090436] {10}[Hardware Error]: event severity: corrected
[13903720.090438] {10}[Hardware Error]: Error 0, type: corrected
[13903720.090439] {10}[Hardware Error]: fru_text: A8
[13903720.090440] {10}[Hardware Error]: section_type: memory error
[13903720.090441] {10}[Hardware Error]: error_status: 0x0000000000000400
[13903720.090442] {10}[Hardware Error]: physical_address: 0x0000000ffe47b640
[13903720.090445] {10}[Hardware Error]: node: 0 card: 3 module: 1 rank: 1 bank: 2 row: 30705 column: 728
[13903720.090446] {10}[Hardware Error]: error_type: 2, single-bit ECC
[13903720.090456] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[13903720.090458] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 1: 940000000000009f
[13903720.090459] EDAC sbridge MC0: TSC 89b2cfb1432fce
[13903720.090460] EDAC sbridge MC0: ADDR ffe47b640
[13903720.090461] EDAC sbridge MC0: MISC 0
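From searching on Google, it looks like one of our DIMMs is faulty, but we are still not convinced of this.
Any opinions on the kernel messages above?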
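To see whether every corrected error points at the same module, a rough sketch like the following could tally the events per `fru_text` value and per EDAC DIMM label. It assumes the log lines look exactly like the ones quoted above. The names `tally_corrected_errors` and `split_page_offset` are made up for illustration, and the 4 KiB page size is an assumption for x86-64.

```python
import re
import sys
from collections import Counter

# Patterns written against the message layout quoted in this post (an assumption;
# other kernels or EDAC drivers may format these lines differently).
FRU_RE = re.compile(r"\[Hardware Error\]: fru_text: (\S+)")
EDAC_RE = re.compile(r"EDAC MC\d+: \d+ CE .*? on (\S+) \(")

PAGE_SHIFT = 12  # assumed 4 KiB pages


def tally_corrected_errors(dmesg_text):
    """Count APEI fru_text values and EDAC corrected-error DIMM labels."""
    fru_counts = Counter()
    dimm_counts = Counter()
    for line in dmesg_text.splitlines():
        fru_match = FRU_RE.search(line)
        if fru_match:
            fru_counts[fru_match.group(1)] += 1
        edac_match = EDAC_RE.search(line)
        if edac_match:
            dimm_counts[edac_match.group(1)] += 1
    return fru_counts, dimm_counts


def split_page_offset(physical_address):
    """Split a physical address into (page, offset) the way the EDAC line reports it."""
    page = physical_address >> PAGE_SHIFT
    offset = physical_address & ((1 << PAGE_SHIFT) - 1)
    return page, offset


if __name__ == "__main__":
    # Usage: dmesg | python3 this_script.py
    fru_counts, dimm_counts = tally_corrected_errors(sys.stdin.read())
    for label, count in fru_counts.most_common():
        print(f"fru_text {label}: {count}")
    for label, count in dimm_counts.most_common():
        print(f"EDAC {label}: {count}")
```

Applied to the first event above, `split_page_offset(0xffd6bb600)` gives page `0xffd6bb` and offset `0x600`, which matches the `page:` and `offset:` fields in the EDAC line, so the APEI and EDAC messages describe the same address.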
Other details from dmesg
(these relate to the network driver, but may also be related to the DIMM)
[81712386.762144] i40e 0000:82:00.0 p4p1: tx_timeout: VSI_seid: 395, Q 47, NTC: 0x19a, HWB: 0x19a, NTU: 0x182, TAIL: 0x19a, INT: 0x1
[81712386.762145] i40e 0000:82:00.0 p4p1: tx_timeout recovery level 1, hung_queue 47
[89254950.070885] traps: polkitd[111181] general protection ip:7f4d643b8cf2 sp:7fff401879c0 error:0 in libmozjs-17.0.so[7f4d6427a000+3b3000]
[90620196.068233] INFO: task kworker/15:2:76449 blocked for more than 120 seconds.
[90620196.068237] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[90620196.068239] kworker/15:2 D ffff88027c533dd8 0 76449 2 0x00000080
[90620196.068247] ffff88027c533bf0 0000000000000046 ffff8826eff68000 ffff88027c533fd8
[90620196.068249] ffff88027c533fd8 ffff88027c533fd8 ffff8826eff68000 ffff88027c533d58
[90620196.068251] ffff88027c533d60 7fffffffffffffff ffff8826eff68000 ffff88027c533dd8