Why does the kernel panic when panic_on_warn == 0?

My OS hit a kernel panic (it looks like it then triggered a second kernel dump, kdump?):

[   124.674715] core: Uncorrected hardware memory error in user-access at xxxxxxx
[   124.684140] BUG: scheduling while atomic: einj_mem_uc/5151/0xxxxxxxxx
[   124.684310] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
r = 0xxxxxxxxxxx[   124.691839] Memory failure: 0x25eae3: Killing einj_mem_uc:6161 due to hardware memory corruption
[   124.700827] {1}[Hardware Error]: event severity: recoverable
[   124.700828] {1}[Hardware Error]:  Error 0, type: recoverable
00 paddr = xxxxx[   124.700829] {1}[Hardware Error]:  fru_text: Card01, ChnE, DIMM0
[   124.700830] {1}[Hardware Error]:   section_type: memory error
[   124.700835] {1}[Hardware Error]:   error_status: 0x0000000000000400
[   124.712309] Memory failure: 0x25eae3: recovery action for dirty LRU page: Recovered
[   124.718713] {1}[Hardware Error]:   physical_address: 0x000000015ace3400
[   124.718715] {1}[Hardware Error]:   node: 0 card: 4 module: 0 rank: 0 bank: 21 device: 0 row: 10455 column: 1408 
[   124.718716] {1}[Hardware Error]:   error_type: 4, single-symbol chipkill ECC
[   124.718718] {1}[Hardware Error]:   DIMM location: _Node0_Channel4_Dimm0 CPU0_E0 
[   124.791089] Memory failure: 0x25eae3: already hardware poisoned
3 116
400
[    0.000000] Linux version 4.18.0-348.el8.x86_64 

I checked the source code:

https://elixir.bootlin.com/linux/v4.18/source/kernel/sched/core.c#L3287

According to that code, the OS should panic only when panic_on_warn == 1, but when I checked my system:

sudo sysctl -a | grep -i panic_on
...
kernel.panic_on_warn = 0
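
panic_on_warn is only one of several panic-related settings, so it can help to look at all of them together. Below is a minimal sketch, assuming a Linux system that exposes sysctls under /proc/sys; the function name read_panic_sysctls is made up for illustration. It does not hard-code any sysctl names and simply collects every entry whose file name contains "panic".

```python
from pathlib import Path


def read_panic_sysctls(root="/proc/sys"):
    """Return {dotted_name: value} for every readable entry under `root`
    whose file name contains 'panic' (hypothetical helper, for illustration)."""
    settings = {}
    base = Path(root)
    for path in sorted(base.rglob("*panic*")):
        if not path.is_file():
            continue
        try:
            value = path.read_text().strip()
        except OSError:
            # Some entries may not be readable by the current user; skip them.
            continue
        # /proc/sys/kernel/panic_on_warn -> kernel.panic_on_warn
        name = ".".join(path.relative_to(base).parts)
        settings[name] = value
    return settings


if __name__ == "__main__":
    for name, value in read_panic_sysctls().items():
        print(f"{name} = {value}")
```

This is roughly equivalent to the sysctl command above, but with a case-sensitive match on "panic" rather than "panic_on", so it also picks up entries the grep would miss.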

Answer 1

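OK, just to confirm my comment, and thanks for the additional information you provided:

The kernel does not panic because of the "BUG: scheduling while atomic" message. As expected with kernel.panic_on_warn = 0, that message alone is not a valid reason to panic. The far more likely cause is the repeated hardware memory failure detected by the MCE interrupt handler, which probably leads to some fatal problem inside that handler.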
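
To make the "repeated hardware memory failure" visible in a log like the one in the question, the "Memory failure" lines can be grouped by the hexadecimal value printed after the prefix (assumed here to be the page frame number). This is only a rough sketch: the function names and the regular expression are my own, and the sample lines are copied from the question's log.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [  124.712309] Memory failure: 0x25eae3: recovery action for dirty LRU page: Recovered
MEMORY_FAILURE_RE = re.compile(r"Memory failure: (0x[0-9a-fA-F]+): (.*)")

# Lines taken from the log in the question (timestamps kept, garbled prefix dropped).
SAMPLE_LOG = """\
[   124.691839] Memory failure: 0x25eae3: Killing einj_mem_uc:6161 due to hardware memory corruption
[   124.712309] Memory failure: 0x25eae3: recovery action for dirty LRU page: Recovered
[   124.791089] Memory failure: 0x25eae3: already hardware poisoned
"""


def summarize_memory_failures(log_text):
    """Group 'Memory failure' messages by the page value printed in each line."""
    events = defaultdict(list)
    for line in log_text.splitlines():
        match = MEMORY_FAILURE_RE.search(line)
        if match:
            page, message = match.groups()
            events[page].append(message.strip())
    return dict(events)


def repeated_failures(events):
    """Return pages that were reported again after already being poisoned."""
    return [page for page, messages in events.items()
            if any("already hardware poisoned" in m for m in messages)]


if __name__ == "__main__":
    events = summarize_memory_failures(SAMPLE_LOG)
    for page, messages in events.items():
        print(page, len(messages), "event(s)")
    print("reported again after poisoning:", repeated_failures(events))
```

On the sample lines, the same page shows up three times and ends with "already hardware poisoned", meaning the error was reported again on a page that had already been handled. That fits the explanation above, but it does not by itself prove what caused the panic.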
