My operating system hit a kernel panic (it looks like it triggered another kernel dump, kdump?):
[ 124.674715] core: Uncorrected hardware memory error in user-access at xxxxxxx
[ 124.684140] BUG: scheduling while atomic: einj_mem_uc/5151/0xxxxxxxxx
[ 124.684310] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[ 124.691839] Memory failure: 0x25eae3: Killing einj_mem_uc:6161 due to hardware memory corruption
[ 124.700827] {1}[Hardware Error]: event severity: recoverable
[ 124.700828] {1}[Hardware Error]: Error 0, type: recoverable
[ 124.700829] {1}[Hardware Error]: fru_text: Card01, ChnE, DIMM0
[ 124.700830] {1}[Hardware Error]: section_type: memory error
[ 124.700835] {1}[Hardware Error]: error_status: 0x0000000000000400
[ 124.712309] Memory failure: 0x25eae3: recovery action for dirty LRU page: Recovered
[ 124.718713] {1}[Hardware Error]: physical_address: 0x000000015ace3400
[ 124.718715] {1}[Hardware Error]: node: 0 card: 4 module: 0 rank: 0 bank: 21 device: 0 row: 10455 column: 1408
[ 124.718716] {1}[Hardware Error]: error_type: 4, single-symbol chipkill ECC
[ 124.718718] {1}[Hardware Error]: DIMM location: _Node0_Channel4_Dimm0 CPU0_E0
[ 124.791089] Memory failure: 0x25eae3: already hardware poisoned
3 116
400
[ 0.000000] Linux version 4.18.0-348.el8.x86_64
I checked the source code:
https://elixir.bootlin.com/linux/v4.18/source/kernel/sched/core.c#L3287
The OS should only panic here when panic_on_warn == 1, but I checked the setting on my system:
sudo sysctl -a | grep -i panic_on
...
kernel.panic_on_warn = 0
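
Note that the `grep -i panic_on` filter only matches knobs with `panic_on` in their name. The following is a minimal sketch, not a definitive checklist, that prints a few other panic-related settings by reading them from procfs. The function name `show_panic_knobs` and its optional root-directory argument are hypothetical and exist only for illustration; the knob list is an assumption about what is worth checking. `vm.memory_failure_recovery` is included because a value of 0 makes the kernel panic on any memory failure instead of attempting recovery.

```shell
#!/bin/sh
# Print a handful of panic-related sysctl knobs.
# $1 (optional, hypothetical): root directory to read from; defaults to /proc/sys.
show_panic_knobs() {
    root="${1:-/proc/sys}"
    for knob in kernel/panic kernel/panic_on_warn kernel/panic_on_oops \
                kernel/panic_on_unrecovered_nmi vm/memory_failure_recovery; do
        # Convert the path to the dotted sysctl name, e.g. kernel.panic_on_warn
        name=$(printf '%s' "$knob" | tr '/' '.')
        if [ -r "$root/$knob" ]; then
            printf '%s = %s\n' "$name" "$(cat "$root/$knob")"
        else
            # The knob may be missing if the kernel was built without the feature
            printf '%s = <not available>\n' "$name"
        fi
    done
}
```

Calling `show_panic_knobs` with no argument reads the live values of the running kernel.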
Answer 1
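OK, just to confirm my comment, and thanks for the additional information you provided:
The kernel does not panic because of BUG: scheduling while atomic (as expected with kernel.panic_on_warn = 0, this is not a valid reason to panic). Much more obviously, it panics because of the recurring hardware memory failures detected by the MCE interrupt handler, which are probably at the root of some fatal problem in that handler.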
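
To see how often the same page is reported, one option is to count the "Memory failure:" messages per page frame number. This is a minimal sketch that assumes the message format shown in the log above; the function name `count_memory_failures` is hypothetical.

```shell
#!/bin/sh
# Count "Memory failure:" kernel messages per page frame number (PFN).
# Reads kernel log text on stdin, e.g. from dmesg or a saved console log.
count_memory_failures() {
    # Keep only the "Memory failure: 0x<pfn>" part of each matching line,
    # then tally by the third field (the PFN) and print "pfn count".
    grep -o 'Memory failure: 0x[0-9a-f]*' \
        | awk '{ count[$3]++ } END { for (pfn in count) print pfn, count[pfn] }' \
        | sort
}
```

Usage would be something like `dmesg | count_memory_failures`. For the log in the question, the three "Memory failure:" lines all name the same page, so the output is `0x25eae3 3`. The last of those lines, "already hardware poisoned", shows that the page was reported again after it had already been marked as poisoned.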