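I need to run a script whenever the system reports an EDAC error.
To do this I created the udev rule below: if ce_count changes, I want /var/tmp/test.sh to be executed.
I then ran udevadm control --reload-rules && udevadm trigger, started udevadm monitor, and injected an error with mce-inject, but the script was not executed.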
# cat /etc/udev/rules.d/98-edac.rules
ACTION=="change", ATTR{ce_count}, KERNEL=="mc0", RUN+="/var/tmp/test.sh"
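To tell "the rule never fired" apart from "the script ran but did nothing visible", it can help to have the script leave a trace. The following is only a sketch of what /var/tmp/test.sh might contain. The log path /var/tmp/test.log, the EDAC_LOG variable and the log_event function are hypothetical names, not part of the original setup. It relies on udev exporting event properties such as ACTION and DEVPATH into the environment of RUN programs.

```shell
#!/bin/sh
# Hypothetical body for /var/tmp/test.sh (assumption: not the original script).
# Appends one line per invocation so a later check can show whether udev ran it.

log_event() {
    # $1 = log file to append to
    log_file=$1
    # ACTION and DEVPATH are set by udev for RUN programs; "unset" is printed
    # when the script is started by hand instead.
    printf '%s action=%s devpath=%s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" \
        "${ACTION:-unset}" \
        "${DEVPATH:-unset}" >> "$log_file"
}

# EDAC_LOG is a hypothetical override; the default path is an assumption.
log_event "${EDAC_LOG:-/var/tmp/test.log}"
```

The script has to be executable (chmod +x /var/tmp/test.sh) for RUN+= to start it. If the log stays empty after an injected error, the rule most likely never matched an event.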
# udevadm info -ap /sys/devices/system/edac/mc/mc0
Udevadm info starts with the device specified by the devpath and then
walks up the chain of parent devices. It prints for every device
found, all possible attributes in the udev rules key format.
A rule to match, can be composed by the attributes of the device
and the attributes from one single parent device.
looking at device '/devices/system/edac/mc/mc0':
KERNEL=="mc0"
SUBSYSTEM=="mc0"
DRIVER==""
ATTR{ce_count}=="21"
ATTR{ce_noinfo_count}=="0"
ATTR{max_location}=="channel 7 slot 2 "
ATTR{mc_name}=="Broadwell Socket#0"
ATTR{seconds_since_reset}=="5223"
ATTR{size_mb}=="65536"
ATTR{ue_count}=="0"
ATTR{ue_noinfo_count}=="0"
looking at parent device '/devices/system/edac/mc':
KERNELS=="mc"
SUBSYSTEMS=="edac"
DRIVERS==""
looking at parent device '/devices/system/edac':
KERNELS=="edac"
SUBSYSTEMS==""
DRIVERS==""
I inject the EDAC/MCE error with mce-inject:
./mce-inject ./basic-inject.txt
# cat basic-inject.txt
CPU 0 BANK 8
STATUS corrected
ADDR 0x12345125
MCGCAP 0x7000c16
APICID 0
MCGSTATUS 0
SOCKETID 0
MISC 0x50683286
STATUS 0x8c00004000010090
After injecting the error, the kernel syslog/dmesg shows these log entries:
[ +4.436747] Starting machine check poll CPU 0
[ +0.000013] mce: [Hardware Error]: Machine check events logged
[ +0.000008] Machine check poll done on CPU 0
[ +0.000030] EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
[ +0.000002] EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 8: 8c00004000010090
[ +0.000001] EDAC sbridge MC0: TSC 0
[ +0.000002] EDAC sbridge MC0: ADDR 12345100
[ +0.000000] EDAC sbridge MC0: MISC 50683286
[ +0.000002] EDAC sbridge MC0: PROCESSOR 0:406f1 TIME 1593625089 SOCKET 0 APIC 0
[ +0.000005] EDAC DEBUG: get_memory_error_data: SAD interleave package: 0 = CPU socket 0, HA 0, shiftup: 1
[ +0.000005] EDAC DEBUG: get_memory_error_data: TAD#0: address 0x0000000012345100 < 0x000000007fffffff, socket interleave 0, channel interleave 2 (offset 0x00000000), index 0, base ch: 2, ch mask: 0x04
[ +0.000007] EDAC DEBUG: get_memory_error_data: RIR#0, limit: 31.999 GB (0x00000007ffffffff), way: 4
[ +0.000002] EDAC DEBUG: get_memory_error_data: RIR#0: channel address 0x091a2880 < 0x7ffffffff, RIR interleave 2, index 1
[ +0.000002] EDAC DEBUG: sbridge_mce_output_error: area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:4 rank:4
[ +0.000007] EDAC MC0: 1 CE memory read error on CPU_SrcID#0_Ha#0_Chan#2_DIMM#1 (channel:2 slot:1 page:0x12345 offset:0x100 grain:32 syndrome:0x0 - area:DRAM err_code:0001:0090 socket:0 ha:0 channel_mask:4 rank:4)
[Jul 1 17:41] perf: interrupt took too long (3923 > 3920), lowering kernel.perf_event_max_sample_rate to 50000
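udev runs rules only in response to uevents, and a counter changing in sysfs does not by itself have to produce one. If udevadm monitor prints nothing when the error is injected, the rule has no event to match. As a fallback under that assumption, the counter can be polled directly. The sketch below is one possible approach, not a confirmed fix. The check_counter function and the state file /var/tmp/ce_count.last are hypothetical names, and the sysfs path is taken from the udevadm info output above.

```shell
#!/bin/sh
# Hypothetical polling helper (assumption: no uevent is emitted for ce_count).

check_counter() {
    # $1 = counter file (e.g. the sysfs ce_count attribute)
    # $2 = state file holding the value seen on the previous call
    # Prints "init" on the first call, then "changed" or "same".
    counter_file=$1
    state_file=$2
    current=$(cat "$counter_file") || return 1
    if [ -f "$state_file" ]; then
        previous=$(cat "$state_file")
    else
        previous=""
    fi
    # Remember the current value for the next call.
    printf '%s\n' "$current" > "$state_file"
    if [ -z "$previous" ]; then
        echo "init"
    elif [ "$current" != "$previous" ]; then
        echo "changed"
    else
        echo "same"
    fi
}

# Example loop (left commented out because it never terminates):
# while sleep 10; do
#     [ "$(check_counter /sys/devices/system/edac/mc/mc0/ce_count \
#          /var/tmp/ce_count.last)" = "changed" ] && /var/tmp/test.sh
# done
```

The same function could be called from a cron job or a systemd timer instead of a loop; the polling interval of 10 seconds is arbitrary.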