Writing triggers for mcelog

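I've just started looking into mcelog properly for the first time (I had enabled it before and seen its syslog output, but this is my first attempt at doing anything non-default). I'm looking for information on how to write triggers for it. Specifically, I want to know what kinds of events mcelog can react to, how it decides which scripts to execute, and so on. The most I can gather from the example triggers is that it sets a bunch of environment variables before invoking a script. So does it just try to execute everything in the trigger directory (/etc/mcelog on RHEL) and leave it to each script to decide what to act on?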

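I've seen other trigger scripts whose names look like MCE events. Is that just a convention, or does the name have a special function? I created a trigger called /etc/mcelog/joel.sh that simply sends a basic email to my Gmail account. A few days ago the trigger evidently fired, because I received the email without having run the script by hand. I hadn't thought to pipe the output of env into the mailx command in joel.sh, so I don't know which hardware event triggered the script, or why mcelog chose joel.sh as the script to run for it.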

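Basically, I'm looking for an answer that gives me a basic orientation on mcelog, its trigger system, and how to use it to monitor the health of my hardware. I'm fairly sure that once I have my bearings I can figure out the more advanced stuff.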

Answer 1

Looking at the sample mcelog.conf configuration file, it appears to contain most, if not all, of the trigger types mcelog can handle.

DIMM

[dimm]
#
# execute these triggers when the rate of corrected or uncorrected
# errors per DIMM exceeds the threshold
# Note when the hardware does not report DIMMs this might also
# be per channel
# The default of 10/24h is reasonable for server quality
# DDR3 DIMMs as of 2009/10
#uc-error-trigger = dimm-error-trigger
uc-error-threshold = 1 / 24h
#ce-error-trigger = dimm-error-trigger
ce-error-threshold = 10 / 24h

Socket

[socket]
# Threshold and trigger for uncorrected memory errors on a socket
# mem-uc-error-trigger = socket-memory-error-trigger
mem-uc-error-threshold = 100 / 24h
# Threshold and trigger for corrected memory errors on a socket
mem-ce-error-trigger = socket-memory-error-trigger
mem-ce-error-threshold = 100 / 24h

Cache

[cache]
# Processing of cache error thresholds reported by Intel CPUs
cache-threshold-trigger = cache-error-trigger

[page]
# Memory error accounting per 4K memory page
# Threshold for the corrected memory errors trigger script
memory-ce-threshold = 10 / 24h
# Trigger script for corrected errors
# memory-ce-trigger = page-error-trigger

Triggers

How triggers are run can be controlled in this section.

[trigger]
# Maximum number of running triggers
children-max = 2
# execute triggers in this directory
directory = /etc/mcelog
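
As far as the sample configuration shows, a script is tied to an event class by being named in one of the `*-trigger` keys, and the `directory` setting above is where those names are looked up. A quick way to see which triggers a given configuration actually enables is to pull out the uncommented `*-trigger` lines. The following is a minimal sketch; `list_triggers` is a hypothetical helper and not part of mcelog.

```shell
#!/bin/sh
# Sketch: print "key script-name" for every uncommented *-trigger
# setting in an mcelog.conf-style file.
# list_triggers is a hypothetical helper name, not an mcelog command.
list_triggers() {
    conf="$1"
    # Lines starting with '#' never match, because '#' is not part of
    # the key pattern. Threshold lines do not end in "trigger", so they
    # are skipped as well.
    sed -n 's/^[[:space:]]*\([a-z-]*trigger\)[[:space:]]*=[[:space:]]*\([^[:space:]#]*\).*$/\1 \2/p' "$conf"
}
```

Running `list_triggers /etc/mcelog/mcelog.conf` (the path is an assumption; adjust it to wherever your distribution keeps the file) against the sample settings quoted here would list `mem-ce-error-trigger` and `cache-threshold-trigger`, since the other trigger keys are commented out.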

Example triggers

Some example triggers are available on the mcelog GitHub page.

Example trigger script, dimm-error-trigger:

#!/bin/sh
#  This shell script can be executed by mcelog in daemon mode when a DIMM
#  exceeds a pre-configured error threshold
# 
# environment:
# THRESHOLD     human readable threshold status
# MESSAGE   Human readable consolidated error message
# TOTALCOUNT    total count of errors for current DIMM of CE/UC depending on
#       what triggered the event
# LOCATION  Consolidated location as a single string
# DMI_LOCATION  DIMM location from DMI/SMBIOS if available
# DMI_NAME  DIMM identifier from DMI/SMBIOS if available
# DIMM      DIMM number reported by hardware
# CHANNEL   Channel number reported by hardware
# SOCKETID  Socket ID of CPU that includes the memory controller with the DIMM
# CECOUNT   Total corrected error count for DIMM
# UCCOUNT   Total uncorrected error count for DIMM
# LASTEVENT Time stamp of event that triggered threshold (in time_t format, seconds)
# THRESHOLD_COUNT Total number of events in current threshold time period of specific type
#
# note: will run as mcelog configured user
# this can be changed in mcelog.conf

logger -s -p daemon.err -t mcelog "$MESSAGE"
logger -s -p daemon.err -t mcelog "Location: $LOCATION"

[ -x ./dimm-error-trigger.local ] && . ./dimm-error-trigger.local

exit 0
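
To find out afterwards why a custom trigger such as joel.sh fired, the script can record the variables that mcelog exported before doing anything else. The sketch below logs the variables documented in the header of the sample script above. The `log_trigger_env` function and the `LOGFILE` location are assumptions made for illustration, not mcelog defaults.

```shell
#!/bin/sh
# Hypothetical debugging trigger: append the variables mcelog exported
# to a log file so the cause of a trigger run can be inspected later.
# LOGFILE is an assumed location, not an mcelog default.
LOGFILE="${LOGFILE:-/tmp/mcelog-trigger-env.log}"

log_trigger_env() {
    {
        echo "=== trigger run at $(date) ==="
        # Variable names taken from the sample dimm-error-trigger header
        for name in THRESHOLD MESSAGE TOTALCOUNT LOCATION DMI_LOCATION \
                    DMI_NAME DIMM CHANNEL SOCKETID CECOUNT UCCOUNT \
                    LASTEVENT THRESHOLD_COUNT; do
            # Look up the variable whose name is stored in $name;
            # unset variables are logged as empty
            eval "value=\${$name-}"
            echo "$name=$value"
        done
    } >> "$LOGFILE"
}

log_trigger_env
```

Other trigger types may export a different set of variables, so dumping the full output of `env` into the same file is a reasonable alternative while exploring.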
