/var/log/syslog contains:
Jul 31 13:45:01 ray-desktop CRON[5667]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 31 13:45:50 ray-desktop org.gnome.Shell.desktop[1689]: [2036:2054:0731/134550.778035:ERROR:socket_stream.cc(219)] Closing stream with result -2
Jul 31 13:47:51 ray-desktop rasdaemon[695]: <...>-35 [-41071872] 0.001327: mce_record: 2019-07-31 12:27:04 -0400 bank=8, status= 8c2001000001110b, corrected filtering (some unreported errors in same region) Generic CACHE Level-3 Generic Error, mci=Corrected_error Threshold based error status: green, mca=corrected filtering (some unreported errors in same region) Generic CACHE Level-3 Generic Error Large number of corrected cache errors. System operating, but might leadto uncorrected errors soon, cpu_type= Intel generic architectural MCA, cpu= 0, socketid= 0, misc= 31c0, addr= 2cee80000075b7d, mcgstatus=0, mcgcap= c09, apicid= 0
Jul 31 13:47:51 ray-desktop kernel: [18114.699831] mce: [Hardware Error]: Machine check events logged
Jul 31 13:47:51 ray-desktop rasdaemon[695]: cpu 00:rasdaemon: mce_record store: 0x556ec46df398
Jul 31 13:47:51 ray-desktop rasdaemon[695]: rasdaemon: register inserted at db
Jul 31 13:48:22 ray-desktop kernel: [18145.544922] perf: interrupt took too long (5187 > 5062), lowering kernel.perf_event_max_sample_rate to 38500
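This is followed immediately by the reboot log entries at 13:55:53.
As far as I know, "mce" logging has been superseded by "rasdaemon"; both appear in the log above.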
$ find /sys/kernel/debug/tracing -type f \! -empty
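finds nothing.
There are more than 22,000 files in that directory, all of them empty and all created at reboot.
Is this where rasdaemon keeps its information? If so, what use is it if everything is reset to zero after a reboot?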
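Answer 1
Everything under /sys is generally a virtual filesystem provided by the kernel, and /sys/kernel/debug/tracing in particular is the kernel's tracing interface. It has nothing to do with where rasdaemon stores its records.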
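A likely reason the find command in the question printed nothing: files on a kernel virtual filesystem generally report a size of 0 in their metadata even though reading them returns data, and find's -empty test looks only at the reported size. Here is a minimal sketch for comparing the two numbers. The helper name size_vs_content is made up for illustration, and stat -c is the GNU coreutils form.

```shell
#!/bin/sh
# size_vs_content FILE
# Prints the size recorded in the file's metadata next to the number of
# bytes actually returned by reading it. (Hypothetical helper name.)
size_vs_content() {
    reported=$(stat -c %s "$1") || return 1        # GNU stat: size from metadata
    actual=$(cat -- "$1" | wc -c | tr -d ' ')      # bytes obtained by really reading
    echo "reported=$reported read=$actual"
}

# Example (as root, since debugfs is normally readable by root only):
#   size_vs_content /sys/kernel/debug/tracing/available_events
```

If the two numbers differ for a file under /sys/kernel/debug/tracing, the file is not empty in any useful sense, and the creation time simply reflects when the kernel set up the filesystem at boot.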
If rasdaemon is started with the -r / --record option, it stores events in a Sqlite3 database, which on my system is located at /var/lib/rasdaemon/ras-mc_event.db. That database can be inspected with ras-mc-ctl --errors.
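Since persistence depends on that option, it can help to check how the daemon was actually started. This is a rough sketch: the function name records_to_db is hypothetical, and accepting bundled short options such as -fr is my assumption about how the arguments are parsed.

```shell
#!/bin/sh
# records_to_db ARG...
# Succeeds if the given rasdaemon arguments include -r or --record.
# Also accepts r bundled with other short flags (e.g. -fr); that bundling
# is an assumption, not something confirmed in this thread.
records_to_db() {
    for arg in "$@"; do
        case "$arg" in
            --record) return 0 ;;
            --*)      ;;              # some other long option, ignore
            -*r*)     return 0 ;;     # -r alone or bundled with other short flags
        esac
    done
    return 1
}

# Example: feed it the running daemon's command line
# (pgrep -a prints the PID followed by the full command line).
#   records_to_db $(pgrep -a rasdaemon | head -n 1 | cut -d' ' -f2-) \
#       && echo "events are being recorded" \
#       || echo "no -r/--record: events only go to the log"
```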
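Answer 2
- rasdaemon's logs are reported through syslog/journald.
The rasdaemon program is a daemon that monitors platform Reliability, Availability and Serviceability (RAS) reports from Linux kernel trace events. These trace events are logged in /sys/kernel/debug/tracing, and rasdaemon reports them via syslog/journald.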
https://github.com/mchehab/rasdaemon/blob/master/man/rasdaemon.1.in
You can get the logs with journalctl.
# journalctl | tail -n 100
Jul 12 20:27:24 localhost.localdomain rasdaemon[39806]: <idle>-0 [-85410864] 0.000960: mc_event: 2023-07-12 20:24:45 +0800 1 Corrected error: single-symbol chipkill ECC on unknown memory (mc: 0 address: 0x400abb3a400 grain: 0 APEI location: node:0 card:5 module:0 rank:1 bank_group:0 bank_address:3 device:0 row:174 column:1280 chip_id:0 status(0x0000000000000400): Storage error in DRAM memory)
Jul 12 20:27:24 localhost.localdomain rasdaemon[39806]: cpu 19:rasdaemon: mc_event store: 0xaaaab9491ff8
Jul 12 20:27:24 localhost.localdomain rasdaemon[39806]: rasdaemon: register inserted at db
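To get a quick number out of those logs rather than reading them line by line, something like the following may be enough. It is only a sketch: count_corrected is a made-up name, and the pattern covers just the two spellings that appear in the logs quoted in this thread ("Corrected error" and "Corrected_error").

```shell
#!/bin/sh
# count_corrected: reads log lines on stdin and prints how many rasdaemon
# lines mention a corrected error. The pattern is based only on the log
# samples quoted in this thread, so other message formats may be missed.
count_corrected() {
    grep 'rasdaemon\[' | grep -c -E 'Corrected[ _]error'
}

# Example: -t selects journal entries by syslog identifier.
#   journalctl -t rasdaemon --no-pager | count_corrected
```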
- RAS events are tracepoints emitted by the kernel, and you can monitor them yourself through debugfs.
# ls /sys/kernel/debug/tracing/events/ras/mc_event/
enable filter format hist id trigger
# cat /sys/kernel/debug/tracing/events/ras/mc_event/id
1188
# cd /sys/kernel/debug/tracing/events/ras/mc_event/
# echo 1 > enable
# cd /sys/kernel/debug/tracing/
# cat trace_pipe
<idle>-0 [074] dnh. 7251.551618: mc_event: 1 Corrected error: Single-symbol ChipKill ECC on unknown memory (mc:0 location:-1:-1:-1 address:0x40098e20900 grain:1 syndrome:0x00000000 APEI location: node:0 card:4 module:0 rank:1 bank_group:2 bank_address:0 row:99 col:64 chipID: 0 status(0x0000000000000400): Storage error in DRAM memory)
# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 1/1   #P:128
#
#                              _-----=> irqs-off
#                             / _----=> need-resched
#                            | / _---=> hardirq/softirq
#                            || / _--=> preempt-depth
#                            ||| /     delay
#           TASK-PID   CPU#  ||||    TIMESTAMP  FUNCTION
#              | |       |   ||||       |         |
<idle>-0 [075] d.h. 7323.829675: mc_event: 1 Corrected error: Single-symbol ChipKill ECC on unknown memory (mc:0 location:-1:-1:-1 address:0x40098e20900 grain:1 syndrome:0x00000000 APEI location: node:0 card:4 module:0 rank:1 bank_group:2 bank_address:0 row:99 col:64 chipID: 0 status(0x0000000000000400): Storage error in DRAM memory)
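The manual steps above can be wrapped in a small helper so the event is also switched off again afterwards. This is a sketch under assumptions: set_mc_event_tracing is a hypothetical name, and the optional second argument exists only so the function can be pointed at a different tracing directory.

```shell
#!/bin/sh
# set_mc_event_tracing STATE [TRACING_DIR]
# Writes STATE (0 or 1) to the ras/mc_event "enable" file.
# TRACING_DIR defaults to the debugfs path used above.
set_mc_event_tracing() {
    state=$1
    dir=${2:-/sys/kernel/debug/tracing}
    case "$state" in
        0|1) ;;
        *) echo "state must be 0 or 1" >&2; return 2 ;;
    esac
    printf '%s\n' "$state" > "$dir/events/ras/mc_event/enable"
}

# Example (as root): enable, stream events until Ctrl-C, then disable.
#   set_mc_event_tracing 1
#   cat /sys/kernel/debug/tracing/trace_pipe
#   set_mc_event_tracing 0
```

Turning the event off again is probably only sensible if it was off before you started, since another consumer may depend on it staying enabled.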
- The tracepoints are monitored by rasdaemon and, if it is started with the -r/--record option, are ultimately persisted in a Sqlite3 database.
# systemctl status rasdaemon.service
● rasdaemon.service - RAS daemon to log the RAS events
Loaded: loaded (/usr/lib/systemd/system/rasdaemon.service; disabled; vendor preset: disabled)
Active: active (running) since Wed 2023-07-12 15:40:42 CST; 3s ago
Process: 40597 ExecStartPost=/usr/sbin/rasdaemon --enable (code=exited, status=0/SUCCESS)
Main PID: 40596 (rasdaemon)
Tasks: 1
Memory: 440.0K
CGroup: /system.slice/rasdaemon.service
└─40596 /usr/sbin/rasdaemon -f -r
# ras-mc-ctl --errors
Memory controller events:
1 2023-07-12 15:42:21 +0800 1 Info error(s): memory read error at CPU_SrcID#0_MC#0_Chan#0_DIMM#0 location: 0:0:0:-1, xxxx
No Extlog errors.
PCIe AER events:
1 2023-07-12 17:00:56 +0800 Corrected error: Data Link Protocol
MCE events:
1 2023-07-12 15:42:21 +0800 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg xxxx
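If the report gets long, a per-section count can be easier to scan. The sketch below assumes the layout shown above (a header line ending in "events:" followed by entries that start with a number), which may differ between rasdaemon versions. The function name summarize_ras_errors is made up.

```shell
#!/bin/sh
# summarize_ras_errors: reads `ras-mc-ctl --errors` output on stdin and
# prints one "section: count" line per section, in first-seen order.
# Layout assumptions are taken only from the sample output in this answer.
summarize_ras_errors() {
    awk '
        /events:[ \t]*$/ {
            sub(/:[ \t]*$/, "")          # drop the trailing colon
            section = $0
            order[++n] = section         # remember the order of appearance
            count[section] = 0
            next
        }
        # An entry belongs to the current section if it starts with a number.
        section != "" && /^[ \t]*[0-9]+[ \t]/ { count[section]++ }
        # A line such as "No Extlog errors." closes the current section.
        /^No .* errors\./ { section = "" }
        END { for (i = 1; i <= n; i++) print order[i] ": " count[order[i]] }
    '
}

# Example:
#   ras-mc-ctl --errors | summarize_ras_errors
```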