Where does rasdaemon record its logs?

/var/log/syslog contains:

Jul 31 13:45:01 ray-desktop CRON[5667]: (root) CMD (command -v debian-sa1 > /dev/null && debian-sa1 1 1)
Jul 31 13:45:50 ray-desktop org.gnome.Shell.desktop[1689]: [2036:2054:0731/134550.778035:ERROR:socket_stream.cc(219)] Closing stream with result -2
Jul 31 13:47:51 ray-desktop rasdaemon[695]:            <...>-35    [-41071872]     0.001327: mce_record:           2019-07-31 12:27:04 -0400 bank=8, status= 8c2001000001110b, corrected filtering (some unreported errors in same region) Generic CACHE Level-3 Generic Error, mci=Corrected_error Threshold based error status: green, mca=corrected filtering (some unreported errors in same region) Generic CACHE Level-3 Generic Error Large number of corrected cache errors. System operating, but might leadto uncorrected errors soon, cpu_type= Intel generic architectural MCA, cpu= 0, socketid= 0, misc= 31c0, addr= 2cee80000075b7d, mcgstatus=0, mcgcap= c09, apicid= 0
Jul 31 13:47:51 ray-desktop kernel: [18114.699831] mce: [Hardware Error]: Machine check events logged
Jul 31 13:47:51 ray-desktop rasdaemon[695]: cpu 00:rasdaemon: mce_record store: 0x556ec46df398
Jul 31 13:47:51 ray-desktop rasdaemon[695]: rasdaemon: register inserted at db
Jul 31 13:48:22 ray-desktop kernel: [18145.544922] perf: interrupt took too long (5187 > 5062), lowering kernel.perf_event_max_sample_rate to 38500

This is followed by the reboot logs at 13:55:53.

As far as I know, "mce" logging has been superseded by "rasdaemon", and both appear in the log above.

$ find /sys/kernel/debug/tracing  -type f  \! -empty

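This finds nothing.

That directory contains more than 22,000 files, all of them empty and all created at reboot.

Is this where rasdaemon keeps its information? If so, what use is it if everything is reset to zero after a reboot?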

Answer 1

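Everything under /sys is generally a virtual filesystem provided by the kernel; /sys/kernel/debug/tracing in particular is the kernel's tracing interface. It is not where rasdaemon stores its records.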
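
A likely reason the find command in the question printed nothing: files in kernel virtual filesystems usually report a size of 0, and find's -empty test looks at that reported size, not at what a read returns. A minimal sketch that checks by actually reading one byte (has_content is a hypothetical helper name, not part of any tool):

```shell
# has_content FILE: succeed if reading FILE yields at least one byte,
# regardless of the size that stat reports (virtual files usually report 0).
has_content() {
    [ -r "$1" ] || return 1
    n=$(head -c 1 "$1" 2>/dev/null | wc -c)
    [ "$((n))" -gt 0 ]
}

# Example (usually needs root; the directory is the one from the question):
#   has_content /sys/kernel/debug/tracing/available_events && echo "has data"
```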

If rasdaemon is started with the -r/--record argument, it stores events in a Sqlite3 database, which on my system is located at /var/lib/rasdaemon/ras-mc_event.db. That database can be inspected with ras-mc-ctl --errors.
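
The database can also be opened directly. The following is only a sketch, assuming the sqlite3 command-line shell is installed and that the database lives at the path quoted in this answer (it may differ on other distributions). ras_db_summary is a hypothetical helper; it reads the table names from sqlite_master instead of assuming a particular schema.

```shell
# ras_db_summary [DB]: print "<table> <row count>" for every table in DB.
# The default path is the one from this answer; pass another if yours differs.
ras_db_summary() {
    db=${1:-/var/lib/rasdaemon/ras-mc_event.db}
    [ -r "$db" ] || { echo "cannot read $db" >&2; return 1; }
    sqlite3 -readonly "$db" \
        "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name;" |
    while IFS= read -r table; do
        # Quote the table name so unusual names do not break the query.
        count=$(sqlite3 -readonly "$db" "SELECT COUNT(*) FROM \"$table\";")
        printf '%s %s\n' "$table" "$count"
    done
}
```

Because the database is a regular file on disk, its contents survive a reboot, unlike the trace buffers under /sys.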

Answer 2

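  1. rasdaemon's logs are reported through syslog/journald.

The rasdaemon program is a daemon that monitors platform Reliability, Availability and Serviceability (RAS) reports from Linux kernel trace events. These trace events are logged in /sys/kernel/debug/tracing, and rasdaemon reports them via syslog/journald.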

https://github.com/mchehab/rasdaemon/blob/master/man/rasdaemon.1.in

You can retrieve the logs with journalctl.

# journalctl | tail -n 100
Jul 12 20:27:24 localhost.localdomain rasdaemon[39806]: <idle>-0     [-85410864]     0.000960: mc_event:             2023-07-12 20:24:45 +0800 1 Corrected error: single-symbol chipkill ECC on unknown memory (mc: 0 address: 0x400abb3a400 grain: 0 APEI location: node:0 card:5 module:0 rank:1 bank_group:0 bank_address:3 device:0 row:174 column:1280 chip_id:0 status(0x0000000000000400): Storage error in DRAM memory)

Jul 12 20:27:24 localhost.localdomain rasdaemon[39806]: cpu 19:rasdaemon: mc_event store: 0xaaaab9491ff8
Jul 12 20:27:24 localhost.localdomain rasdaemon[39806]: rasdaemon: register inserted at db
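
Rather than tailing the whole journal, the rasdaemon messages can be selected on their own. journalctl's -t option filters by syslog identifier and -b -1 selects the previous boot, which only helps if the journal is persistent. On a system with a plain-text syslog, as in the question, grep does the same job. A sketch, with ras_syslog_lines as a hypothetical helper name:

```shell
# Journal-based systems: only rasdaemon's messages from the previous boot.
# (Requires a persistent journal; otherwise only the current boot is kept.)
#   journalctl -t rasdaemon -b -1 --no-pager

# ras_syslog_lines [FILE]: print only the lines logged by rasdaemon from a
# plain-text syslog file (defaults to the path used in the question).
ras_syslog_lines() {
    grep -E ' rasdaemon\[[0-9]+\]: ' "${1:-/var/log/syslog}"
}
```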
  2. RAS events are tracepoints emitted by the kernel, and you can monitor them yourself through debugfs.
# ls /sys/kernel/debug/tracing/events/ras/mc_event/
enable  filter  format  hist  id  trigger

# cat /sys/kernel/debug/tracing/events/ras/mc_event/id
1188

# cd /sys/kernel/debug/tracing/events/ras/mc_event/
# echo 1 > enable

# cd /sys/kernel/debug/tracing/
# cat trace_pipe
          <idle>-0       [074] dnh.  7251.551618: mc_event: 1 Corrected error: Single-symbol ChipKill ECC on unknown memory (mc:0 location:-1:-1:-1 address:0x40098e20900 grain:1 syndrome:0x00000000 APEI location: node:0 card:4 module:0 rank:1 bank_group:2 bank_address:0 row:99 col:64 chipID: 0 status(0x0000000000000400): Storage error in DRAM memory)

# cat trace
# tracer: nop
#
# entries-in-buffer/entries-written: 1/1   #P:128
#
#                                _-----=> irqs-off
#                               / _----=> need-resched
#                              | / _---=> hardirq/softirq
#                              || / _--=> preempt-depth
#                              ||| /     delay
#           TASK-PID     CPU#  ||||   TIMESTAMP  FUNCTION
#              | |         |   ||||      |         |
          <idle>-0       [075] d.h.  7323.829675: mc_event: 1 Corrected error: Single-symbol ChipKill ECC on unknown memory (mc:0 location:-1:-1:-1 address:0x40098e20900 grain:1 syndrome:0x00000000 APEI location: node:0 card:4 module:0 rank:1 bank_group:2 bank_address:0 row:99 col:64 chipID: 0 status(0x0000000000000400): Storage error in DRAM memory)
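
To see at a glance which RAS tracepoints are switched on, the enable file of each event can be read in a loop. This is a sketch that assumes debugfs is mounted at the path used above and that it is run as root; ras_event_status is a hypothetical helper name.

```shell
# ras_event_status [TRACING_DIR]: print "<event> <0|1>" for each event found
# under events/ras. TRACING_DIR defaults to the debugfs path used above.
ras_event_status() {
    dir=${1:-/sys/kernel/debug/tracing}/events/ras
    for f in "$dir"/*/enable; do
        # Skip unreadable entries (and the unexpanded glob if nothing matched).
        [ -r "$f" ] || continue
        event=$(basename "$(dirname "$f")")
        printf '%s %s\n' "$event" "$(cat "$f")"
    done
}
```
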
  3. The tracepoints are monitored by rasdaemon and, if it is started with the -r/--record argument, are ultimately persisted in a Sqlite3 database.
# systemctl status rasdaemon.service
● rasdaemon.service - RAS daemon to log the RAS events
   Loaded: loaded (/usr/lib/systemd/system/rasdaemon.service; disabled; vendor preset: disabled)
   Active: active (running) since Wed 2023-07-12 15:40:42 CST; 3s ago
  Process: 40597 ExecStartPost=/usr/sbin/rasdaemon --enable (code=exited, status=0/SUCCESS)
 Main PID: 40596 (rasdaemon)
    Tasks: 1
   Memory: 440.0K
   CGroup: /system.slice/rasdaemon.service
           └─40596 /usr/sbin/rasdaemon -f -r

# ras-mc-ctl --errors
Memory controller events:
1 2023-07-12 15:42:21 +0800 1 Info error(s): memory read error at CPU_SrcID#0_MC#0_Chan#0_DIMM#0 location: 0:0:0:-1, xxxx
No Extlog errors.
PCIe AER events:
1 2023-07-12 17:00:56 +0800 Corrected error: Data Link Protocol
MCE events:
1 2023-07-12 15:42:21 +0800 error: MEMORY CONTROLLER RD_CHANNEL0_ERR Transaction: Memory read error, mcg xxxx
