How to test a Linux server for hardware errors?

I have a Debian 10 server that reboots randomly, even though no errors are written to journald. The server has rebooted 20 times in the past 3 days.

$ journalctl --list-boots
-22 bdb1799f0c9a4e81af6d41b0bd6c5cd9 Tue 2023-01-17 12:42:00 UTC—Sat 2023-01-21 22:01:24 UTC
...
 -2 e306cc0481784a0cad5e7138b0fcfcdb Mon 2023-01-23 13:18:52 UTC—Mon 2023-01-23 13:28:54 UTC
 -1 e4ca2701610640cfb11c39c38d05c091 Mon 2023-01-23 13:32:02 UTC—Mon 2023-01-23 13:34:27 UTC
  0 d5c51684dc6e4538a241216f400d9ca7 Tue 2023-01-24 10:23:51 UTC—Tue 2023-01-24 13:10:04 UTC
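
To see how the reboots cluster over time, the boot list can be tallied per day. This is only a rough sketch: `boots_per_day` is a hypothetical helper name, and it assumes the column layout shown above (offset, boot ID, weekday, date).

```shell
# Hypothetical helper: tally boots per calendar day (by boot start date).
# Reads `journalctl --list-boots` output on stdin; lines without a
# YYYY-MM-DD date in the fourth column (such as "...") are skipped.
boots_per_day() {
    awk '$4 ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/ { n[$4]++ }
         END { for (d in n) print d, n[d] }' | sort
}

# Usage (on the affected host):
#   journalctl --list-boots | boots_per_day
```

The last messages written before a given reboot can be inspected with `journalctl -b -1 -n 50 --no-pager` (previous boot, final 50 lines).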

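I usually run memtester, which takes several hours (depending on RAM size) and in practice is unlikely to reproduce the problem (if it really is the memory).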

$ apt install memtester
$ memtester 245GB 4 > memtester.log 2>&1
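
Since memtester runs in userspace, it can only test memory it manages to lock, so a hard-coded size may be too large or too small on a busy host. A minimal sketch of sizing the run from `MemAvailable` instead; `memtester_size` is a hypothetical helper, and the 2048 MB default reserve is an assumption rather than a memtester recommendation.

```shell
# Hypothetical helper: print a memtester size argument (in MB) derived
# from MemAvailable, minus a reserve so the host keeps some headroom.
# Reads /proc/meminfo-formatted text on stdin; $1 is the reserve in MB
# (the 2048 MB default is an assumption).
memtester_size() {
    awk -v reserve="${1:-2048}" '
        $1 == "MemAvailable:" { mb = int($2 / 1024) - reserve }
        END {
            if (mb > 0) print mb "M"
            else exit 1
        }'
}

# Usage (as root, on the affected host):
#   memtester "$(memtester_size < /proc/meminfo)" 4 > memtester.log 2>&1
```

The function returns a non-zero status and prints nothing when the reserve exceeds the available memory.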

My server has 256 GB of RAM, split across 16 RAM modules:

$ dmidecode -t memory | grep Size | wc -l
16
$ free -h
             total       used       free     shared    buffers     cached
Mem:          251G        32G       218G       113M         0B       135M
-/+ buffers/cache:        32G       219G
Swap:           0B         0B         0B

The modules are DDR3:

Handle 0x002D, DMI type 17, 34 bytes
Memory Device
        Array Handle: 0x002B
        Error Information Handle: Not Provided
        Total Width: 72 bits
        Data Width: 64 bits
        Size: 16384 MB
        Form Factor: DIMM
        Set: None
        Locator: P1-DIMMA1
        Bank Locator: P0_Node0_Channel0_Dimm0
        Type: DDR3
        Type Detail: Registered (Buffered)
        Speed: 1600 MHz
        Manufacturer: Hynix Semiconducto
        Serial Number: 093C2E1C          
        Asset Tag: Dimm0_AssetTag
        Part Number: HMT42GR7AFR4C-RD
        Rank: 2
        Configured Clock Speed: 1600 MHz
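
The `grep Size | wc -l` count earlier would also count empty slots, which dmidecode reports as `Size: No Module Installed`. A rough sketch that lists only populated slots with their locator, which may help when mapping an error back to a physical DIMM; `list_dimms` is a hypothetical helper name, and it assumes `Size` appears before `Locator` in each record, as in the dump above.

```shell
# Hypothetical helper: list populated DIMM slots as "locator<TAB>size".
# Reads `dmidecode -t memory` output on stdin. Relies on Size appearing
# before Locator inside each Memory Device record.
list_dimms() {
    awk '
        /^[ \t]*Size:/ {
            sub(/^[ \t]*Size:[ \t]*/, "")
            size = $0
            next
        }
        /^[ \t]*Locator:/ {
            sub(/^[ \t]*Locator:[ \t]*/, "")
            if (size != "" && size !~ /^No Module/) print $0 "\t" size
            size = ""
        }'
}

# Usage (as root):
#   dmidecode -t memory | list_dimms
#   dmidecode -t memory | list_dimms | wc -l    # populated slots only
```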

Update: the system should have ECC memory modules (this seems to be detectable in the output of dmidecode -t memory):

Handle 0x002B, DMI type 16, 23 bytes
Physical Memory Array
        Location: System Board Or Motherboard
        Use: System Memory
        Error Correction Type: Multi-bit ECC
        Maximum Capacity: 512 GB
        Error Information Handle: Not Provided
        Number Of Devices: 8

After replacing all the memory modules, the system shows EDAC MC0 errors (which I had never seen before):

Jan 24 14:47:07 kernel: perf: interrupt took too long (2527 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
Jan 24 15:00:13 kernel: perf: interrupt took too long (3174 > 3158), lowering kernel.perf_event_max_sample_rate to 63000
Jan 24 15:19:20 kernel: perf: interrupt took too long (3984 > 3967), lowering kernel.perf_event_max_sample_rate to 50000
Jan 24 16:01:03 kernel: perf: interrupt took too long (4983 > 4980), lowering kernel.perf_event_max_sample_rate to 40000
Jan 24 17:43:25 kernel: perf: interrupt took too long (6233 > 6228), lowering kernel.perf_event_max_sample_rate to 32000
Jan 24 19:02:54 kernel: mce: [Hardware Error]: Machine check events logged
Jan 24 19:02:54 kernel: EDAC sbridge MC0: HANDLING MCE MEMORY ERROR
Jan 24 19:02:54 kernel: EDAC sbridge MC0: CPU 0: Machine Check Event: 0 Bank 10: 8c00004f000800c1
Jan 24 19:02:54 kernel: EDAC sbridge MC0: TSC 2fe1a1819026 
Jan 24 19:02:54 kernel: EDAC sbridge MC0: ADDR 1ff0136000 
Jan 24 19:02:54 kernel: EDAC sbridge MC0: MISC 908400400041e8c 
Jan 24 19:02:54 kernel: EDAC sbridge MC0: PROCESSOR 0:306e4 TIME 1674586974 SOCKET 0 APIC 0
Jan 24 19:02:54 kernel: EDAC MC0: 1 CE memory scrubbing error on CPU_SrcID#0_Ha#0_Chan#0_DIMM#0 (channel:0 slot:0 page:0x1ff0136 offset:0x0 grain:32 syndrome:0x0 -  area:DRAM err_code:0008:00c1 socket:0 ha:0 channel_mask:1 rank:1)
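
The status word after `Bank 10:` can be split into its fields to check whether the event was corrected or uncorrected. A minimal sketch follows, assuming the architectural IA32_MCi_STATUS bit layout (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57). `decode_mce_status` is a hypothetical helper; tools such as mcelog or rasdaemon decode these records far more thoroughly.

```shell
# Hypothetical helper: split an IA32_MCi_STATUS value (16 hex digits, as
# printed after "Bank N:") into its architectural fields. The value is
# handled as two 32-bit halves so that shells with signed 64-bit
# arithmetic do not overflow. The body runs in a subshell, so the
# variables stay local.
decode_mce_status() (
    s=${1#0x}
    s=$(printf '%16s' "$s" | tr ' ' '0')    # left-pad to 16 digits
    hi=$((0x${s%????????}))                  # bits 63..32
    lo=$((0x${s#????????}))                  # bits 31..0
    printf 'val=%d over=%d uc=%d en=%d miscv=%d addrv=%d pcc=%d ' \
        $(( (hi >> 31) & 1 )) $(( (hi >> 30) & 1 )) $(( (hi >> 29) & 1 )) \
        $(( (hi >> 28) & 1 )) $(( (hi >> 27) & 1 )) $(( (hi >> 26) & 1 )) \
        $(( (hi >> 25) & 1 ))
    # Bits 52..38: corrected error count (where the CPU implements it).
    # Bits 31..16: model-specific code; bits 15..0: MCA error code.
    printf 'corrected_count=%d mscod=0x%04x mcacod=0x%04x\n' \
        $(( (hi >> 6) & 0x7fff )) $(( (lo >> 16) & 0xffff )) $(( lo & 0xffff ))
)

# Usage:
#   decode_mce_status 8c00004f000800c1
```

For the value logged above this reports `uc=0` and `corrected_count=1`, which agrees with the `1 CE` in the EDAC line, and the two code fields match `err_code:0008:00c1`.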

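Update 2: I tried disabling the EDAC kernel module, as suggested by Red Hat/SUSE, to rule out the possibility that the module conflicts with the hardware error correction on the motherboard: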

echo "blacklist sb_edac" >> /etc/modprobe.d/50-blacklist.conf
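
Running the `echo ... >>` line more than once appends duplicate entries. A small sketch of an idempotent variant; `blacklist_module` is a hypothetical helper, and the default path simply mirrors the file used above.

```shell
# Hypothetical helper: add "blacklist <module>" to a modprobe.d file only
# if that exact line is not already present.
blacklist_module() {
    module=$1
    conf=${2:-/etc/modprobe.d/50-blacklist.conf}
    if ! grep -qxF "blacklist $module" "$conf" 2>/dev/null; then
        printf 'blacklist %s\n' "$module" >> "$conf"
    fi
}

# Usage (as root):
#   blacklist_module sb_edac
#   lsmod | grep sb_edac    # after a reboot, this should print nothing
```

The blacklist only stops the module from being auto-loaded, so an already loaded `sb_edac` stays active until the next reboot or until it is unloaded manually.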

This seems to prevent the reboots, but memory allocations now fail (under workload). All memory tests still pass.

Hardware name: Supermicro X9DRFR/X9DRFR, BIOS 3.2 01/16/2015
Call Trace:
 dump_stack+0x66/0x81
 dump_header+0x6b/0x283
 ? ___ratelimit+0xa1/0x100
 oom_kill_process.cold.30+0xb/0x1cf
 out_of_memory+0x1a5/0x450
 mem_cgroup_out_of_memory+0xbe/0xd0
 try_charge+0x707/0x780
 mem_cgroup_try_charge+0x86/0x190
 __add_to_page_cache_locked+0x64/0x240
 add_to_page_cache_lru+0x4a/0xe0
 filemap_fault+0x34c/0x780
 ? filemap_map_pages+0x1ed/0x3a0
 ext4_filemap_fault+0x2c/0x40 [ext4]
 __do_fault+0x36/0x170
 __handle_mm_fault+0xdb6/0x11b0
 handle_mm_fault+0xd6/0x200
 __do_page_fault+0x249/0x4f0
 ? page_fault+0x8/0x30
 page_fault+0x1e/0x30
RIP: 0033:0x7f1e1d58ff9d
Code: Bad RIP value.
RSP: 002b:00007fff6a4fd3d8 EFLAGS: 00010202
RAX: 00007f1e183501e0 RBX: 00007f10cbf0a638 RCX: 0000000000000040
RDX: 0000000000000006 RSI: 00007f1e183501e6 RDI: 00007f10cbf0a626
RBP: 00007f10cbf0b3e8 R08: 0000000000000006 R09: 0000000000000007
R10: c2bdb975b17afafd R11: 00007f1e1d5b6060 R12: 00007f1e183501b0
R13: 0000000000000005 R14: 00007f10cbf093c0 R15: 00007f10cbf0b3c8
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: 8c00004f000800c1
mce: [Hardware Error]: TSC 101eeb22ce3e ADDR 1ff19b6000 MISC 908400400041e8c 
mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1674617922 SOCKET 0 APIC 0 microcode 428
mce: [Hardware Error]: Machine check events logged
mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 10: 8c00004f000800c1
mce: [Hardware Error]: TSC 19a7daf91fd4 ADDR 1ff19b6000 MISC 908400400041e8c 
mce: [Hardware Error]: PROCESSOR 0:306e4 TIME 1674621954 SOCKET 0 APIC 0 microcode 428
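
Both machine-check records above carry the same `ADDR`, which may point at one physical location rather than scattered faults. A rough sketch for tallying addresses across a longer log; `mce_addr_counts` is a hypothetical helper that assumes the address is the token right after `ADDR`, as in these lines.

```shell
# Hypothetical helper: count machine-check records per physical address.
# Reads kernel log text on stdin and tallies the token after each "ADDR".
mce_addr_counts() {
    awk '{
            for (i = 1; i < NF; i++)
                if ($i == "ADDR") n[$(i + 1)]++
         }
         END { for (a in n) print a, n[a] }' | sort
}

# Usage:
#   journalctl -k | mce_addr_counts
```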

Answer 1

Have you tried the tool from https://www.memtest86.com/? It has always worked great for me.
