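I run a few servers at Hetzner (AX101) and have been getting random reboots for a while now. None of my investigation has turned anything up.
Environment: Ubuntu 22.04 (Ubuntu 5.15.0-58.64-generic 5.15.74)
From the system's point of view, nothing seems to happen: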
Feb 6 10:44:00 server4 kernel: [256072.858601] [UFW BLOCK] IN=enp41s0 OUT= MAC=a8:a1:59:c0:b0:d0:00:31:46:0d:3d:f3:08:00 SRC=185.156.73.150 DST=138.201.121.186 LEN=40 TOS=0x00 PREC=0x00 TTL=250 ID=26829 PROTO=TCP SPT=53764 DPT=5492 WINDOW=1024 RES=0x00 SYN URGP=0
Feb 6 10:44:37 server4 kernel: [256110.138416] [UFW BLOCK] IN=enp41s0 OUT= MAC=a8:a1:59:c0:b0:d0:00:31:46:0d:3d:f3:86:dd SRC=240b:4005:0018:3b00:88cd:89dd:7daf:c400 DST=2a01:04f8:0172:24e2:0000:0000:0000:0002 LEN=60 TC=0 HOPLIMIT=245 FLOWLBL=0 PROTO=TCP SPT=35153 DPT=20000 WINDOW=65535 RES=0x00 SYN URGP=0
Feb 6 10:46:18 server4 kernel: [ 0.000000] Linux version 5.15.0-58-generic (buildd@lcy02-amd64-101) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 (Ubuntu 5.15.0-58.64-generic 5.15.74)
Feb 6 10:46:18 server4 kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.0-58-generic root=UUID=76ab4da2-200e-48f1-8831-51fcf6935563 ro consoleblank=0 systemd.show_status=true nomodeset consoleblank=0
Feb 6 10:46:18 server4 kernel: [ 0.000000] KERNEL supported cpus:
Feb 6 10:46:18 server4 kernel: [ 0.000000] Intel GenuineIntel
Feb 6 10:46:18 server4 kernel: [ 0.000000] AMD AuthenticAMD
Feb 6 10:46:18 server4 kernel: [ 0.000000] Hygon HygonGenuine
Feb 6 10:46:18 server4 kernel: [ 0.000000] Centaur CentaurHauls
Feb 6 10:46:18 server4 kernel: [ 0.000000] zhaoxin Shanghai
Feb 6 10:46:18 server4 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Feb 6 10:46:18 server4 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Feb 6 10:46:18 server4 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Feb 6 10:46:18 server4 kernel: [ 0.000000] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
Feb 6 10:46:18 server4 kernel: [ 0.000000] x86/fpu: xstate_offset[2]: 576, xstate_sizes[2]: 256
Feb 6 10:46:18 server4 kernel: [ 0.000000] x86/fpu: xstate_offset[9]: 832, xstate_sizes[9]: 8
Feb 6 10:46:18 server4 kernel: [ 0.000000] x86/fpu: Enabled xstate features 0x207, context size is 840 bytes, using 'compacted' format.
Feb 6 10:46:18 server4 kernel: [ 0.000000] signal: max sigframe size: 3376
Feb 6 10:46:18 server4 kernel: [ 0.000000] BIOS-provided physical RAM map:
Feb 6 10:46:18 server4 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ebff] usable
Feb 6 10:46:18 server4 kernel: [ 0.000000] BIOS-e820: [mem 0x000000000009ec00-0x000000000009ffff] reserved
Feb 6 10:46:18 server4 kernel: [ 0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Feb 6 10:46:18 server4 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000009bfefff] usable
Feb 6 10:46:18 server4 kernel: [ 0.000000] BIOS-e820: [mem 0x0000000009bff000-0x0000000009ffffff] reserved
Feb 6 10:46:18 server4 kernel: [ 0.000000] BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable
Everything runs as expected until the problem hits. The server is down for about two minutes and then boots back up.
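To put a number on that silent window, here is a minimal Python sketch that scans syslog-style kernel lines like the ones above and reports every point where the kernel uptime counter (the value in square brackets) jumps back down. The function name `find_reboots` and the `year` parameter are my own; syslog timestamps carry no year, so 2023 is assumed here based on the kernel build date in the log.

```python
import re
from datetime import datetime

# Matches classic syslog kernel lines such as:
#   Feb 6 10:44:37 server4 kernel: [256110.138416] message
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s*(?P<msg>.*)$"
)


def find_reboots(lines, year=2023):
    """Return one dict per point where the kernel uptime counter restarts."""
    reboots = []
    prev = None  # (wall-clock time, kernel uptime) of the previous kernel line
    for line in lines:
        m = LINE_RE.match(line)
        if m is None:
            continue  # not a kernel line (or a wrapped continuation), skip it
        # Syslog timestamps have no year, so it is supplied by the caller (assumption).
        stamp = datetime.strptime(
            f"{year} {' '.join(m['stamp'].split())}", "%Y %b %d %H:%M:%S"
        )
        uptime = float(m["uptime"])
        if prev is not None and uptime < prev[1]:
            # Uptime went backwards: a new boot started between these two lines.
            reboots.append({
                "last_seen": prev[0],
                "uptime_before_s": prev[1],
                "booted": stamp,
                "gap_s": (stamp - prev[0]).total_seconds(),
            })
        prev = (stamp, uptime)
    return reboots
```

Fed the excerpt above, this should report a single reboot with a gap of 101 seconds between the last message (10:44:37, roughly 2.96 days of uptime) and the first line of the new boot (10:46:18). The gap only bounds the outage from above, since the last logged message is not necessarily the moment the machine died. Collecting `uptime_before_s` across several incidents might show whether the reboots cluster around a particular uptime or time of day.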
The NVMe disks look perfectly healthy:
smartctl -A /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-58-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 40 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 55,954,348 [28.6 TB]
Data Units Written: 76,540,527 [39.1 TB]
Host Read Commands: 993,043,774
Host Write Commands: 1,875,329,624
Controller Busy Time: 1,396
Power Cycles: 5
Power On Hours: 4,902
Unsafe Shutdowns: 0
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 40 Celsius
Temperature Sensor 2: 49 Celsius
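One thing that might be worth tracking across incidents is how `Power Cycles` and `Unsafe Shutdowns` move. The sketch below parses `smartctl -A` text like the output above and diffs two snapshots. The helper names are mine, and the interpretation is an assumption rather than something established in this thread: an abrupt loss of power would typically be expected to bump `Unsafe Shutdowns`, whereas a reset that leaves the drive powered might not.

```python
def parse_smart_health(text):
    """Parse 'Key: value' lines from `smartctl -A` NVMe output into integers."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue  # banner lines and blank values carry no counter
        # Keep the first token only: "55,954,348 [28.6 TB]" -> "55954348", "100%" -> "100"
        token = parts[0].replace(",", "").rstrip("%")
        try:
            # Base 0 accepts both decimal ("40") and hex ("0x00") literals.
            values[key.strip()] = int(token, 0)
        except ValueError:
            continue
    return values


def counter_changes(before, after, keys=("Power Cycles", "Unsafe Shutdowns")):
    """Return how much each selected counter grew between two parsed snapshots."""
    return {k: after[k] - before[k] for k in keys if k in before and k in after}
```

The idea would be to save one snapshot while the server is healthy, take another after the next unexpected reboot, and compare the two with `counter_changes`. If the drives were swapped during the hardware replacement, the counters start over, so only snapshots from the same drive are comparable.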
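I have also run a memory test, and it came back clean.
On the software side, nothing special is running there: PostgreSQL, node exporter, and that's basically it.
I contacted Hetzner about this and they even replaced all of the hardware, but the problem persists. That makes me think it could be a software issue (I had been suspecting power surges).
Is there any direction in which I can dig further?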
Answer 1
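Hmm, a dedicated rented host reboots unexpectedly. Nothing about it is logged on the host. All of the hardware in the host has been replaced.
This is very unlikely to be caused by a userspace program; otherwise you would see traces in the logs.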
It could be kernel-related, but you change that roughly every two weeks.
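If you want to rule the kernel in or out, one place to look is pstore, where the kernel can leave a crash record that survives a reboot. This is only a sketch under assumptions: it presumes a pstore backend (for example efi-pstore or ramoops) is active on the machine, and that records are either still in `/sys/fs/pstore` or have been moved to `/var/lib/systemd/pstore` by systemd-pstore. The function name is mine.

```python
from pathlib import Path

# Assumed locations: the pstore filesystem itself, and the directory where
# systemd-pstore archives records after boot (if that service is enabled).
PSTORE_DIRS = ("/sys/fs/pstore", "/var/lib/systemd/pstore")


def find_crash_records(dirs=PSTORE_DIRS):
    """Return every regular file found under the given pstore directories."""
    records = []
    for d in dirs:
        root = Path(d)
        if not root.is_dir():
            continue  # backend not mounted or service not installed
        try:
            records.extend(sorted(p for p in root.rglob("*") if p.is_file()))
        except OSError:
            continue  # typically a permission error when not run as root
    return records
```

Run as root after the next incident, an empty result would not prove anything on its own (the backend may simply not be set up), but a `dmesg-*` record from the time of the reboot would point fairly strongly at a kernel panic rather than a power event.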
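I think power is the most likely culprit. Does the host have dual/redundant PSUs fed from different UPSes? The obvious thing to check would be the monitoring on the PDU, but you don't have access to that. The next thing to look at would be the BMC logs (IPMI/iDRAC/iLO), but you probably can't get at those either?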
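If it turns out the OS can reach a BMC in-band (an assumption; on a rented box that may well not be the case), the System Event Log can be dumped with `ipmitool sel list`. A small Python wrapper, with a function name of my own choosing, that simply returns nothing when the tool or the BMC is unavailable:

```python
import shutil
import subprocess


def read_bmc_event_log(tool="ipmitool"):
    """Return the BMC System Event Log as a list of lines, or None if unavailable."""
    path = shutil.which(tool)
    if path is None:
        return None  # tool not installed
    try:
        result = subprocess.run(
            [path, "sel", "list"],
            capture_output=True, text=True, timeout=30, check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None  # e.g. no IPMI device exposed to the OS
    return result.stdout.splitlines()
```

A `None` result just means the log is not reachable from your side, in which case asking Hetzner support whether they can see power or reset events for the machine around the reboot timestamps is probably the only route.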