Ubuntu random reboots


I run several servers at Hetzner (AX101) and have been seeing random reboots for some time. All my investigation so far has turned up nothing.

Setup: Ubuntu 22.04 (Ubuntu 5.15.0-58.64-generic 5.15.74)

From the system's point of view, nothing seems to have happened:

Feb  6 10:44:00 server4 kernel: [256072.858601] [UFW BLOCK] IN=enp41s0 OUT= MAC=a8:a1:59:c0:b0:d0:00:31:46:0d:3d:f3:08:00 SRC=185.156.73.150 DST=138.201.121.186 LEN=40 TOS=0x00 PREC=0x00 TTL=250 ID=26829 PROTO=TCP SPT=53764 DPT=5492 WINDOW=1024 RES=0x00 SYN URGP=0
Feb  6 10:44:37 server4 kernel: [256110.138416] [UFW BLOCK] IN=enp41s0 OUT= MAC=a8:a1:59:c0:b0:d0:00:31:46:0d:3d:f3:86:dd SRC=240b:4005:0018:3b00:88cd:89dd:7daf:c400 DST=2a01:04f8:0172:24e2:0000:0000:0000:0002 LEN=60 TC=0 HOPLIMIT=245 FLOWLBL=0 PROTO=TCP SPT=35153 DPT=20000 WINDOW=65535 RES=0x00 SYN URGP=0
Feb  6 10:46:18 server4 kernel: [    0.000000] Linux version 5.15.0-58-generic (buildd@lcy02-amd64-101) (gcc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0, GNU ld (GNU Binutils for Ubuntu) 2.38) #64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023 (Ubuntu 5.15.0-58.64-generic 5.15.74)
Feb  6 10:46:18 server4 kernel: [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-5.15.0-58-generic root=UUID=76ab4da2-200e-48f1-8831-51fcf6935563 ro consoleblank=0 systemd.show_status=true nomodeset consoleblank=0
Feb  6 10:46:18 server4 kernel: [    0.000000] KERNEL supported cpus:
Feb  6 10:46:18 server4 kernel: [    0.000000]   Intel GenuineIntel
Feb  6 10:46:18 server4 kernel: [    0.000000]   AMD AuthenticAMD
Feb  6 10:46:18 server4 kernel: [    0.000000]   Hygon HygonGenuine
Feb  6 10:46:18 server4 kernel: [    0.000000]   Centaur CentaurHauls
Feb  6 10:46:18 server4 kernel: [    0.000000]   zhaoxin   Shanghai
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Supporting XSAVE feature 0x200: 'Protection Keys User registers'
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: xstate_offset[9]:  832, xstate_sizes[9]:    8
Feb  6 10:46:18 server4 kernel: [    0.000000] x86/fpu: Enabled xstate features 0x207, context size is 840 bytes, using 'compacted' format.
Feb  6 10:46:18 server4 kernel: [    0.000000] signal: max sigframe size: 3376
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-provided physical RAM map:
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009ebff] usable
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x000000000009ec00-0x000000000009ffff] reserved
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x00000000000e0000-0x00000000000fffff] reserved
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x0000000000100000-0x0000000009bfefff] usable
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x0000000009bff000-0x0000000009ffffff] reserved
Feb  6 10:46:18 server4 kernel: [    0.000000] BIOS-e820: [mem 0x000000000a000000-0x000000000a1fffff] usable

Everything works as expected until it doesn't. The server is down for about two minutes and then the system boots again.
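
To put a number on that silent window, the log can be scanned for the kernel's boot banner and compared with the timestamp of the entry just before it. The following is only a minimal sketch: the function names are made up for illustration, the year is an assumption (classic syslog timestamps carry none), and month parsing assumes an English/C locale.

```python
import re
from datetime import datetime

# Start of a classic syslog line, e.g. "Feb  6 10:44:37 server4 kernel: ..."
SYSLOG_TS = re.compile(r"^([A-Z][a-z]{2})\s+(\d{1,2}) (\d{2}:\d{2}:\d{2}) ")
# The kernel prints its version banner at uptime 0, which marks a fresh boot.
BOOT_MARKER = re.compile(r"kernel: \[\s*0\.000000\] Linux version ")


def parse_ts(line, year):
    """Return the timestamp of a syslog line, or None if it has none."""
    m = SYSLOG_TS.match(line)
    if not m:
        return None
    month, day, clock = m.groups()
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def find_boot_gaps(lines, year=2023):
    """Return (last_entry_ts, boot_ts, gap_seconds) for every boot banner.

    `year` is an assumption supplied by the caller; logs that span a
    year change would need extra handling.
    """
    gaps = []
    previous = None
    for line in lines:
        ts = parse_ts(line, year)
        if ts is None:
            continue  # wrapped continuation lines carry no timestamp
        if BOOT_MARKER.search(line) and previous is not None:
            gaps.append((previous, ts, (ts - previous).total_seconds()))
        previous = ts
    return gaps


# Two lines taken from the excerpt above, abbreviated with "...".
sample = [
    "Feb  6 10:44:37 server4 kernel: [256110.138416] [UFW BLOCK] IN=enp41s0 OUT= ...",
    "Feb  6 10:46:18 server4 kernel: [    0.000000] Linux version 5.15.0-58-generic ...",
]
for last_seen, boot, gap in find_boot_gaps(sample):
    print(f"last entry {last_seen}, boot {boot}, silent for {gap:.0f} s")
```

For the excerpt above this reports a gap of 101 seconds between the last firewall entry and the boot banner. The real outage may be shorter, since the log only shows when the last message happened to be written.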

The NVMe disks look perfectly healthy:

smartctl -A /dev/nvme0n1
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-58-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF SMART DATA SECTION ===
SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        40 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    55,954,348 [28.6 TB]
Data Units Written:                 76,540,527 [39.1 TB]
Host Read Commands:                 993,043,774
Host Write Commands:                1,875,329,624
Controller Busy Time:               1,396
Power Cycles:                       5
Power On Hours:                     4,902
Unsafe Shutdowns:                   0
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               40 Celsius
Temperature Sensor 2:               49 Celsius
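
One possible way to get more out of these counters is to save the `smartctl -A` output before and after an unexpected reboot and compare the two. The sketch below is illustrative only: the function names are hypothetical, and the interpretation is an assumption rather than something stated in the post. A change in `Power Cycles` or `Unsafe Shutdowns` would hint that the drive actually lost power, while unchanged values would hint at a reset without power loss. Drive firmware differs, so treat the result as a hint, not proof.

```python
import re


def parse_smart_health(text):
    """Parse 'Name: value' lines of NVMe `smartctl -A` output into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if not sep:
            continue  # banner and section lines have no "Name: value" shape
        m = re.match(r"\s*(0x[0-9a-fA-F]+|\d[\d,]*)", value)
        if not m:
            continue
        raw = m.group(1).replace(",", "")  # drop thousands separators
        number = int(raw, 16) if raw.lower().startswith("0x") else int(raw)
        counters[" ".join(name.split())] = number  # collapse repeated spaces
    return counters


def changed_counters(before, after, keys=("Power Cycles", "Unsafe Shutdowns")):
    """Return {name: delta} for the selected counters that differ."""
    return {
        k: after[k] - before[k]
        for k in keys
        if k in before and k in after and after[k] != before[k]
    }


# A few lines from the output above.
before = parse_smart_health(
    "Critical Warning:                   0x00\n"
    "Power Cycles:                       5\n"
    "Unsafe Shutdowns:                   0\n"
)
print(before)
```

Calling `changed_counters(before, after)` with a second snapshot taken after the next reboot returns an empty dict if neither counter moved.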

I also ran a memory test, and it found no problems.

Software-wise, nothing special is running there: PostgreSQL, node exporter, and that is basically it.

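I contacted Hetzner about this, and they even replaced all of the hardware, but the problem persists, which makes me think it may be a software issue (power surges are suspected).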

Is there any direction in which I could dig further into this?

Answer 1

Hmm, a dedicated rented host that reboots unexpectedly. Nothing is logged on the host. All of the hardware in the host has been replaced.

This is very unlikely to be caused by a userspace program; otherwise you would see traces of it in the logs.

It could be kernel-related, but the kernel gets replaced roughly every two weeks.

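I think power is the most likely culprit. Does the host have dual/redundant PSUs fed from different UPSs? The obvious thing to check is the monitoring on the PDU, but you have no access to that. The next thing to look at would be the BMC logs (IPMI/iDRAC/iLO), but you probably cannot access those either?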
