How do I debug Linux server reboots?

Answer 1

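I wrote a simple tool in bash that automatically gathers additional information about reboots. The script uses journalctl internally, so it should work on any Linux distribution that uses systemd.

The idea is simple: for each boot session, inspect the journal for more information by looking for known entries:

systemd received SIGTERM
Shutdown requested
SEGFAULT
Kernel BUG

Confirming a crash is complicated, which is why some lines are marked CRASH?. That marker means the log for that boot ends abruptly without any recognized error message. In some cases a SEGFAULT is logged, in others it is not.

This can help operators focus on the boot sessions that have suspicious entries.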

$ crashctl
Distribution        : Debian GNU/Linux 10 (buster)
Kernel              : 4.19.0-23-amd64 #1 SMP Debian 4.19.269-1 (2022-12-20)
Current boot        : 606aaecb-b14d-4bbc-9598-b6c60233a888
Scaled load         : 0.04 0.01 0.00 
System installed    : Tue Jan  3 09:26:13 UTC 2023
System started      : Mon Feb  6 03:11:44 CET 2023
Uptime              : up 7 days
Running processes   : 384
kdump               : current state   : ready to kdump
Boot First message             Last message             Uptime       Reboot/Crash
-------------------------------------------------------------------------------------
-11  2022-12-05 20:43:53 UTC   2022-12-05 20:52:00 UTC  0d 00:08:07  reboot (SIGTERM)
-10  2022-12-06 07:56:01 UTC   2022-12-06 15:14:36 UTC  0d 07:18:35  CRASH?
-9   2022-12-07 12:28:07 UTC   2022-12-10 16:33:43 UTC  3d 04:05:36  reboot (SIGTERM)
-8   2022-12-12 08:56:05 UTC   2022-12-18 08:18:40 UTC  5d 23:22:35  CRASH?
-7   2022-12-18 08:32:27 UTC   2022-12-25 10:54:03 UTC  7d 02:21:36  reboot (SIGTERM)
-6   2022-12-28 10:51:54 UTC   2022-12-29 12:12:32 UTC  1d 01:20:38  Power key pressed, but ignored
-5   2023-01-02 08:45:54 UTC   2023-01-06 08:05:01 UTC  3d 23:19:07  CRASH?
-4   2023-01-06 10:07:00 UTC   2023-01-12 10:01:25 UTC  5d 23:54:25  Power key pressed, but ignored
-3   2023-01-12 10:04:36 UTC   2023-01-28 14:07:19 UTC  16d 04:02:43 reboot (SIGTERM)
-2   2023-01-30 08:43:42 UTC   2023-01-31 07:27:26 UTC  0d 22:43:44  reboot (SIGTERM)
-1   2023-02-02 12:41:51 UTC   2023-02-04 13:16:19 UTC  2d 00:34:28  reboot (SIGTERM)
0    2023-02-06 03:12:01 UTC   2023-02-13 18:17:52 UTC  7d 15:05:51  running
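
The per-boot check described above can be sketched as a small shell function. This is not the source of crashctl; the function name classify_boot and the message patterns it searches for are assumptions chosen to mirror the four known entries listed earlier, and real journal wording may differ between distributions.

```shell
# Hypothetical sketch; this is not crashctl's source code.
# classify_boot reads the journal text of one boot on stdin and prints a
# verdict. The grep patterns are assumed message fragments.
classify_boot() {
    boot_log=$(cat)
    # Error markers are checked first so they win over a later clean shutdown.
    if printf '%s\n' "$boot_log" | grep -q 'kernel BUG'; then
        echo 'kernel BUG'
    elif printf '%s\n' "$boot_log" | grep -qi 'segfault'; then
        echo 'SEGFAULT'
    elif printf '%s\n' "$boot_log" | grep -qiE 'received SIGTERM|shutdown requested'; then
        echo 'reboot (SIGTERM)'
    else
        # The log ended without any recognized message.
        echo 'CRASH?'
    fi
}
```

On a systemd machine it could be fed the previous boot's journal with `journalctl -b -1 --no-pager | classify_boot`. Reading from stdin keeps the function easy to try against saved log excerpts.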

Answer 2

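Try the following sequence.

Check the output of last or journalctl --list-boots and note the date and time of each reboot (keep the same date and time format when you search).
Open /var/log/messages and search for that date and time. If the logs have been rotated, check the older log files.
Look at what happened just before the reboot.
If you see services being logged as stopping, the server was rebooted cleanly: by a user, on a schedule, or through the console (for cloud instances).
If a crash occurred, you will see crash traces.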
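
The search step above can be sketched with two small helpers. The names context_before and looks_clean are hypothetical, the default of 20 context lines is arbitrary, and the "Stopping"/"Stopped" patterns assume systemd-style unit messages; the `-m` and `-B` options are assumed to be available, as they are in GNU grep.

```shell
# Hypothetical helpers, assuming a grep that supports -m and -B (GNU grep does).

# context_before TIMESTAMP FILE [LINES]
# Prints the first line containing TIMESTAMP plus the LINES lines before it,
# which is the tail of the previous boot when TIMESTAMP is the boot time.
context_before() {
    grep -F -m 1 -B "${3:-20}" -- "$1" "$2"
}

# looks_clean reads log lines on stdin and succeeds if services were being
# stopped, which suggests an orderly reboot rather than a crash.
looks_clean() {
    grep -qE 'Stopping|Stopped'
}
```

For example, with a boot time taken from `last -x reboot`, something like `context_before 'Feb  6 03:11' /var/log/messages | looks_clean && echo clean` would flag an orderly shutdown. The timestamp string has to match the log file's own format exactly, including spacing.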
