I have a GCE instance that has been running for several years. Overnight the instance was restarted, with the following log entries:
2022-02-13 04:46:36.370 CET compute.instances.hostError Instance terminated by Compute Engine.
2022-02-13 04:47:08.279 CET compute.instances.automaticRestart Instance automatically restarted by Compute Engine.
However, the instance did not actually come back up.
I can connect to the serial console, where I see the following:
serialport: Connected to ***.europe-west1-b.*** port 1 (
[ TIME ] Timed out waiting for device ***
[DEPEND] Dependency failed for File… ***.
[DEPEND] Dependency failed for /data.
[DEPEND] Dependency failed for Local File Systems.
[ OK ] Stopped Dispatch Password …ts to Console Directory Watch.
[ OK ] Stopped Forward Password R…uests to Wall Directory Watch.
[ OK ] Reached target Timers.
Starting Raise network interfaces...
[ OK ] Closed Syslog Socket.
[ OK ] Reached target Login Prompts.
[ OK ] Reached target Paths.
[ OK ] Reached target Sockets.
[ OK ] Started Emergency Shell.
[ OK ] Reached target Emergency Mode.
Starting Create Volatile Files and Directories...
[ OK ] Finished Create Volatile Files and Directories.
Starting Network Time Synchronization...
Starting Update UTMP about System Boot/Shutdown...
[ OK ] Finished Update UTMP about System Boot/Shutdown.
Starting Update UTMP about System Runlevel Changes...
[ OK ] Finished Update UTMP about System Runlevel Changes.
[ OK ] Started Network Time Synchronization.
[ OK ] Reached target System Time Set.
[ OK ] Reached target System Time Synchronized.
Stopping Network Time Synchronization...
[ OK ] Stopped Network Time Synchronization.
Starting Network Time Synchronization...
[ OK ] Started Network Time Synchronization.
[ OK ] Finished Raise network interfaces.
[ OK ] Reached target Network.
[ OK ] Reached target Network is Online.
You are in emergency mode. After logging in, type "journalctl -xb" to view
system logs, "systemctl reboot" to r
Cannot open access to console, the root account is locked.
See sulogin(8) man page for more details.
Press Enter to continue.
It looks like one of the disks cannot be attached, but what should I do now? The disk appears to be available as normal in Compute Engine.
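Answer 1
I'm afraid there is nothing you can do about the affected VM.
A host error (compute.instances.hostError) means that there was a hardware or software problem on the physical machine hosting the VM that caused the VM to crash. A host error that involves a complete hardware failure, or other hardware issues, can prevent live migration of your VM.
Even though a VM instance lives in the "cloud", there is still a physical machine running your workload. Unfortunately, that machine suffered a hardware or software failure, and there is nothing you can do about it.
GCP introduced a feature called live migration to prevent this kind of situation.
Compute Engine offers live migration to keep your VM instances running even when a host system event, such as a software or hardware update, occurs. I suspect it is too late to configure this now, though.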
...
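Live migration keeps your instances running during:
- Regular infrastructure maintenance and upgrades.
- Network and power grid maintenance in the data centers.
- Failing hardware such as memory, CPU, network interface cards, disks, power, and so on. This is done on a best-effort basis; if hardware fails completely or otherwise prevents live migration, the VM crashes and restarts automatically, and a hostError is logged.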
...
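Live migration does not change any attributes or properties of the VM itself. The live migration process simply transfers a running VM from one host machine to another host machine within the same zone.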
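Possible workaround
As you mentioned, the disks are persistent and still visible in GCP, so you can try re-attaching them to another VM. The steps are described in the documentation on creating and attaching a disk.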
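As a rough illustration of that workaround, here is a minimal Python sketch that only builds the two gcloud command lines for moving a non-boot data disk from one VM to another; it does not run them. The function name `move_disk_commands` and all names passed to it are hypothetical placeholders, and it assumes both VMs are in the same zone as the disk.

```python
from typing import List


def move_disk_commands(disk: str, old_vm: str, new_vm: str, zone: str) -> List[List[str]]:
    """Build gcloud argument lists to detach a data disk and attach it elsewhere.

    Assumption: `disk` is a non-boot persistent disk in `zone`, and both
    VMs live in that same zone. Nothing is executed here.
    """
    return [
        # Step 1: detach the disk from the broken VM.
        ["gcloud", "compute", "instances", "detach-disk", old_vm,
         f"--disk={disk}", f"--zone={zone}"],
        # Step 2: attach the same disk to the rescue VM.
        ["gcloud", "compute", "instances", "attach-disk", new_vm,
         f"--disk={disk}", f"--zone={zone}"],
    ]
```

Each returned list can be reviewed first and then passed to `subprocess.run(cmd, check=True)`. After the disk is attached, it still has to be mounted inside the rescue VM before the data can be read.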
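Answer 2
I finally found the odd cause of this error. See the original /etc/fstab:
/dev/disk/by-id/google-***-data /data ext4 discard,defaults 0 2
But there was no such device at that path. I worked around it by attaching /dev/sdb instead, though I suspect that is not the best solution. I still wonder how this happened: the device suddenly disappeared completely and ended up killing the machine.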
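One way to catch this kind of problem before the next reboot is to check that every device path named in /etc/fstab actually exists. The following is a minimal Python sketch, not something taken from the answers above: the function name `find_risky_entries` is hypothetical, and it assumes the usual whitespace-separated fstab layout. It also reports whether the `nofail` mount option is set, because without `nofail` systemd treats a missing device as a failed dependency of Local File Systems and can drop into emergency mode, which matches the console output in the question.

```python
import os
from typing import Callable, List, Tuple


def find_risky_entries(
    fstab_text: str,
    exists: Callable[[str], bool] = os.path.exists,
) -> List[Tuple[str, str, bool]]:
    """Return (device, mount_point, has_nofail) for each missing device path.

    Only entries that name the device by path (starting with "/") are
    checked; UUID= and LABEL= entries are skipped in this sketch.
    """
    risky = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 4:
            continue  # malformed line, ignored in this sketch
        device, mount_point, _fstype, options = fields[:4]
        if not device.startswith("/"):
            continue
        if not exists(device):
            has_nofail = "nofail" in options.split(",")
            risky.append((device, mount_point, has_nofail))
    return risky
```

Passing in the contents of /etc/fstab returns the entries worth a closer look. An entry reported with `has_nofail` set to False is the kind that can block boot if its device is missing, as the /data entry did here.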