Why does my Linux machine reboot randomly?

2024-6-15

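I recently had a raid5 array failure (two of the four drives failed) on a headless machine that I keep in a closet and use as a file server. I did not have adequate monitoring in place, so I did not notice when the first drive failed.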

I have replaced the two drives and rebuilt the array as raid6 with XFS.

For monitoring, I have set up mdmonitor and smartd (configuration below).

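Previously, the system would run for months at a time without any instability (it had been up for 6 months when the first drive failed!). Now, however, the system has started rebooting, and I don't know what is causing it.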

As far as I know, the only changes to the system are that I switched from raid5/ext4 to raid6/xfs and enabled mdmonitor and smartd.

You can see that it is rebooting over and over!

last reboot:

reboot   system boot  3.9.10-100.fc17. Tue Jun  3 13:36 - 14:23  (00:46)    
reboot   system boot  3.9.10-100.fc17. Tue Jun  3 12:26 - 14:23  (01:56)    
reboot   system boot  3.9.10-100.fc17. Tue Jun  3 10:20 - 14:23  (04:02)    
reboot   system boot  3.9.10-100.fc17. Tue Jun  3 09:07 - 14:23  (05:15)    
reboot   system boot  3.9.10-100.fc17. Tue Jun  3 07:58 - 14:23  (06:24)    
reboot   system boot  3.9.10-100.fc17. Tue Jun  3 06:49 - 14:23  (07:33)    
reboot   system boot  3.9.10-100.fc17. Tue Jun  3 05:35 - 14:23  (08:47)    
reboot   system boot  3.9.10-100.fc17. Tue Jun  3 04:27 - 14:23  (09:55)    
reboot   system boot  3.9.10-100.fc17. Tue Jun  3 03:17 - 14:23  (11:05)    
reboot   system boot  3.9.10-100.fc17. Tue Jun  3 02:22 - 14:23  (12:00)    
reboot   system boot  3.9.10-100.fc17. Tue Jun  3 01:12 - 14:23  (13:10)    
reboot   system boot  3.9.10-100.fc17. Tue Jun  3 00:04 - 14:23  (14:19)    
reboot   system boot  3.9.10-100.fc17. Mon Jun  2 22:51 - 14:23  (15:32)    
reboot   system boot  3.9.10-100.fc17. Mon Jun  2 21:29 - 14:23  (16:53)    
reboot   system boot  3.9.10-100.fc17. Mon Jun  2 20:15 - 14:23  (18:07)    
reboot   system boot  3.9.10-100.fc17. Mon Jun  2 19:01 - 14:23  (19:21)    
reboot   system boot  3.9.10-100.fc17. Mon Jun  2 16:26 - 14:23  (21:56)
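
To make the spacing between reboots easier to see, here is a minimal Python sketch that parses `last reboot` lines and prints the gap between consecutive boots in minutes. The helper names (`boot_times`, `gaps_in_minutes`) are my own, and because `last` does not print a year, the year is an assumption (2014, when June 3 fell on a Tuesday). The sample contains only four of the lines above.

```python
from datetime import datetime

# Four boot lines copied from the `last reboot` output above.
SAMPLE = """\
reboot   system boot  3.9.10-100.fc17. Tue Jun  3 13:36 - 14:23  (00:46)
reboot   system boot  3.9.10-100.fc17. Tue Jun  3 12:26 - 14:23  (01:56)
reboot   system boot  3.9.10-100.fc17. Tue Jun  3 10:20 - 14:23  (04:02)
reboot   system boot  3.9.10-100.fc17. Tue Jun  3 09:07 - 14:23  (05:15)
"""


def boot_times(text, year=2014):
    """Return boot timestamps, oldest first.

    `last` omits the year, so `year` is an assumption supplied by the caller.
    """
    times = []
    for line in text.splitlines():
        fields = line.split()
        # Expected layout: reboot system boot <kernel> <dow> <mon> <day> <HH:MM> ...
        if len(fields) < 8 or fields[0] != "reboot":
            continue
        mon, day, hhmm = fields[5], fields[6], fields[7]
        stamp = datetime.strptime(f"{year} {mon} {day} {hhmm}", "%Y %b %d %H:%M")
        times.append(stamp)
    return sorted(times)


def gaps_in_minutes(times):
    """Return the minutes elapsed between each pair of consecutive boots."""
    return [int((b - a).total_seconds() // 60) for a, b in zip(times, times[1:])]


if __name__ == "__main__":
    print(gaps_in_minutes(boot_times(SAMPLE)))
```

For the four sample lines this gives gaps of 73, 126 and 70 minutes. Feeding it the full listing would show whether the reboots follow a regular interval.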

Here is an excerpt from /var/log/messages from just before and after one of the unexplained reboots:

/var/log/messages:

09:38:15 smartd[641]: Device: /dev/sda [SAT], SMART Usage Attribute: 188 Command_Timeout changed from 99 to 100
09:38:17 smartd[641]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 99 to 100
09:54:57 kernel: [ 2848.075773] Clocksource tsc unstable (delta = -631754440 ns)
09:54:57 kernel: [ 2848.076234] Switching to clocksource hpet
10:08:15 smartd[641]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 61
10:08:15 smartd[641]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 38 to 39
10:13:12 dbus-daemon[694]: dbus[694]: [system] Activating service name='org.freedesktop.PackageKit' (using servicehelper)
10:13:12 dbus[694]: [system] Activating service name='org.freedesktop.PackageKit' (using servicehelper)
10:13:12 dbus-daemon[694]: dbus[694]: [system] Successfully activated service 'org.freedesktop.PackageKit'
10:13:12 dbus[694]: [system] Successfully activated service 'org.freedesktop.PackageKit'
10:20:55 kernel: imklog 5.8.10, log source = /proc/kmsg started.
10:20:55 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="622" x-info="http://www.rsyslog.com"] start
10:20:55 kernel: [    0.000000] Initializing cgroup subsys cpuset
10:20:55 kernel: [    0.000000] Initializing cgroup subsys cpu
10:20:55 kernel: [    0.000000] Linux version 3.9.10-100.fc17.x86_64 ([email protected]) (gcc version 4.7.2 20120921 (Red Hat 4.7.2-2) (GCC) ) #1 SMP Sun Jul 14 01:31:27 UTC 2013

/etc/mdadm.conf:

ARRAY /dev/md0 metadata=1.2 name=nas:0 UUID=05f5ca2c:db826606:c2ae0648:2da1b4a0
MAILADDR ...
MAILFROM ...

/etc/smartd.conf (taken from here):

DEVICESCAN
 -a              \ # Implies all standard testing and reporting.
 -n standby,10,q \ # Don't spin up disk if it is currently spun down
                 \ #   unless it is 10th attempt in a row. 
                 \ #   Don't report unsuccessful attempts anyway.
 -o on           \ # Automatic offline tests (usually every 4 hours).
 -S on           \ # Attribute autosave (I don't really understand
                 \ #   what it is for. If you can explain it to me
                 \ #   please drop me a line.
 -R 194          \ # Show real temperature in the logs.
 -R 231          \ # The same as above.
 -I 194          \ # Ignore temperature attribute changes
 -W 3,50,50      \ # Notify if the temperature changes 3 degrees
                 \ #   comparing to the last check or if
                 \ #   the temperature exceeds 50 degrees.
 -s (S/../.././02|L/../../1/22) \ # short test: every day between 2-3am
                                \ # long test every Monday between 10pm-2am
                                \ # (Long test takes a lot of time
                                \ # and it should be finished before
                                \ # daily short test starts.
                                \ # At 3am every day this disk will be
                                \ # utilized heavily as a backup storage)
 -m root         \ # To whom we should send mails.
 -M exec /usr/libexec/smartmontools/smartdnotify

Does anyone know what is causing the reboots?

Side note:

By the way, does the second line of the messages log hint at another drive failure?

SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 99 to 100

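Given that the original four drives (two of which failed) were bought at the same time, I'm guessing the remaining two may also be close to failing?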
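
To read that smartd line more carefully, here is a small Python sketch that pulls the device, attribute and old/new values out of a "changed from X to Y" message. The regular expression and the `parse_change` helper are my own illustration, not part of smartmontools. As far as I understand, these messages report the normalized attribute value by default, and for normalized values higher is generally healthier, so a positive delta would not by itself point to a failing drive. Treat that reading as an assumption to check against the raw values from `smartctl -A`.

```python
import re

# Matches smartd attribute-change messages such as the ones in the log excerpt.
LINE_RE = re.compile(
    r"Device: (?P<dev>\S+) .*?SMART (?P<kind>Prefailure|Usage) Attribute: "
    r"(?P<id>\d+) (?P<name>\S+) changed from (?P<old>\d+) to (?P<new>\d+)"
)


def parse_change(line):
    """Return a dict describing the attribute change, or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    old, new = int(m["old"]), int(m["new"])
    return {
        "device": m["dev"],
        "kind": m["kind"],
        "id": int(m["id"]),
        "name": m["name"],
        "old": old,
        "new": new,
        "delta": new - old,  # positive means the reported value went up
    }


if __name__ == "__main__":
    line = (
        "09:38:17 smartd[641]: Device: /dev/sdd [SAT], SMART Prefailure "
        "Attribute: 1 Raw_Read_Error_Rate changed from 99 to 100"
    )
    print(parse_change(line))
```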
