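I recently had a raid5 array failure (two of the four drives failed) on a headless machine that I keep in a closet and use as a file server. I did not have adequate monitoring in place, so I did not notice when the first drive failed.
I have replaced the two drives and rebuilt the array as raid6 with XFS.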
For monitoring, I have set up mdmonitor and smartd (configuration below).
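Previously, the system would run for months without any instability (it had been up for 6 months when the first drive failed!). Now, however, it has started rebooting, and I do not know what is causing it.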
As far as I know, the only changes to the system are that I moved from raid5/ext4 to raid6/xfs and enabled mdmonitor and smartd.
You can see that it is rebooting over and over!
last reboot:
reboot system boot 3.9.10-100.fc17. Tue Jun 3 13:36 - 14:23 (00:46)
reboot system boot 3.9.10-100.fc17. Tue Jun 3 12:26 - 14:23 (01:56)
reboot system boot 3.9.10-100.fc17. Tue Jun 3 10:20 - 14:23 (04:02)
reboot system boot 3.9.10-100.fc17. Tue Jun 3 09:07 - 14:23 (05:15)
reboot system boot 3.9.10-100.fc17. Tue Jun 3 07:58 - 14:23 (06:24)
reboot system boot 3.9.10-100.fc17. Tue Jun 3 06:49 - 14:23 (07:33)
reboot system boot 3.9.10-100.fc17. Tue Jun 3 05:35 - 14:23 (08:47)
reboot system boot 3.9.10-100.fc17. Tue Jun 3 04:27 - 14:23 (09:55)
reboot system boot 3.9.10-100.fc17. Tue Jun 3 03:17 - 14:23 (11:05)
reboot system boot 3.9.10-100.fc17. Tue Jun 3 02:22 - 14:23 (12:00)
reboot system boot 3.9.10-100.fc17. Tue Jun 3 01:12 - 14:23 (13:10)
reboot system boot 3.9.10-100.fc17. Tue Jun 3 00:04 - 14:23 (14:19)
reboot system boot 3.9.10-100.fc17. Mon Jun 2 22:51 - 14:23 (15:32)
reboot system boot 3.9.10-100.fc17. Mon Jun 2 21:29 - 14:23 (16:53)
reboot system boot 3.9.10-100.fc17. Mon Jun 2 20:15 - 14:23 (18:07)
reboot system boot 3.9.10-100.fc17. Mon Jun 2 19:01 - 14:23 (19:21)
reboot system boot 3.9.10-100.fc17. Mon Jun 2 16:26 - 14:23 (21:56)
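To make the pattern in that listing easier to see, the gaps between successive boots can be computed. The following is a minimal Python sketch under stated assumptions: `boot_times` and `gaps_minutes` are hypothetical helper names, and the year is a guess because `last` does not print it. The sample input is four lines copied from the output above.

```python
from datetime import datetime

def boot_times(last_output, year=2014):
    """Parse boot timestamps from `last reboot` lines.

    `year` is an assumption: the listing itself does not include it.
    """
    times = []
    for line in last_output.splitlines():
        parts = line.split()
        # Expected fields: reboot system boot <kernel> <dow> <mon> <day> <HH:MM> ...
        if len(parts) < 8 or parts[0] != "reboot":
            continue
        mon, day, hhmm = parts[5], parts[6], parts[7]
        times.append(datetime.strptime(f"{year} {mon} {day} {hhmm}", "%Y %b %d %H:%M"))
    return sorted(times)

def gaps_minutes(times):
    """Minutes between each boot and the next one."""
    return [int((b - a).total_seconds() // 60) for a, b in zip(times, times[1:])]

sample = """\
reboot system boot 3.9.10-100.fc17. Tue Jun 3 13:36 - 14:23 (00:46)
reboot system boot 3.9.10-100.fc17. Tue Jun 3 12:26 - 14:23 (01:56)
reboot system boot 3.9.10-100.fc17. Tue Jun 3 10:20 - 14:23 (04:02)
reboot system boot 3.9.10-100.fc17. Tue Jun 3 09:07 - 14:23 (05:15)
"""

gaps = gaps_minutes(boot_times(sample))
print(gaps)
```

Applied to the full listing, most gaps come out at roughly 70 minutes, with two longer ones of 126 and 155 minutes.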
Here is an excerpt from /var/log/messages from just before and after one of the unexplained reboots:
/var/log/messages:
09:38:15 smartd[641]: Device: /dev/sda [SAT], SMART Usage Attribute: 188 Command_Timeout changed from 99 to 100
09:38:17 smartd[641]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 99 to 100
09:54:57 kernel: [ 2848.075773] Clocksource tsc unstable (delta = -631754440 ns)
09:54:57 kernel: [ 2848.076234] Switching to clocksource hpet
10:08:15 smartd[641]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 62 to 61
10:08:15 smartd[641]: Device: /dev/sda [SAT], SMART Usage Attribute: 194 Temperature_Celsius changed from 38 to 39
10:13:12 dbus-daemon[694]: dbus[694]: [system] Activating service name='org.freedesktop.PackageKit' (using servicehelper)
10:13:12 dbus[694]: [system] Activating service name='org.freedesktop.PackageKit' (using servicehelper)
10:13:12 dbus-daemon[694]: dbus[694]: [system] Successfully activated service 'org.freedesktop.PackageKit'
10:13:12 dbus[694]: [system] Successfully activated service 'org.freedesktop.PackageKit'
10:20:55 kernel: imklog 5.8.10, log source = /proc/kmsg started.
10:20:55 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="622" x-info="http://www.rsyslog.com"] start
10:20:55 kernel: [ 0.000000] Initializing cgroup subsys cpuset
10:20:55 kernel: [ 0.000000] Initializing cgroup subsys cpu
10:20:55 kernel: [ 0.000000] Linux version 3.9.10-100.fc17.x86_64 ([email protected]) (gcc version 4.7.2 20120921 (Red Hat 4.7.2-2) (GCC) ) #1 SMP Sun Jul 14 01:31:27 UTC 2013
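One thing that can be checked in an excerpt like this is whether a shutdown was logged at all. The sketch below assumes that rsyslogd writes a line containing `exiting on signal` during an orderly shutdown and a line ending in `start` when it comes back up. Both markers, the helper name `unclean_boots`, and the `lookback` window are assumptions for illustration. In the excerpt above, nothing of the kind appears between the 10:13:12 PackageKit lines and the 10:20:55 startup.

```python
def unclean_boots(lines, lookback=5, shutdown_marker="exiting on signal"):
    """Return indices of rsyslogd start lines that are not preceded by a
    shutdown marker within the previous `lookback` lines.

    The marker text is an assumption about what a clean shutdown logs.
    """
    flagged = []
    for i, line in enumerate(lines):
        if "rsyslogd:" in line and line.rstrip().endswith("start"):
            window = lines[max(0, i - lookback):i]
            if not any(shutdown_marker in w for w in window):
                flagged.append(i)
    return flagged

# Three lines copied from the excerpt above.
excerpt = [
    "10:13:12 dbus[694]: [system] Successfully activated service 'org.freedesktop.PackageKit'",
    "10:20:55 kernel: imklog 5.8.10, log source = /proc/kmsg started.",
    '10:20:55 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="622" x-info="http://www.rsyslog.com"] start',
]

# Hypothetical line (NOT from the real log) showing what a clean shutdown might leave behind.
hypothetical_shutdown = "10:20:01 rsyslogd: [origin software=\"rsyslogd\"] exiting on signal 15."
clean = excerpt[:1] + [hypothetical_shutdown] + excerpt[1:]

print(unclean_boots(excerpt))  # boot line flagged: no shutdown marker before it
print(unclean_boots(clean))    # nothing flagged
```

A boot that is flagged this way would be consistent with an abrupt reset or power loss rather than an orderly reboot, although it does not say which.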
/etc/mdadm.conf:
ARRAY /dev/md0 metadata=1.2 name=nas:0 UUID=05f5ca2c:db826606:c2ae0648:2da1b4a0
MAILADDR ...
MAILFROM ...
/etc/smartd.conf:
(taken from here)
DEVICESCAN
-a \ # Implies all standard testing and reporting.
-n standby,10,q \ # Don't spin up disk if it is currently spun down
\ # unless it is 10th attempt in a row.
\ # Don't report unsuccessful attempts anyway.
-o on \ # Automatic offline tests (usually every 4 hours).
-S on \ # Attribute autosave (I don't really understand
\ # what it is for. If you can explain it to me
\ # please drop me a line.
-R 194 \ # Show real temperature in the logs.
-R 231 \ # The same as above.
-I 194 \ # Ignore temperature attribute changes
-W 3,50,50 \ # Notify if the temperature changes 3 degrees
\ # comparing to the last check or if
\ # the temperature exceeds 50 degrees.
-s (S/../.././02|L/../../1/22) \ # short test: every day between 2-3am
\ # long test every Monday between 10pm-2am
\ # (Long test takes a lot of time
\ # and it should be finished before
\ # daily short test starts.
\ # At 3am every day this disk will be
\ # utilized heavily as a backup storage)
-m root \ # To whom we should send mails.
-M exec /usr/libexec/smartmontools/smartdnotify
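Does anyone know what could be causing the reboots?
Side note:
By the way, does the second line of the messages log suggest that another drive is failing?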
SMART Prefailure Attribute: 1 Raw_Read_Error_Rate changed from 99 to 100
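Given that the original four drives (two of which failed) were bought at the same time, I am guessing the remaining two may also be close to failing?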
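On the side question, it may help to separate the direction of each change from its meaning. The sketch below parses smartd lines of the form shown in the log and reports whether the logged value went up or down. It rests on an assumption worth verifying for these particular drives: normalized SMART values conventionally start high and fall toward a vendor threshold as an attribute degrades, so a rise such as 99 to 100 would not by itself indicate a failing drive. `parse_smartd_change` is a hypothetical helper name.

```python
import re

PATTERN = re.compile(
    r"Device: (?P<device>\S+) .*?SMART (?P<kind>Usage|Prefailure) Attribute: "
    r"(?P<id>\d+) (?P<name>\S+) changed from (?P<old>\d+) to (?P<new>\d+)"
)

def parse_smartd_change(line):
    """Extract device, attribute and direction of change from one smartd log line.

    Returns None when the line is not an attribute-change message.
    """
    m = PATTERN.search(line)
    if m is None:
        return None
    old, new = int(m["old"]), int(m["new"])
    direction = "up" if new > old else "down" if new < old else "same"
    return {
        "device": m["device"],
        "kind": m["kind"],
        "id": int(m["id"]),
        "name": m["name"],
        "old": old,
        "new": new,
        "direction": direction,
    }

# Line copied from the log excerpt in the question.
line = ("09:38:17 smartd[641]: Device: /dev/sdd [SAT], SMART Prefailure Attribute: "
        "1 Raw_Read_Error_Rate changed from 99 to 100")
change = parse_smartd_change(line)
print(change["device"], change["name"], change["direction"])
```

Because the config uses `-R 194`, the Temperature_Celsius lines in the log show raw degrees rather than normalized values, so the higher-is-better reading does not apply to them.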