Yesterday, OSSEC sent me a warning email:
Jul 29 21:25:16 SVR4149 kernel: end_request: I/O error, dev sda, sector 334634969
Jul 29 21:25:16 SVR4149 kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
Jul 29 21:25:16 SVR4149 kernel: end_request: I/O error, dev sda, sector 334634977
Jul 29 21:28:28 SVR4149 kernel: sd 0:0:0:0: SCSI error: return code = 0x00040000
Surprisingly, at that point the only disk I could see was /dev/sdb:
# fdisk -l
Disk /dev/sdb: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          13      104391   83  Linux
/dev/sdb2              14        7662    61440592+  83  Linux
/dev/sdb3            7663        8706     8385930   82  Linux swap / Solaris
/dev/sdb4            8707       38888   242436915    5  Extended
/dev/sdb5            8707       38888   242436883+  83  Linux
After some googling, I found this link. Running the suggested command brought the disk back as /dev/sdc:
Jul 29 22:55:45 SVR4149 kernel: ata1: hard resetting link
Jul 29 22:55:45 SVR4149 kernel: ata1: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Jul 29 22:55:45 SVR4149 kernel: ata1.00: ATA-8: WDC WD3202ABYS-01B7A0, 02.03B02, max UDMA/133
Jul 29 22:55:45 SVR4149 kernel: ata1.00: 625142448 sectors, multi 0: LBA48 NCQ (depth 31/32)
Jul 29 22:55:45 SVR4149 kernel: ata1.00: configured for UDMA/133
Jul 29 22:55:45 SVR4149 kernel: ata1: EH complete
Jul 29 22:55:45 SVR4149 kernel: ata1.00: detaching (SCSI 0:0:0:0)
Jul 29 22:55:45 SVR4149 kernel: Vendor: ATA Model: WDC WD3202ABYS-0 Rev: 02.0
Jul 29 22:55:45 SVR4149 kernel: Type: Direct-Access ANSI SCSI revision: 05
Jul 29 22:55:45 SVR4149 kernel: SCSI device sdc: 625142448 512-byte hdwr sectors (320073 MB)
Jul 29 22:55:45 SVR4149 kernel: sdc: Write Protect is off
Jul 29 22:55:45 SVR4149 kernel: sdc: Mode Sense: 00 3a 00 00
Jul 29 22:55:45 SVR4149 kernel: SCSI device sdc: drive cache: write back
Jul 29 22:55:53 SVR4149 kernel: SCSI device sdc: 625142448 512-byte hdwr sectors (320073 MB)
Jul 29 22:55:53 SVR4149 kernel: sdc: sdc1 sdc2 sdc3 sdc4 < sdc5 >
Jul 29 22:55:53 SVR4149 kernel: sd 0:0:0:0: Attached scsi disk sdc
Jul 29 22:55:53 SVR4149 kernel: sd 0:0:0:0: Attached scsi generic sg0 type 0
Re-checking with fdisk:
# fdisk -l
Disk /dev/sdb: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1          13      104391   83  Linux
/dev/sdb2              14        7662    61440592+  83  Linux
/dev/sdb3            7663        8706     8385930   82  Linux swap / Solaris
/dev/sdb4            8707       38888   242436915    5  Extended
/dev/sdb5            8707       38888   242436883+  83  Linux
Disk /dev/sdc: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1   *           1          13      104391   83  Linux
/dev/sdc2              14        7662    61440592+  83  Linux
/dev/sdc3            7663        8706     8385930   82  Linux swap / Solaris
/dev/sdc4            8707       38888   242436915    5  Extended
/dev/sdc5            8707       38888   242436883+  83  Linux
But then I found another problem in the kernel log:
Jul 30 01:03:41 SVR4149 kernel: scsi 0:0:0:0: rejecting I/O to dead device
Jul 30 01:14:40 SVR4149 kernel: scsi 0:0:0:0: rejecting I/O to dead device
Jul 30 01:16:41 SVR4149 kernel: scsi 0:0:0:0: rejecting I/O to dead device
Jul 30 01:53:18 SVR4149 last message repeated 7 times
And smartd keeps trying to open a device that no longer exists:
Jul 30 10:00:57 SVR4149 smartd[3749]: Device: /dev/sda, No such device, open() failed
There is nothing special in my smartd.conf:
# grep -v "^#" /etc/smartd.conf | sed '/^$/d'
DEVICESCAN -H -m root
Is my scsi0 "dead"?
# cat /proc/scsi/scsi
Attached devices:
Host: scsi1 Channel: 00 Id: 00 Lun: 00
Vendor: ATA Model: WDC WD3202ABYS-0 Rev: 02.0
Type: Direct-Access ANSI SCSI revision: 05
Host: scsi0 Channel: 00 Id: 00 Lun: 00
Vendor: ATA Model: WDC WD3202ABYS-0 Rev: 02.0
Type: Direct-Access ANSI SCSI revision: 05
Any help would be greatly appreciated.
Answer 1
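It looks like the drive is dropping offline and then reconnecting. That points to one of three things:
- Most likely a bad drive. I would start by checking the SMART logs and seeing what turns up there.
- A bad cable or SCSI controller (usually a RAID card). If SMART checks out and the problem persists, replace the cable first, then the card.
- You are doing so much sustained disk I/O that you are overloading the disk controller. You should be able to tell whether you are overloading the I/O.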
Hope this helps... that is a scary message.
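For the SMART check in the first suggestion, something along these lines may be a reasonable starting point. It is only a sketch: it assumes smartmontools is installed, that it runs as root, and that the re-attached disk is the /dev/sdc shown in the question. The function name check_smart is made up for illustration.

```shell
# check_smart DEVICE: print the SMART health verdict, attributes and logs.
# Assumptions: smartmontools is installed and the caller is root.
# check_smart is a hypothetical helper name, not part of smartmontools.
check_smart() {
    dev="$1"
    if [ -z "$dev" ]; then
        echo "usage: check_smart /dev/sdX" >&2
        return 2
    fi
    smartctl -H "$dev"           # overall health self-assessment
    smartctl -A "$dev"           # vendor attributes (reallocated/pending sectors, etc.)
    smartctl -l error "$dev"     # errors recorded by the drive itself
    smartctl -l selftest "$dev"  # results of earlier self-tests
}

# Example call (device name assumed from the question):
#   check_smart /dev/sdc
```

If the attributes and error log look clean, that would shift suspicion toward the cable or controller, as in the second suggestion.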