Drive is failing, but the LSI MegaRAID controller does not detect it


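smartmontools reports a steadily growing number of unreadable sectors on a drive that is part of a RAID1 configuration. I assumed the LSI MegaRAID controller also checks the SMART status of its disks, so shouldn't it recognize the drive as failing and mark it offline?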

Output of smartctl -d sat+megaraid,7 -a /dev/sda:

...
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       69
...
Error 11 occurred at disk power-on lifetime: 9704 hours (404 days + 8 hours)
When the command that caused the error occurred, the device was active or idle.

After command completion occurred, registers were:
ER ST SC SN CL CH DH
-- -- -- -- -- -- --
40 51 11 6f cd 04 0f  Error: UNC at LBA = 0x0f04cd6f = 251972975

Commands leading to the command that caused the error were:
CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
-- -- -- -- -- -- -- --  ----------------  --------------------
60 69 38 17 cd 04 40 00   2d+11:27:29.750  READ FPDMA QUEUED
61 10 30 98 12 55 40 00   2d+11:27:29.750  WRITE FPDMA QUEUED
61 01 28 57 86 da 40 00   2d+11:27:29.750  WRITE FPDMA QUEUED
60 09 20 f7 d1 04 40 00   2d+11:27:29.750  READ FPDMA QUEUED
60 80 18 00 d2 04 40 00   2d+11:27:29.750  READ FPDMA QUEUED
...
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      9700         -
# 2  Short offline       Completed without error       00%      9676         -
# 3  Extended offline    Completed: read failure       90%      9673         251972659

Output of MegaCli -AdpAllInfo -aAll:

Product Name    : LSI MegaRAID SAS 9260-4i
...
================
Virtual Drives    : 2
  Degraded        : 0
  Offline         : 0
Physical Devices  : 5
  Disks           : 4
  Critical Disks  : 0
  Failed Disks    : 0

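Please advise whether this controller behavior is normal or whether something is misconfigured. The controller should be in its factory-default state; I only configured the four physical disks as two RAID1 volumes.

The bad disk will be replaced in any case.

Update: I have learned that there is in fact a way to find out about such errors (see below), but I would have expected this kind of information to appear in the more prominent status output rather than being buried in a log file.

It seems the RAID controller did not flag the disk because it was still able to recover from this error condition.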
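
Since the controller summary shows no failed or critical disks, one option is to watch the SMART attribute directly rather than rely on the controller status. The following is a minimal Python sketch, not a definitive monitoring tool: it parses an attribute table in the format shown in the smartctl output above and returns the raw Current_Pending_Sector count. The function names and the embedded sample text are illustrative assumptions; the sample row is copied from the question.

```python
# Sample attribute row copied from the smartctl output in the question.
SAMPLE = """\
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       69
"""


def smart_raw_values(text):
    """Map attribute name -> raw value for each row of a SMART attribute table."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows have at least 10 columns: ID, name, flag, value,
        # worst, thresh, type, updated, when_failed, raw value.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            values[fields[1]] = int(fields[9])
    return values


def pending_sectors(text):
    """Return the raw Current_Pending_Sector count, or 0 if the row is absent."""
    return smart_raw_values(text).get("Current_Pending_Sector", 0)


if __name__ == "__main__":
    count = pending_sectors(SAMPLE)
    if count > 0:
        print(f"warning: {count} pending sectors")
```

To use it on a live system, the text would come from running the smartctl command shown in the question (for example through subprocess.run) instead of from SAMPLE. Any threshold other than "greater than zero" is a policy choice that this sketch does not make.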

Answer 1

To view the RAID controller log, run the following command:

/opt/MegaRAID/MegaCli/MegaCli -AdpEventLog -GetLatest 1000 -f events.log -aALL

The events.log file contains entries like the following, which indicate a problem with the disk:

Code: 0x0000006e
Class: 0
Locale: 0x02
Event Description: Corrected medium error during recovery on PD 07(e0xfc/s2) at f04cb53
Event Data:
===========
Device ID: 7
Enclosure Index: 252
Slot Number: 2
LBA: 251972435


seqNum: 0x00004f65
Time: Wed Mar  6 05:36:48 2013

Code: 0x00000071
Class: 0
Locale: 0x02
Event Description: Unexpected sense: PD 07(e0xfc/s2) Path 4433221101000000, CDB: 28 00 0f 04 d1 f7 00 01 e0 00, Sense: 3/11/00
Event Data:
===========
Device ID: 7
Enclosure Index: 252
Slot Number: 2
CDB Length: 10
CDB Data:
0028 0000 000f 0004 00d1 00f7 0000 0001 00e0 0000 0000 0000 0000 0000 0000 0000
Sense Length: 18
Sense Data:
00f0 0000 0003 000f 0004 00d2 0046 000a 0000 0000 0000 0000 0011 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000

seqNum: 0x00004f64
Time: Wed Mar  6 05:36:43 2013
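
Because these events are only visible in the log, it can help to summarize them per disk. The sketch below is a hedged Python example that assumes events.log has the layout shown above: it counts "Event Description" lines per enclosure/slot pair and converts the trailing hexadecimal address of "at ..." messages to decimal. The regular expressions and function name are assumptions made for illustration; other firmware versions may word their events differently. The sample text is abbreviated from the entries above.

```python
import re
from collections import Counter

# Abbreviated sample taken from the events.log entries in this answer.
SAMPLE = """\
Code: 0x0000006e
Event Description: Corrected medium error during recovery on PD 07(e0xfc/s2) at f04cb53
LBA: 251972435

Code: 0x00000071
Event Description: Unexpected sense: PD 07(e0xfc/s2) Path 4433221101000000, CDB: 28 00 0f 04 d1 f7 00 01 e0 00, Sense: 3/11/00
"""

# Matches e.g. "PD 07(e0xfc/s2)": device, enclosure (hex), slot.
PD_RE = re.compile(r"PD ([0-9a-f]+)\(e(0x[0-9a-f]+)/s(\d+)\)", re.I)
# Matches a trailing hex address such as "at f04cb53".
LBA_RE = re.compile(r"\bat ([0-9a-f]+)\s*$", re.I)


def summarize_events(text):
    """Return (event count per (enclosure, slot), list of decimal LBAs)."""
    counts = Counter()
    lbas = []
    for line in text.splitlines():
        if not line.startswith("Event Description:"):
            continue
        pd = PD_RE.search(line)
        if pd is None:
            continue  # event does not refer to a physical drive
        enclosure = int(pd.group(2), 16)
        slot = int(pd.group(3))
        counts[(enclosure, slot)] += 1
        lba = LBA_RE.search(line)
        if lba:
            lbas.append(int(lba.group(1), 16))
    return counts, lbas


if __name__ == "__main__":
    counts, lbas = summarize_events(SAMPLE)
    for (enclosure, slot), n in counts.items():
        print(f"enclosure {enclosure}, slot {slot}: {n} events")
    print("LBAs:", lbas)
```

On the sample, the hexadecimal address f04cb53 converts to 251972435, which agrees with the LBA field of the same event and lies close to the failing LBA 251972975 that smartctl reported in the question.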
