RAID5 (mdadm) problem - disk detached

/var/log/syslog contains these lines:

    Apr 18 16:53:05 Server kernel: [4487878.816036] ata4: EH in SWNCQ mode,QC:qc_active 0x1 sactive 0x1
    Apr 18 16:53:05 Server kernel: [4487878.816058] ata4: SWNCQ:qc_active 0x1 defer_bits 0x0 last_issue_tag 0x0
    Apr 18 16:53:05 Server kernel: [4487878.816059]   dhfis 0x1 dmafis 0x1 sdbfis 0x0
    Apr 18 16:53:05 Server kernel: [4487878.816093] ata4: ATA_REG 0x40 ERR_REG 0x0
    Apr 18 16:53:05 Server kernel: [4487878.816108] ata4: tag : dhfis dmafis sdbfis sacitve
    Apr 18 16:53:05 Server kernel: [4487878.816125] ata4: tag 0x0: 1 1 0 1
    Apr 18 16:53:05 Server kernel: [4487878.816150] ata4.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
    Apr 18 16:53:05 Server kernel: [4487878.816178] ata4.00: failed command: WRITE FPDMA QUEUED
    Apr 18 16:53:05 Server kernel: [4487878.816199] ata4.00: cmd 61/08:00:00:88:e0/00:00:e8:00:00/40 tag 0 ncq 4096 out
    Apr 18 16:53:05 Server kernel: [4487878.816200]          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
    Apr 18 16:53:05 Server kernel: [4487878.816253] ata4.00: status: { DRDY }
    Apr 18 16:53:05 Server kernel: [4487878.816272] ata4: hard resetting link
    Apr 18 16:53:05 Server kernel: [4487878.816274] ata4: nv: skipping hardreset on occupied port
    Apr 18 16:53:06 Server kernel: [4487879.676029] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    Apr 18 16:53:07 Server kernel: [4487880.416749] ata4.00: n_sectors mismatch 3907029168 != 268435455
    Apr 18 16:53:07 Server kernel: [4487880.416752] ata4.00: revalidation failed (errno=-19)
    Apr 18 16:53:07 Server kernel: [4487880.416773] ata4.00: limiting speed to UDMA/133:PIO2
    Apr 18 16:53:11 Server kernel: [4487884.676024] ata4: hard resetting link
    Apr 18 16:53:11 Server kernel: [4487884.676027] ata4: nv: skipping hardreset on occupied port
    Apr 18 16:53:12 Server kernel: [4487885.144032] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    Apr 18 16:53:12 Server kernel: [4487885.240185] ata4.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
    Apr 18 16:53:12 Server kernel: [4487885.240190] ata4.00: revalidation failed (errno=-5)
    Apr 18 16:53:12 Server kernel: [4487885.240210] ata4.00: disabled
    Apr 18 16:53:17 Server kernel: [4487890.144023] ata4: hard resetting link
    Apr 18 16:53:17 Server kernel: [4487891.024033] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
    Apr 18 16:53:17 Server kernel: [4487891.033357] ata4.00: ATA-8: WDC WD20EARS-00S8B1, 80.00A80, max UDMA/133
    Apr 18 16:53:17 Server kernel: [4487891.033360] ata4.00: 3907029168 sectors, multi 1: LBA48 NCQ (depth 31/32)
    Apr 18 16:53:17 Server kernel: [4487891.048347] ata4.00: configured for UDMA/133
    Apr 18 16:53:17 Server kernel: [4487891.048361] sd 3:0:0:0: [sdc] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    Apr 18 16:53:17 Server kernel: [4487891.048365] sd 3:0:0:0: [sdc] Sense Key : Aborted Command [current] [descriptor]
    Apr 18 16:53:17 Server kernel: [4487891.048369] Descriptor sense data with sense descriptors (in hex):
    Apr 18 16:53:17 Server kernel: [4487891.048371]         72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
    Apr 18 16:53:17 Server kernel: [4487891.048378]         00 00 00 00
    Apr 18 16:53:17 Server kernel: [4487891.048382] sd 3:0:0:0: [sdc] Add. Sense: No additional sense information
    Apr 18 16:53:17 Server kernel: [4487891.048385] sd 3:0:0:0: [sdc] CDB: Write(10): 2a 00 e8 e0 88 00 00 00 08 00
    Apr 18 16:53:17 Server kernel: [4487891.048393] end_request: I/O error, dev sdc, sector 3907028992
    Apr 18 16:53:17 Server kernel: [4487891.048420] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048440] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048458] end_request: I/O error, dev sdc, sector 3907028992
    Apr 18 16:53:17 Server kernel: [4487891.048477] md: super_written gets error=-5, uptodate=0
    Apr 18 16:53:17 Server kernel: [4487891.048482] raid5: Disk failure on sdc, disabling device.
    Apr 18 16:53:17 Server kernel: [4487891.048483] raid5: Operation continuing on 3 devices.
    Apr 18 16:53:17 Server kernel: [4487891.048525] ata4: EH complete
    Apr 18 16:53:17 Server kernel: [4487891.048554] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048576] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048596] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048615] sd 3:0:0:0: [sdc] READ CAPACITY(16) failed
    Apr 18 16:53:17 Server kernel: [4487891.048617] sd 3:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
    Apr 18 16:53:17 Server kernel: [4487891.048620] sd 3:0:0:0: [sdc] Sense not available.
    Apr 18 16:53:17 Server kernel: [4487891.048624] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048643] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048663] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048681] sd 3:0:0:0: [sdc] READ CAPACITY failed
    Apr 18 16:53:17 Server kernel: [4487891.048683] sd 3:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
    Apr 18 16:53:17 Server kernel: [4487891.048685] sd 3:0:0:0: [sdc] Sense not available.
    Apr 18 16:53:17 Server kernel: [4487891.048689] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048709] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048800] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.048860] sd 3:0:0:0: rejecting I/O to offline device
    Apr 18 16:53:17 Server kernel: [4487891.049028] sd 3:0:0:0: [sdc] Asking for cache data failed
    Apr 18 16:53:17 Server kernel: [4487891.049048] sd 3:0:0:0: [sdc] Assuming drive cache: write through
    Apr 18 16:53:17 Server kernel: [4487891.049071] sdc: detected capacity change from 2000398934016 to 0
    Apr 18 16:53:17 Server kernel: [4487891.049080] ata4.00: detaching (SCSI 3:0:0:0)
    Apr 18 16:53:18 Server kernel: [4487891.061149] sd 3:0:0:0: [sdc] Stopping disk
    Apr 18 16:53:18 Server kernel: [4487891.485492] RAID5 conf printout:
    Apr 18 16:53:18 Server kernel: [4487891.485496]  --- rd:4 wd:3
    Apr 18 16:53:18 Server kernel: [4487891.485500]  disk 0, o:1, dev:sdb
    Apr 18 16:53:18 Server kernel: [4487891.485502]  disk 1, o:0, dev:sdc
    Apr 18 16:53:18 Server kernel: [4487891.485504]  disk 2, o:1, dev:sdd
    Apr 18 16:53:18 Server kernel: [4487891.485506]  disk 3, o:1, dev:sde
    Apr 18 16:53:18 Server kernel: [4487891.497014] RAID5 conf printout:
    Apr 18 16:53:18 Server kernel: [4487891.497016]  --- rd:4 wd:3
    Apr 18 16:53:18 Server kernel: [4487891.497018]  disk 0, o:1, dev:sdb
    Apr 18 16:53:18 Server kernel: [4487891.497019]  disk 2, o:1, dev:sdd
    Apr 18 16:53:18 Server kernel: [4487891.497021]  disk 3, o:1, dev:sde
    Apr 18 16:53:18 Server kernel: [4487891.838719] scsi 3:0:0:0: Direct-Access     ATA      WDC WD20EARS-00S 80.0 PQ: 0 ANSI: 5
    Apr 18 16:53:18 Server kernel: [4487891.838886] sd 3:0:0:0: Attached scsi generic sg3 type 0
    Apr 18 16:53:18 Server kernel: [4487891.838911] sd 3:0:0:0: [sdf] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
    Apr 18 16:53:18 Server kernel: [4487891.838964] sd 3:0:0:0: [sdf] Write Protect is off
    Apr 18 16:53:18 Server kernel: [4487891.838967] sd 3:0:0:0: [sdf] Mode Sense: 00 3a 00 00
    Apr 18 16:53:18 Server kernel: [4487891.838988] sd 3:0:0:0: [sdf] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
    Apr 18 16:53:20 Server kernel: [4487891.839147]  sdf: unknown partition table
    Apr 18 16:53:20 Server kernel: [4487893.130026] sd 3:0:0:0: [sdf] Attached SCSI disk

Now I can't do anything with /dev/sdc. Is there a way to try to reattach it? I'd rather not shut the server down unless absolutely necessary.

System:

  • Debian stable, kernel 2.6.32-5-amd64
  • mdadm version 3.1.4-1+8efb9d1

cat /proc/mdstat

Personalities : [raid6] [raid5] [raid4]
md0 : active raid5 sdb[0] sdc[4](F) sde[3] sdd[2]
      5860543488 blocks level 5, 64k chunk, algorithm 2 [4/3] [U_UU]

unused devices: <none>

mdadm --examine --scan

ARRAY /dev/md0 UUID=1a7744b5:912ec7af:f82a9565:e3b453b4

Answer 1

Try the following, using the /proc filesystem:

http://tldp.org/HOWTO/SCSI-2.4-HOWTO/mlproc.html
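
The linked HOWTO describes the legacy `/proc/scsi/scsi` interface. As a rough sketch, assuming the kernel was built with that interface and that the disk sits at SCSI address 3:0:0:0 (host, channel, id, lun, taken from the `sd 3:0:0:0` lines in the log), the commands would look like this. The helper name `scsi_proc_cmd` is made up for illustration, and the script only prints the command strings; nothing is written to `/proc` unless you redirect the output yourself as root.

```shell
#!/bin/sh
# Build a command string for the legacy /proc/scsi/scsi interface.
# $1 = action (add or remove), $2..$5 = host, channel, id, lun.
# scsi_proc_cmd is a hypothetical helper name, not part of any tool.
scsi_proc_cmd() {
    printf 'scsi %s-single-device %s %s %s %s\n' "$1" "$2" "$3" "$4" "$5"
}

# Dry run: show what would be sent for the device at 3:0:0:0.
scsi_proc_cmd remove 3 0 0 0
scsi_proc_cmd add 3 0 0 0

# To apply for real (as root), redirect each line into /proc/scsi/scsi:
#   scsi_proc_cmd remove 3 0 0 0 > /proc/scsi/scsi
#   scsi_proc_cmd add 3 0 0 0 > /proc/scsi/scsi
```

In this particular log the kernel already re-attached the drive on its own as `sdf`, so a manual rescan may not be needed at all.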

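Answer 2

I'm not sure what good you think re-adding a failing disk to the array would do. These errors are not soft errors; the disk is on its way out.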

Apr 18 16:53:05 Server kernel: [4487878.816178] ata4.00: failed command: WRITE FPDMA QUEUED
Apr 18 16:53:05 Server kernel: [4487878.816199] ata4.00: cmd 61/08:00:00:88:e0/00:00:e8:00:00/40 tag 0 ncq 4096 out
Apr 18 16:53:05 Server kernel: [4487878.816200]          res 40/00:00:01:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Apr 18 16:53:05 Server kernel: [4487878.816253] ata4.00: status: { DRDY }
Apr 18 16:53:05 Server kernel: [4487878.816272] ata4: hard resetting link
Apr 18 16:53:05 Server kernel: [4487878.816274] ata4: nv: skipping hardreset on occupied port
Apr 18 16:53:06 Server kernel: [4487879.676029] ata4: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Apr 18 16:53:07 Server kernel: [4487880.416749] ata4.00: n_sectors mismatch 3907029168 != 268435455
Apr 18 16:53:07 Server kernel: [4487880.416752] ata4.00: revalidation failed (errno=-19)

The write command failed, the link was reset, and the drive now reports a mismatched sector count.
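
The numbers in these log lines can be checked with a little arithmetic. What follows is only an interpretation, not something the log states: 268435455 happens to be the largest value a 28-bit LBA field can hold, which would suggest the drive answered the re-identify with a bogus or truncated sector count. The sketch assumes a shell with 64-bit arithmetic, as on amd64.

```shell
#!/bin/sh
# Largest value a 28-bit LBA field can hold.
lba28_max=$(( (1 << 28) - 1 ))

# Expected size: sector count from the log times 512-byte logical sectors.
capacity_bytes=$(( 3907029168 * 512 ))

# LBA taken from the failed Write(10) CDB bytes "e8 e0 88 00".
failed_lba=$(( 0xe8e08800 ))

echo "LBA28 maximum:     $lba28_max"
echo "expected capacity: $capacity_bytes bytes"
echo "failed write LBA:  $failed_lba"
```

The results line up with the log: 268435455 is the sector count the drive reported, 2000398934016 is the capacity in the `detected capacity change` line, and 3907028992 is the sector in the `end_request: I/O error` line.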

Apr 18 16:53:12 Server kernel: [4487885.240185] ata4.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x80)
Apr 18 16:53:12 Server kernel: [4487885.240190] ata4.00: revalidation failed (errno=-5)
Apr 18 16:53:12 Server kernel: [4487885.240210] ata4.00: disabled

The drive failed to respond to the IDENTIFY command.

Apr 18 16:53:17 Server kernel: [4487891.048615] sd 3:0:0:0: [sdc] READ CAPACITY(16) failed
Apr 18 16:53:17 Server kernel: [4487891.048617] sd 3:0:0:0: [sdc] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK
Apr 18 16:53:17 Server kernel: [4487891.048620] sd 3:0:0:0: [sdc] Sense not available.

The drive failed to respond to the READ CAPACITY command.

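The fact that the disk came back and presents a block device to Linux again is just smoke and mirrors. You should replace it instead of spending time trying to put a disk that looks very much like it is failing back into the RAID array. Even if you do get it back in, it will soon fail again, silently corrupt your data, or both.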
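
Replacing the member, as recommended here, might look roughly like the following. It is a sketch under assumptions: `/dev/sdX` is a placeholder for the replacement disk (its real name is not known), `run` and `failed_members` are helper names invented for this example, and `DRY_RUN` defaults to 1 so the script only prints the `mdadm` commands instead of executing them.

```shell
#!/bin/sh
ARRAY=${ARRAY:-/dev/md0}          # array name as shown in /proc/mdstat
NEW_DISK=${NEW_DISK:-/dev/sdX}    # placeholder: the replacement disk
DRY_RUN=${DRY_RUN:-1}             # 1 = print commands only

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Read mdstat-format text on stdin and print "array member" for each
# member flagged (F), e.g. "md0 sdc".
failed_members() {
    awk '$2 == ":" {
        for (i = 3; i <= NF; i++)
            if ($i ~ /\(F\)$/) {
                member = $i
                sub(/\[.*/, "", member)
                print $1, member
            }
    }'
}

if [ -r /proc/mdstat ]; then
    failed_members < /proc/mdstat
fi

# Drop members that are failed or no longer attached, then add the new disk.
run mdadm "$ARRAY" --remove failed
run mdadm "$ARRAY" --remove detached
run mdadm "$ARRAY" --add "$NEW_DISK"
```

Once the printed commands look right, the same script can be run with `DRY_RUN=0` and `NEW_DISK` set to the real device; rebuild progress then shows up in `/proc/mdstat`.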

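Technically, replacing a SATA disk does not require powering down. I realize your chassis may not have hot-swap bays and may not let you swap the disk easily, but you could take this opportunity to install a SATA hot-swap bay adapter. Something like the ones from Addonics, for example: it fits into three 5.25-inch bays and provides five 3.5-inch hot-swap drive trays. That makes replacing a disk much easier.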
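
For a hot swap, a common sysfs approach is to ask the kernel to drop the old device before pulling it, then rescan the port's SCSI host once the new disk is in. A minimal sketch, assuming the failing drive is now `sdf` on `host3` (both inferred from the log in the question); the function names and the optional sysfs-root argument are illustrative, the latter only there so the functions can be tried against a scratch directory. Whether hotplug actually works depends on the controller and its driver.

```shell
#!/bin/sh
# Detach a disk from the SCSI layer before unplugging it.
# $1 = block device name (e.g. sdf), $2 = sysfs root (default /sys).
detach_disk() {
    node="${2:-/sys}/block/$1/device/delete"
    [ -w "$node" ] || { echo "cannot write $node" >&2; return 1; }
    echo 1 > "$node"
}

# Rescan a SCSI host after the new disk has been plugged in.
# $1 = host name (e.g. host3), $2 = sysfs root (default /sys).
rescan_host() {
    node="${2:-/sys}/class/scsi_host/$1/scan"
    [ -w "$node" ] || { echo "cannot write $node" >&2; return 1; }
    echo "- - -" > "$node"
}

# Example (as root):
#   detach_disk sdf
#   ...swap the drive...
#   rescan_host host3
```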

Answer 3

I had the same problem with a Marvell controller. I disabled NCQ and it never happened again.

echo 1 > /sys/block/YOUR_DEVICE/device/queue_depth
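
Spelled out a little more, the same idea could be wrapped as below. It is a sketch: `set_queue_depth_1` is an invented name, and the optional second argument (a sysfs root) is an assumption added so the function can be exercised against a scratch directory. A queue depth of 1 effectively turns NCQ off for that device, and the setting does not survive a reboot.

```shell
#!/bin/sh
# Set the queue depth of one disk to 1, which effectively disables NCQ.
# $1 = block device name (e.g. sdc), $2 = sysfs root (default /sys).
set_queue_depth_1() {
    node="${2:-/sys}/block/$1/device/queue_depth"
    [ -w "$node" ] || { echo "cannot write $node" >&2; return 1; }
    echo "$1: queue_depth $(cat "$node") -> 1"
    echo 1 > "$node"
}

# Example (as root), once per member disk:
#   for d in sdb sdc sdd sde; do set_queue_depth_1 "$d"; done
```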
