I installed an OCZ-ARC100 in a Dell PowerEdge T105. When I boot the system (CentOS 7) from it, the system logs BMDMA errors:
jun 25 15:40:21 myhost kernel: ata4.00: ATA-8: OCZ-ARC100, 1.01, max UDMA/133
jun 25 15:40:21 myhost kernel: ata4.00: 234441648 sectors, multi 1: LBA48 NCQ (depth 0/32)
jun 25 15:40:21 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:21 myhost kernel: scsi 3:0:0:0: Direct-Access ATA OCZ-ARC100 1.01 PQ: 0 ANSI: 5
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] 234441648 512-byte logical blocks: (120 GB/111 GiB)
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Write Protect is off
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Mode Sense: 00 3a 00 00
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
jun 25 15:40:21 myhost kernel: sda: sda1 sda2 sda3
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Attached SCSI disk
jun 25 15:40:21 myhost kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
jun 25 15:40:21 myhost kernel: ata4.00: BMDMA stat 0x5
jun 25 15:40:21 myhost kernel: ata4.00: failed command: READ DMA
jun 25 15:40:21 myhost kernel: ata4.00: cmd c8/00:08:00:4b:f9/00:00:00:00:00/ed tag 0 dma 4096 in
res 51/04:08:00:4b:f9/00:00:00:00:00/ed Emask 0x1 (device error)
jun 25 15:40:21 myhost kernel: ata4.00: status: { DRDY ERR }
jun 25 15:40:21 myhost kernel: ata4.00: error: { ABRT }
jun 25 15:40:21 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:21 myhost kernel: ata4: EH complete
...
jun 25 15:40:22 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:22 myhost kernel: ata4: EH complete
jun 25 15:40:22 myhost kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
jun 25 15:40:22 myhost kernel: ata4.00: BMDMA stat 0x5
jun 25 15:40:22 myhost kernel: ata4.00: failed command: READ DMA
jun 25 15:40:22 myhost kernel: ata4.00: cmd c8/00:08:d0:47:f9/00:00:00:00:00/ed tag 0 dma 4096 in
res 51/04:08:d0:47:f9/00:00:00:00:00/ed Emask 0x1 (device error)
jun 25 15:40:22 myhost kernel: ata4.00: status: { DRDY ERR }
jun 25 15:40:22 myhost kernel: ata4.00: error: { ABRT }
jun 25 15:40:22 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:22 myhost kernel: ata4: EH complete
jun 25 15:40:22 myhost kernel: ata4.00: limiting speed to UDMA/100:PIO4
jun 25 15:40:22 myhost kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
jun 25 15:40:22 myhost kernel: ata4.00: BMDMA stat 0x5
jun 25 15:40:22 myhost kernel: ata4.00: failed command: READ DMA
jun 25 15:40:22 myhost kernel: ata4.00: cmd c8/00:08:f8:47:f9/00:00:00:00:00/ed tag 0 dma 4096 in
res 51/04:08:f8:47:f9/00:00:00:00:00/ed Emask 0x1 (device error)
jun 25 15:40:22 myhost kernel: ata4.00: status: { DRDY ERR }
jun 25 15:40:22 myhost kernel: ata4.00: error: { ABRT }
jun 25 15:40:22 myhost kernel: ata4: hard resetting link
jun 25 15:40:22 myhost kernel: ata4: nv: skipping hardreset on occupied port
jun 25 15:40:22 myhost kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
jun 25 15:40:22 myhost kernel: ata4.00: configured for UDMA/100
jun 25 15:40:22 myhost kernel: ata4: EH complete
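To make the failing requests easier to read, here is a minimal Python sketch that decodes the `cmd`/`res` taskfile strings and the `BMDMA stat` value from the log above. The field order (command/feature:nsect:lbal:lbam:lbah/five high-order bytes/device) is my assumption about how libata prints these lines, the bit tables list only the common flags, and all function names are made up for illustration.

```python
# Sketch: decode libata "cmd"/"res" taskfile strings copied from the kernel log.
# Assumed field order (not confirmed by the log itself):
#   command/feature:nsect:lbal:lbam:lbah/<five high-order bytes>/device
# In a "res" line the first two fields are read as status and error instead.

TOTAL_SECTORS = 234441648  # from the "[sda] ... logical blocks" line in the log

STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}
BMDMA_BITS = {0x01: "ACTIVE", 0x02: "ERROR", 0x04: "INTR"}


def parse_taskfile(text):
    """Split a taskfile string into named integer fields."""
    first, low, _high, device = text.split("/")
    second, nsect, lbal, lbam, lbah = (int(x, 16) for x in low.split(":"))
    return {
        "first": int(first, 16),    # command (cmd line) or status (res line)
        "second": second,           # feature (cmd line) or error (res line)
        "nsect": nsect,
        "lbal": lbal,
        "lbam": lbam,
        "lbah": lbah,
        "device": int(device, 16),
    }


def lba28(tf):
    """28-bit LBA: low nibble of the device register holds bits 27:24."""
    return ((tf["device"] & 0x0F) << 24) | (tf["lbah"] << 16) | (tf["lbam"] << 8) | tf["lbal"]


def flags(value, table):
    """Names of the bits in `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


if __name__ == "__main__":
    cmd = parse_taskfile("c8/00:08:00:4b:f9/00:00:00:00:00/ed")
    res = parse_taskfile("51/04:08:00:4b:f9/00:00:00:00:00/ed")
    lba = lba28(cmd)
    print("opcode: 0x%02x, sectors: %d, LBA: %d" % (cmd["first"], cmd["nsect"], lba))
    print("sectors before end of disk:", TOTAL_SECTORS - lba)
    print("status:", flags(res["first"], STATUS_BITS))
    print("error:", flags(res["second"], ERROR_BITS))
    print("BMDMA:", flags(0x5, BMDMA_BITS))
```

Under these assumptions the first failed READ DMA targets LBA 234441472, which is 176 sectors below the 234441648 sectors the kernel reported for the drive; the other two failures decode to 234440656 and 234440696.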
I plugged the OCZ into a SATA-to-USB2 adapter and ran smartctl:
smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.6-gentoo-nvidia] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: OCZ-ARC100
Serial Number: A22L0061518000567
LU WWN Device Id: 5 e83a97 100061d69
Firmware Version: 1.01
User Capacity: 120.034.123.776 bytes [120 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ATA8-ACS (minor revision not indicated)
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sun Jun 25 15:28:55 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x1d) SMART execute Offline immediate.
No Auto Offline data collection support.
Abort Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x00) Error logging NOT supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 0) minutes.
Extended self-test routine
recommended polling time: ( 0) minutes.
SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0000 000 000 000 Old_age Offline - 0
9 Power_On_Hours 0x0000 100 100 000 Old_age Offline - 252
12 Power_Cycle_Count 0x0000 100 100 000 Old_age Offline - 84
171 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 39711824
174 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 10
195 Hardware_ECC_Recovered 0x0000 100 100 000 Old_age Offline - 0
196 Reallocated_Event_Count 0x0000 100 100 000 Old_age Offline - 0
197 Current_Pending_Sector 0x0000 100 100 000 Old_age Offline - 0
208 Unknown_SSD_Attribute 0x0000 100 100 000 Old_age Offline - 5
210 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
224 Unknown_SSD_Attribute 0x0000 100 100 000 Old_age Offline - 1
233 Media_Wearout_Indicator 0x0000 100 100 000 Old_age Offline - 100
241 Total_LBAs_Written 0x0000 100 100 000 Old_age Offline - 92
242 Total_LBAs_Read 0x0000 100 100 000 Old_age Offline - 221
249 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 3316691
SMART Error Log Version: 1
No Errors Logged
Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
Selective Self-tests/Logging not supported
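Apparently there is no sign of errors here. I did not pay much attention to the BMDMA errors at first and simply assumed the drive was dying, but I now doubt that this is the correct diagnosis. I was also misled by the fact that replacing the drive with a brand-new one (Western Digital Blue 500GB) worked like a charm, with no errors. The difference, however, is that the OCZ is actually blazingly fast by comparison.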
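The "no sign of errors" reading of the attribute table above can be checked mechanically. The following is a rough Python sketch that parses rows in the smartctl attribute layout and lists any that look suspicious. The choice of watched IDs (5, 196, 197) is my own illustrative assumption, not an official list, and the sample rows are copied from the output above.

```python
import re

# Attribute IDs treated as media-health indicators here (an illustrative choice).
WATCHED_IDS = {5, 196, 197}

# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-fA-F]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+\S+\s+\S+\s+(\S+)\s+(\d+)"
)


def parse_attributes(text):
    """Map attribute ID -> name, WHEN_FAILED column and raw value."""
    rows = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows[int(m.group(1))] = {
                "name": m.group(2),
                "when_failed": m.group(6),
                "raw": int(m.group(7)),
            }
    return rows


def suspicious(rows):
    """IDs with a WHEN_FAILED entry, or a watched ID with a nonzero raw value."""
    return [
        attr_id
        for attr_id in sorted(rows)
        if rows[attr_id]["when_failed"] != "-"
        or (attr_id in WATCHED_IDS and rows[attr_id]["raw"] != 0)
    ]


SAMPLE = """\
5 Reallocated_Sector_Ct 0x0000 000 000 000 Old_age Offline - 0
9 Power_On_Hours 0x0000 100 100 000 Old_age Offline - 252
196 Reallocated_Event_Count 0x0000 100 100 000 Old_age Offline - 0
197 Current_Pending_Sector 0x0000 100 100 000 Old_age Offline - 0
"""

if __name__ == "__main__":
    print("suspicious attribute IDs:", suspicious(parse_attributes(SAMPLE)))
```

For the rows shown, the list comes back empty, which matches the impression that SMART itself reports nothing wrong.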
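How should I interpret the errors above (apparently DMA errors), and how can I fix the problem? For example, by flashing the OCZ firmware? By using specific kernel parameters?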
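By the way, the BIOS forces the ATA bus option for SATA disks; it cannot be changed to AHCI, for example. I think this is due to the presence of either a CD/DVD drive connected to the SATA bus or the Fusion MPT hardware RAID adapter. Either way, I have no choice here (literally), but at least for the WD drive it does not seem to matter.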
Edit: I ran a drive self-test from the server itself, with the following results:
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.21.1.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0000 000 000 000 Old_age Offline - 0
9 Power_On_Hours 0x0000 100 100 000 Old_age Offline - 253
12 Power_Cycle_Count 0x0000 100 100 000 Old_age Offline - 85
171 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 39711824
174 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 10
195 Hardware_ECC_Recovered 0x0000 100 100 000 Old_age Offline - 0
196 Reallocated_Event_Count 0x0000 100 100 000 Old_age Offline - 0
197 Current_Pending_Sector 0x0000 100 100 000 Old_age Offline - 0
208 Unknown_SSD_Attribute 0x0000 100 100 000 Old_age Offline - 5
210 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 0
224 Unknown_SSD_Attribute 0x0000 100 100 000 Old_age Offline - 1
233 Media_Wearout_Indicator 0x0000 100 100 000 Old_age Offline - 100
241 Total_LBAs_Written 0x0000 100 100 000 Old_age Offline - 92
242 Total_LBAs_Read 0x0000 100 100 000 Old_age Offline - 222
249 Unknown_Attribute 0x0000 100 100 000 Old_age Offline - 3316768
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 253 -
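Also, as has been hinted, smartctl tests the drive's internals, so I think I can safely assume that the drive itself is not faulty. I will investigate some more...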