SSD gives DMA errors at boot while smartctl shows no errors

I installed an OCZ-ARC100 in a Dell PowerEdge T105. When I boot the system (CentOS 7) from it, the kernel logs BMDMA errors:

jun 25 15:40:21 myhost kernel: ata4.00: ATA-8: OCZ-ARC100, 1.01, max UDMA/133
jun 25 15:40:21 myhost kernel: ata4.00: 234441648 sectors, multi 1: LBA48 NCQ (depth 0/32)
jun 25 15:40:21 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:21 myhost kernel: scsi 3:0:0:0: Direct-Access     ATA      OCZ-ARC100       1.01 PQ: 0 ANSI: 5
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] 234441648 512-byte logical blocks: (120 GB/111 GiB)
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Write Protect is off
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Mode Sense: 00 3a 00 00
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
jun 25 15:40:21 myhost kernel:  sda: sda1 sda2 sda3
jun 25 15:40:21 myhost kernel: sd 3:0:0:0: [sda] Attached SCSI disk
jun 25 15:40:21 myhost kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
jun 25 15:40:21 myhost kernel: ata4.00: BMDMA stat 0x5
jun 25 15:40:21 myhost kernel: ata4.00: failed command: READ DMA
jun 25 15:40:21 myhost kernel: ata4.00: cmd c8/00:08:00:4b:f9/00:00:00:00:00/ed tag 0 dma 4096 in
                                                        res 51/04:08:00:4b:f9/00:00:00:00:00/ed Emask 0x1 (device error)
jun 25 15:40:21 myhost kernel: ata4.00: status: { DRDY ERR }
jun 25 15:40:21 myhost kernel: ata4.00: error: { ABRT }
jun 25 15:40:21 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:21 myhost kernel: ata4: EH complete
...
jun 25 15:40:22 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:22 myhost kernel: ata4: EH complete
jun 25 15:40:22 myhost kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
jun 25 15:40:22 myhost kernel: ata4.00: BMDMA stat 0x5
jun 25 15:40:22 myhost kernel: ata4.00: failed command: READ DMA
jun 25 15:40:22 myhost kernel: ata4.00: cmd c8/00:08:d0:47:f9/00:00:00:00:00/ed tag 0 dma 4096 in
                                                        res 51/04:08:d0:47:f9/00:00:00:00:00/ed Emask 0x1 (device error)
jun 25 15:40:22 myhost kernel: ata4.00: status: { DRDY ERR }
jun 25 15:40:22 myhost kernel: ata4.00: error: { ABRT }
jun 25 15:40:22 myhost kernel: ata4.00: configured for UDMA/133
jun 25 15:40:22 myhost kernel: ata4: EH complete
jun 25 15:40:22 myhost kernel: ata4.00: limiting speed to UDMA/100:PIO4
jun 25 15:40:22 myhost kernel: ata4.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6
jun 25 15:40:22 myhost kernel: ata4.00: BMDMA stat 0x5
jun 25 15:40:22 myhost kernel: ata4.00: failed command: READ DMA
jun 25 15:40:22 myhost kernel: ata4.00: cmd c8/00:08:f8:47:f9/00:00:00:00:00/ed tag 0 dma 4096 in
                                                        res 51/04:08:f8:47:f9/00:00:00:00:00/ed Emask 0x1 (device error)
jun 25 15:40:22 myhost kernel: ata4.00: status: { DRDY ERR }
jun 25 15:40:22 myhost kernel: ata4.00: error: { ABRT }
jun 25 15:40:22 myhost kernel: ata4: hard resetting link
jun 25 15:40:22 myhost kernel: ata4: nv: skipping hardreset on occupied port
jun 25 15:40:22 myhost kernel: ata4: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
jun 25 15:40:22 myhost kernel: ata4.00: configured for UDMA/100
jun 25 15:40:22 myhost kernel: ata4: EH complete
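
To make the `cmd`/`res` lines easier to read, here is a minimal Python sketch that decodes the first failed READ DMA entry from the log above. It assumes the usual libata field order (command or status, feature or error, sector count, three LBA bytes, five high-order bytes, device) and only names the register bits I am sure of; the helper names are my own, not part of any tool.

```python
import re

# Bit names from the ATA status and error registers (only the common ones).
STATUS_BITS = {0x80: "BSY", 0x40: "DRDY", 0x20: "DF", 0x08: "DRQ", 0x01: "ERR"}
ERROR_BITS = {0x80: "ICRC", 0x40: "UNC", 0x10: "IDNF", 0x04: "ABRT"}

TOTAL_SECTORS = 234441648  # from the "234441648 sectors" line in the log


def parse_taskfile(text):
    """Split 'c8/00:08:00:4b:f9/00:00:00:00:00/ed' into a list of ints."""
    return [int(b, 16) for b in re.split(r"[/:]", text.strip())]


def lba28(tf):
    # Assumed field order: cmd, feature, nsect, lbal, lbam, lbah,
    # five high-order bytes, device. LBA bits 27:24 sit in the device nibble.
    lbal, lbam, lbah, device = tf[3], tf[4], tf[5], tf[11]
    return ((device & 0x0F) << 24) | (lbah << 16) | (lbam << 8) | lbal


def flags(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for bit, name in table.items() if value & bit]


cmd = parse_taskfile("c8/00:08:00:4b:f9/00:00:00:00:00/ed")
res = parse_taskfile("51/04:08:00:4b:f9/00:00:00:00:00/ed")

lba = lba28(cmd)
print("command  : 0x%02x" % cmd[0])  # 0xc8 is READ DMA (28-bit LBA)
print("sectors  : %d" % cmd[2])      # 8 sectors * 512 bytes = "dma 4096"
print("LBA      : %d" % lba)
print("from end : %d sectors" % (TOTAL_SECTORS - lba))
print("status   :", flags(res[0], STATUS_BITS))
print("error    :", flags(res[1], ERROR_BITS))
```

For this entry the decoded LBA is 234441472, which is 176 sectors before the end of the 234441648-sector device, and the result registers decode to DRDY/ERR and ABRT, matching what the kernel prints. Whether the position near the end of the disk matters is something I have not established.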

I plugged the OCZ into a SATA-to-USB2 adapter and ran smartctl:

smartctl 6.4 2015-06-04 r4109 [x86_64-linux-4.4.6-gentoo-nvidia] (local build)
Copyright (C) 2002-15, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     OCZ-ARC100
Serial Number:    A22L0061518000567
LU WWN Device Id: 5 e83a97 100061d69
Firmware Version: 1.01
User Capacity:    120.034.123.776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sun Jun 25 15:28:55 2017 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (    0) seconds.
Offline data collection
capabilities:            (0x1d) SMART execute Offline immediate.
                    No Auto Offline data collection support.
                    Abort Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x00) Error logging NOT supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   0) minutes.
Extended self-test routine
recommended polling time:    (   0) minutes.

SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0000   000   000   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Offline      -       252
 12 Power_Cycle_Count       0x0000   100   100   000    Old_age   Offline      -       84
171 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       39711824
174 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       10
195 Hardware_ECC_Recovered  0x0000   100   100   000    Old_age   Offline      -       0
196 Reallocated_Event_Count 0x0000   100   100   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0000   100   100   000    Old_age   Offline      -       0
208 Unknown_SSD_Attribute   0x0000   100   100   000    Old_age   Offline      -       5
210 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
224 Unknown_SSD_Attribute   0x0000   100   100   000    Old_age   Offline      -       1
233 Media_Wearout_Indicator 0x0000   100   100   000    Old_age   Offline      -       100
241 Total_LBAs_Written      0x0000   100   100   000    Old_age   Offline      -       92
242 Total_LBAs_Read         0x0000   100   100   000    Old_age   Offline      -       221
249 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       3316691

SMART Error Log Version: 1
No Errors Logged

Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Selective Self-tests/Logging not supported
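
As a rough cross-check of the attribute table, the following Python sketch parses a few rows copied from the output above and reports any watched attribute with a non-zero raw value. The choice of IDs 5, 196 and 197 as "should stay at zero" is my own assumption for this sketch; vendors define raw values differently, and this drive is not in the smartctl database.

```python
# A few rows copied from the attribute table above.
SMART_TABLE = """\
  5 Reallocated_Sector_Ct   0x0000   000   000   000    Old_age   Offline      -       0
196 Reallocated_Event_Count 0x0000   100   100   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0000   100   100   000    Old_age   Offline      -       0
233 Media_Wearout_Indicator 0x0000   100   100   000    Old_age   Offline      -       100
"""

# Attributes whose raw value is expected to stay at zero on a healthy drive
# (an assumption for this sketch, not a rule from smartmontools).
WATCHED_IDS = {5, 196, 197}


def parse_attributes(table):
    """Return {id: (name, raw_value)} for each attribute row."""
    attrs = {}
    for line in table.splitlines():
        fields = line.split()
        # Rows have 10 columns: ID, name, flag, value, worst, thresh,
        # type, updated, when_failed, raw value.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[int(fields[0])] = (fields[1], int(fields[9]))
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is not zero."""
    return {i: attrs[i] for i in WATCHED_IDS if i in attrs and attrs[i][1] != 0}


attrs = parse_attributes(SMART_TABLE)
bad = suspicious(attrs)
print("non-zero watched attributes:", bad if bad else "none")
```

Note that the same output reports "Error logging NOT supported", so the "No Errors Logged" line may carry less weight than it appears to.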

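Clearly there is no sign of an error here. I did not pay much attention to the BMDMA errors at first and simply assumed the drive was dying, but now I doubt that was the right diagnosis. I was also misled by the fact that swapping in a brand-new drive (Western Digital Blue 500GB) worked like a charm, with no errors. The difference, though, is that the OCZ is actually amazingly fast by comparison.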

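How should I interpret the errors above (apparently DMA errors), and how can I fix this? For example, by flashing the OCZ firmware? By using specific kernel parameters?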

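By the way, the BIOS forces the ATA option for SATA disks; it cannot be changed to AHCI, for example. I assume this is due to the CD/DVD drive attached to the SATA bus, or to the presence of the Fusion MPT hardware RAID adapter. Either way, I have no choice here (literally), but at least for the WD drive it does not seem to matter.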


Edit: I ran a drive self-test from the server itself, with the following result:

smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.10.0-514.21.1.el7.x86_64] (local build)
Copyright (C) 2002-13, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 18
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0000   000   000   000    Old_age   Offline      -       0
  9 Power_On_Hours          0x0000   100   100   000    Old_age   Offline      -       253
 12 Power_Cycle_Count       0x0000   100   100   000    Old_age   Offline      -       85
171 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       39711824
174 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       10
195 Hardware_ECC_Recovered  0x0000   100   100   000    Old_age   Offline      -       0
196 Reallocated_Event_Count 0x0000   100   100   000    Old_age   Offline      -       0
197 Current_Pending_Sector  0x0000   100   100   000    Old_age   Offline      -       0
208 Unknown_SSD_Attribute   0x0000   100   100   000    Old_age   Offline      -       5
210 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       0
224 Unknown_SSD_Attribute   0x0000   100   100   000    Old_age   Offline      -       1
233 Media_Wearout_Indicator 0x0000   100   100   000    Old_age   Offline      -       100
241 Total_LBAs_Written      0x0000   100   100   000    Old_age   Offline      -       92
242 Total_LBAs_Read         0x0000   100   100   000    Old_age   Offline      -       222
249 Unknown_Attribute       0x0000   100   100   000    Old_age   Offline      -       3316768

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%       253         -

Also, since smartctl, as I understand it, tests the drive's internals, I think I can safely assume the drive is not faulty. I'll investigate some more...
