SMART - Understanding offline data collection

I have two Kingston A400 120GB SSDs used as cache in a Synology NAS, and they do not appear to support automatic offline data collection.

# smartctl -d sat -c /dev/sdc | grep -i "Auto Offline data collection" 
Auto Offline Data Collection: Disabled.  
No Auto Offline data collection support.
# smartctl -d sat -o on /dev/sdc
SMART Automatic Timers not supported
SMART Enable Automatic Offline failed: scsi error aborted command

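However, when I check the attributes flagged as "Offline", the RAW_VALUE of one of them (specifically 246 Total_Erase_Count) keeps changing, even though I run neither manual offline data collection nor self-tests. I checked whether smartd was running, just in case, and it is not. The same thing happens on the other, identical SSD.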

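Questions:

  1. What exactly does offline data collection update? Does it only update the VALUE/WORST/THRESH columns of the attribute table?
  2. Do short or long self-tests update the SMART attribute data?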

Output of smartctl -a:

=== START OF INFORMATION SECTION ===
Model Family:     Phison Driven SSDs
Device Model:     KINGSTON SA400S37120G
Serial Number:    [...]
LU WWN Device Id: [...]
Firmware Version: 03070009
User Capacity:    120,034,123,776 bytes [120 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-3 T13/2161-D revision 4
SATA Version is:  SATA 3.2, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Fri Apr 12 01:55:30 2019 -03
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x02) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                (    0) seconds.
Offline data collection
capabilities:                    (0x35) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Abort Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0002) Does not save SMART data before
                                        entering power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x00) Error logging NOT supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (   1) minutes.
Conveyance self-test routine
recommended polling time:        (   1) minutes.

SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME                                                   FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate                                              0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours                                                   0x0032   100   100   000    Old_age   Always       -       710
 12 Power_Cycle_Count                                                0x0032   100   100   000    Old_age   Always       -       5
148 Unknown_Attribute                                                0x0000   100   100   000    Old_age   Offline      -       0
149 Unknown_Attribute                                                0x0000   100   100   000    Old_age   Offline      -       0
167 Unknown_Attribute                                                0x0000   100   100   000    Old_age   Offline      -       0
168 SATA_Phy_Error_Count                                             0x0012   100   100   000    Old_age   Always       -       0
169 Unknown_Attribute                                                0x0000   100   100   000    Old_age   Offline      -       65
170 Bad_Blk_Ct_Erl/Lat                                               0x0000   100   100   010    Old_age   Offline      -       0/78
172 Unknown_Attribute                                                0x0032   100   100   000    Old_age   Always       -       0
173 MaxAvgErase_Ct                                                   0x0000   100   100   000    Old_age   Offline      -       0
181 Program_Fail_Cnt_Total                                           0x0032   100   100   000    Old_age   Always       -       0
182 Erase_Fail_Count_Total                                           0x0000   100   100   000    Old_age   Offline      -       0
187 Reported_Uncorrect                                               0x0032   100   100   000    Old_age   Always       -       0
192 Unsafe_Shutdown_Count                                            0x0012   100   100   000    Old_age   Always       -       1
194 Temperature_Celsius                                              0x0022   024   025   000    Old_age   Always       -       24 (Min/Max 24/25)
196 Not_In_Use                                                       0x0032   100   100   000    Old_age   Always       -       0
199 CRC_Error_Count                                                  0x0032   100   100   000    Old_age   Always       -       0
218 CRC_Error_Count                                                  0x0032   100   100   000    Old_age   Always       -       4
231 SSD_Life_Left                                                    0x0000   100   100   000    Old_age   Offline      -       0
233 Flash_Writes_GiB                                                 0x0032   100   100   000    Old_age   Always       -       396
241 Lifetime_Writes_GiB                                              0x0032   100   100   000    Old_age   Always       -       304
242 Lifetime_Reads_GiB                                               0x0032   100   100   000    Old_age   Always       -       228
244 Average_Erase_Count                                              0x0000   100   100   000    Old_age   Offline      -       2
245 Max_Erase_Count                                                  0x0000   100   100   000    Old_age   Offline      -       10
246 Total_Erase_Count                                                0x0000   100   100   000    Old_age   Offline      -       3827

SMART Error Log not supported

SMART Self-test Log not supported

Selective Self-tests/Logging not supported

Answer 1

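Short answer: SSDs hide their internal data collection and reporting behind a complex controller and FTL firmware, so what you see at the SMART level is rarely a complete picture of their internal state. Do not worry that offline testing appears to be disabled: the controller most likely runs its own sanity tests and updates both online and offline attributes accordingly. The exception is firmware that simply does not; some firmware deliberately cripples SMART attributes, but that happens on HDDs too, and there is nothing you can do about it.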
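As a side note on what "online" and "offline" attributes mean in the table above: as far as I know, smartctl derives the TYPE and UPDATED columns from the two lowest bits of each attribute's FLAG word (bit 0 for pre-failure, bit 1 for online collection). The following is a minimal sketch under that assumption; `decode_attribute_flag` is a hypothetical helper name, and the flag values are the ones that appear in the question's output.

```python
def decode_attribute_flag(flag):
    """Return (TYPE, UPDATED) as smartctl would label them.

    Assumption: bit 0 marks a pre-failure attribute and bit 1 marks an
    attribute that is updated during normal (online) operation.
    """
    attr_type = "Pre-fail" if flag & 0x01 else "Old_age"
    updated = "Always" if flag & 0x02 else "Offline"
    return attr_type, updated


# FLAG values taken from the attribute table in the question.
for flag in (0x0032, 0x0012, 0x0022, 0x0000):
    attr_type, updated = decode_attribute_flag(flag)
    print(f"0x{flag:04x}: {attr_type:8} {updated}")
```

Under this reading, every attribute with FLAG 0x0000 (such as 246 Total_Erase_Count) is labelled "Offline" only because the firmware left the online bit clear, which says nothing about how the firmware actually refreshes it.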

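Long answer: SMART offline data collection is a poorly defined way of collecting disk data which, in principle, degrades IO performance, because the specific tests/collection cannot truly run in parallel with user-data IO. Hence the term "offline": the disk firmware is free to pause user IO while offline attributes are being collected. For this reason, offline collection can be disabled entirely, requested explicitly by the user at a chosen time, or (if the disk supports it) run automatically on a programmed timer.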
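Which of these modes a given disk advertises can be read from the "Offline data collection capabilities" byte that smartctl prints. Here is a minimal sketch that decodes the 0x35 value from the question; the bit assignments are my understanding of how smartctl interprets the byte, and the key names in `OFFLINE_CAPABILITY_BITS` are made up for illustration.

```python
# Assumed bit layout of the "Offline data collection capabilities" byte.
# The key names are illustrative, not smartctl identifiers.
OFFLINE_CAPABILITY_BITS = {
    0x01: "execute_offline_immediate",  # manual collection on request
    0x02: "auto_offline_collection",    # timer-driven automatic collection
    0x04: "abort_on_new_command",       # abort (rather than suspend) on new IO
    0x08: "offline_surface_scan",
    0x10: "self_test",
    0x20: "conveyance_self_test",
    0x40: "selective_self_test",
}


def decode_offline_capabilities(value):
    """Map each capability name to True/False for the given byte."""
    return {name: bool(value & mask)
            for mask, name in OFFLINE_CAPABILITY_BITS.items()}


caps = decode_offline_capabilities(0x35)  # value reported in the question
for name, supported in caps.items():
    print(f"{name}: {'yes' if supported else 'no'}")
```

For 0x35 this reports manual offline collection as available but the automatic timer as unsupported, which is consistent with `smartctl -o on` failing on this drive.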

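However, offline testing was never formally included in the ATA standard (although it appears in other storage-related standards), which leaves room for firmware-specific, and usually undocumented, behavior.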

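On every disk I have used over the past 15 years, offline tests have really been "online" tests, with no performance degradation during data collection. The only difference from online tests is that offline data is collected on a specific, firmware-dependent schedule (for example, once every 4 hours).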
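One way to observe this on a particular drive is to save the attribute table at two points in time and compare the offline-flagged rows. Below is a minimal sketch, assuming the column layout shown in the question; `parse_attributes` and `changed_offline` are hypothetical helper names. The first snapshot uses rows from the question, while the second snapshot's raw values are invented purely for illustration.

```python
def parse_attributes(text):
    """Parse smartctl attribute rows into {id: (name, updated, raw)}."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Attribute rows have 10 columns and start with a numeric ID;
        # the header row ("ID# ...") is skipped by the isdigit() check.
        if len(fields) == 10 and fields[0].isdigit():
            attrs[int(fields[0])] = (fields[1], fields[7], fields[9].strip())
    return attrs


def changed_offline(before, after):
    """Return offline-flagged attributes whose RAW_VALUE differs."""
    old, new = parse_attributes(before), parse_attributes(after)
    return [
        (attr_id, name, old[attr_id][2], raw)
        for attr_id, (name, updated, raw) in sorted(new.items())
        if updated == "Offline" and attr_id in old and old[attr_id][2] != raw
    ]


# Rows copied from the question's output.
BEFORE = """\
  9 Power_On_Hours     0x0032   100   100   000    Old_age   Always       -       710
245 Max_Erase_Count    0x0000   100   100   000    Old_age   Offline      -       10
246 Total_Erase_Count  0x0000   100   100   000    Old_age   Offline      -       3827
"""
# Hypothetical later snapshot: these raw values are made up.
AFTER = """\
  9 Power_On_Hours     0x0032   100   100   000    Old_age   Always       -       711
245 Max_Erase_Count    0x0000   100   100   000    Old_age   Offline      -       10
246 Total_Erase_Count  0x0000   100   100   000    Old_age   Offline      -       3830
"""

for attr_id, name, was, now in changed_offline(BEFORE, AFTER):
    print(f"{attr_id} {name}: {was} -> {now}")
```

In practice the two snapshots would come from saved `smartctl -A` output taken some hours apart, with no manual offline collection or self-test run in between.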

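The only exception I have found concerns Offline surface scan, a specific offline sub-test that scans the entire platter surface (or the NAND chips, on an SSD) for defects. Because it is such an intensive test, it is reported separately and can sometimes be enabled or disabled selectively. However, most HDDs (and SSDs) report surface scan as unsupported and instead implement firmware- and model-specific scanning. For example, most consumer HDDs do no surface scan at all, whereas enterprise disks scan their surface automatically even when SMART reports surface scan as disabled. SSDs are considerably more complex: the controller is required to scan the flash state periodically in order to rewrite marginal pages, so a surface scan is essentially meaningless for them.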
