Best practice for enabling SMART disk notifications on a Linux workstation?

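I enabled SMART notifications on my laptop running Debian. Basically, I just want a popup notification when something goes wrong with my disk. I don't want to receive an email; I think a notification on the workstation I am working on is best (whereas email is of course better suited to servers).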

It works, and I even tested it (but what exactly did I test?), yet I still doubt whether I did it the right way and whether what I did is actually useful.

Basically, here is what I did:

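  1. I installed smartmontools and smart-notifier:
# apt-get install smartmontools smart-notifier
  2. Then I configured the smartd daemon to monitor /dev/sda and send its messages to the notifier. This is done in /etc/smartd.conf, in which I have only one line:
/dev/sda -a -m myUsername -M exec /usr/share/smartmontools/smartd-runner -M test
  3. The -M test option in the line above displays a test notification popup as soon as I restart the smartd daemon (you have to log out and log back in for this to work). It works: restarting the smartd daemon displays the test notification popup.
  4. Finally, I removed the -M test option and restarted smartd again.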

So, can I rest easy now? Will this setup send me a popup as soon as something goes wrong with /dev/sda? I have a number of unanswered questions:

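  1. With the -M test option, the test notification popup is displayed only when I restart smartd. When I reboot my laptop and log in, nothing is displayed (probably because smartd is already running by then). Can I be confident that a notification will pop up if smartd detects a problem with my disk?
  2. What events exactly trigger the popup? In other words, what counts as "something going wrong"? $ man smartd states: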

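smartd will attempt to enable SMART monitoring on ATA devices (equivalent to smartctl -s on) and polls these and SCSI devices every 30 minutes (configurable), logging SMART errors and changes of SMART Attributes via the SYSLOG interface.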

Indeed, checking /var/log/syslog, I can see a log entry 30 minutes later (last line):

Jul 30 13:17:06 precision7520 smartd[20258]: smartd 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.5-amd64] (local build)
Jul 30 13:17:06 precision7520 smartd[20258]: Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
Jul 30 13:17:06 precision7520 smartd[20258]: Opened configuration file /etc/smartd.conf
Jul 30 13:17:06 precision7520 smartd[20258]: Configuration file /etc/smartd.conf parsed.
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda, type changed from 'scsi' to 'sat'
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], opened
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], Samsung SSD 850 EVO 2TB, S/N:S2RMNB0J801642K, WWN:5-002538-c407b1fd2, FW:EMT02B6Q, 2.00 TB
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], not found in smartd database.
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_EVO_2TB-S2RMNB0J801642K.ata.state
Jul 30 13:17:06 precision7520 smartd[20258]: Monitoring 1 ATA/SATA, 0 SCSI/SAS and 0 NVMe devices
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_EVO_2TB-S2RMNB0J801642K.ata.state


Jul 30 13:47:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68

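But no popup appeared. Maybe that is because the log entry is only minor information (a temperature change of 1 degree)? But then, what kind of event exactly is supposed to trigger a notification?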

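  3. Finally, there are many example /etc/smartd.conf files, and even more in $ man smartd.conf, some of which run short (-s S) or extended (-s L) self-tests at given intervals (-s). Are those self-tests necessary? Isn't SMART supposed to include its own self-checks (the SM in SMART stands for Self-Monitoring)? How useful are the results if no self-tests are run?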

For reference, here is the smartctl -a output for /dev/sda:

$ sudo smartctl -a /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.5-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     Samsung SSD 850 EVO 2TB
Serial Number:    S2RMNB0J801642K
LU WWN Device Id: 5 002538 c407b1fd2
Firmware Version: EMT02B6Q
User Capacity:    2 000 398 934 016 bytes [2,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Form Factor:      2.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Fri Jul 30 14:15:22 2021 WAT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
(...)

It seems that no self-test has ever been run:

(...)
General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (    0) seconds.
Offline data collection
capabilities:            (0x53) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 265) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.
(...)

Is this data still useful even without self-tests?

(...)
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       27805
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       1055
177 Wear_Leveling_Count     0x0013   099   099   000    Pre-fail  Always       -       21
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   099   010    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   067   043   000    Old_age   Always       -       33
195 Hardware_ECC_Recovered  0x001a   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
235 Unknown_Attribute       0x0012   099   099   000    Old_age   Always       -       71
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       26330052507

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     14903         -
# 2  Short offline       Completed without error       00%     14709         -
# 3  Short offline       Aborted by host               70%      2733         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

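That is a lot of questions, but they all boil down to one: what is the best practice for enabling SMART disk notifications on a Linux workstation? I am a bit surprised that googling this question did not turn up anything useful.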

Answer 1

There is no need to run tests; SMART collects runtime statistics.
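
As a rough illustration of reading those statistics without starting any test, the sketch below pulls the overall health verdict out of smartctl -H style output. The function name smart_health is hypothetical, and the pattern assumes the ATA wording shown in the question (SCSI devices word this line differently).

```bash
# Print the verdict from the "overall-health" line of smartctl output on stdin.
smart_health() {
    awk -F': ' '/SMART overall-health self-assessment test result/ { print $2 }'
}

# Sample line copied from the question; on a real system one would pipe in
# the output of: sudo smartctl -H /dev/sda
sample='SMART overall-health self-assessment test result: PASSED'

verdict=$(printf '%s\n' "$sample" | smart_health)
echo "Overall health: $verdict"   # prints "Overall health: PASSED"
```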

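If your device supports it (not all do), look at attribute 5, Reallocated_Sector_Ct, from time to time. If the value is non-zero, it may be worth checking more often to make sure it does not show a sudden increase.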

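The presence of reallocated sectors indicates that your device has found some sectors that cannot be written safely and has switched to spare sectors instead. It may be time to think about replacing it. However, a device can keep running for months with a few bad sectors.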
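
A minimal sketch of such an occasional check, assuming the ten-column attribute table layout shown in the question (other devices or smartctl versions may differ). The function name smart_raw_value is illustrative and is not part of smartmontools.

```bash
# Print the RAW_VALUE column for one SMART attribute ID.
# Reads an attribute table (as printed by "smartctl -A") from stdin.
smart_raw_value() {
    # NF >= 10 skips shorter lines whose first field happens to be a number.
    awk -v id="$1" '$1 == id && NF >= 10 { print $10 }'
}

# Sample rows copied from the question; on a real system one would pipe in
# the output of: sudo smartctl -A /dev/sda
sample='  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       27805'

realloc=$(printf '%s\n' "$sample" | smart_raw_value 5)
if [ "${realloc:-0}" -gt 0 ]; then
    echo "Reallocated sectors: $realloc (worth checking more often)"
else
    echo "No reallocated sectors reported"
fi
```

With the sample rows this takes the second branch, because the raw value of attribute 5 is 0.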

SMART is only an indicator of possible failure. For SSDs, age or the amount of data written may be a better indicator of impending failure.

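For your model, Wear_Leveling_Count has a raw value of 21. On Samsung drives this raw value is generally documented as the number of program/erase cycles rather than a percentage; the normalized value of 099 suggests that only about 1% of the rated endurance has been used. Total_LBAs_Written is the number of LBAs (in 512-byte units) written to the drive.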
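
To make the 512-byte unit concrete, here is a small worked conversion of the Total_LBAs_Written raw value from the question into bytes. The function name lbas_to_bytes is illustrative, and the 512-byte LBA size is an assumption taken from the Sector Size line of the smartctl output.

```bash
# Convert a count of LBAs written into bytes, assuming 512-byte LBAs.
# Relies on 64-bit shell arithmetic (the case for bash).
lbas_to_bytes() {
    echo $(( $1 * 512 ))
}

lbas=26330052507                  # Total_LBAs_Written raw value from the question
bytes=$(lbas_to_bytes "$lbas")
echo "$bytes bytes written"       # prints "13480986883584 bytes written"
```

That is roughly 13.5 TB written to a 2 TB drive, or fewer than seven full drive writes.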

The Unknown_Attribute value of 71 (attribute 235) is the number of times the system lost power unexpectedly.
