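I have enabled SMART notifications on my laptop running Debian. Basically, I just want a notification to pop up when something goes wrong with the disk. I don't want emails; I think a desktop notification is the better fit for the workstation I work on (whereas email is of course better suited to servers).
It works, and I even tested it (but what exactly did I test?), yet I still doubt whether I did it the right way and whether what I did is actually useful.
Basically, here is what I did: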
- I installed smartmontools and smart-notifier:
# apt-get install smartmontools smart-notifier
- Then I configured the smartd daemon to monitor /dev/sda and to send its messages to the notifier. This is done in /etc/smartd.conf, where I have just one line:
/dev/sda -a -m myUsername -M exec /usr/share/smartmontools/smartd-runner -M test
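The -M test option in the line above shows a test notification popup as soon as the smartd daemon is restarted (you have to log out and log back in for this to work). It works: restarting the smartd daemon shows the test notification popup.
- Finally, I removed the -M test option and restarted smartd again.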
So, can I rest easy now? Will this setup send me a popup as soon as something goes wrong with /dev/sda? I have many unanswered questions:
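- With the -M test option, the test notification popup only appears when I restart smartd. When I reboot the laptop and log in, nothing is shown (probably because smartd is already running by then). Can I be confident that a notification will pop up if smartd detects a problem with my disk?
- What event exactly triggers the popup? In other words, what counts as "something going wrong"?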
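$ man smartd states:
smartd will attempt to enable SMART monitoring on ATA devices (equivalent to smartctl -s on) and polls these and SCSI devices every 30 minutes (configurable), logging SMART errors and changes of SMART Attributes via the SYSLOG interface.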
Indeed, checking /var/log/syslog, I can see a log entry 30 minutes later (the last line):
Jul 30 13:17:06 precision7520 smartd[20258]: smartd 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.5-amd64] (local build)
Jul 30 13:17:06 precision7520 smartd[20258]: Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
Jul 30 13:17:06 precision7520 smartd[20258]: Opened configuration file /etc/smartd.conf
Jul 30 13:17:06 precision7520 smartd[20258]: Configuration file /etc/smartd.conf parsed.
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda, type changed from 'scsi' to 'sat'
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], opened
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], Samsung SSD 850 EVO 2TB, S/N:S2RMNB0J801642K, WWN:5-002538-c407b1fd2, FW:EMT02B6Q, 2.00 TB
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], not found in smartd database.
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], can't monitor Current_Pending_Sector count - no Attribute 197
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], can't monitor Offline_Uncorrectable count - no Attribute 198
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], is SMART capable. Adding to "monitor" list.
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], state read from /var/lib/smartmontools/smartd.Samsung_SSD_850_EVO_2TB-S2RMNB0J801642K.ata.state
Jul 30 13:17:06 precision7520 smartd[20258]: Monitoring 1 ATA/SATA, 0 SCSI/SAS and 0 NVMe devices
Jul 30 13:17:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], state written to /var/lib/smartmontools/smartd.Samsung_SSD_850_EVO_2TB-S2RMNB0J801642K.ata.state
Jul 30 13:47:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68
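But no popup appeared. Maybe that is because the log entry is only minor information (a one-degree temperature change)? But then, what kind of event exactly is supposed to trigger a notification?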
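To separate such attribute-change entries from the rest of the log, something like the following might help. It is only a sketch: the helper name smartd_attr_changes is made up, and it assumes the message format seen in the excerpt above ("SMART Usage Attribute: ID NAME changed from OLD to NEW", plus the analogous "Prefailure" wording).

```shell
# Hypothetical helper (not part of smartmontools): reads syslog text on stdin
# and prints "ID NAME OLD NEW" for every smartd attribute-change message.
# Assumes the message format shown in the log excerpt above.
smartd_attr_changes() {
  awk '/smartd.*SMART (Usage|Prefailure) Attribute: / {
    for (i = 1; i <= NF; i++)
      if ($i == "Attribute:") print $(i + 1), $(i + 2), $(i + 5), $(i + 7)
  }'
}

# Example with the last log line from the excerpt.
echo "Jul 30 13:47:06 precision7520 smartd[20258]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Airflow_Temperature_Cel changed from 67 to 68" | smartd_attr_changes
# prints: 190 Airflow_Temperature_Cel 67 68
```

On a real system the input would presumably be the log file itself, e.g. smartd_attr_changes < /var/log/syslog.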
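- Finally, there are many examples in /etc/smartd.conf, and even more in $ man smartd.conf, some of which schedule (-s) short (-s S) or extended (-s L) self-tests at given intervals. Are those self-tests necessary? Isn't SMART supposed to include its own self-checking (the SM in SMART stands for Self-Monitoring)? How useful are the results if no self-tests are run?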
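For illustration, a configuration line that keeps the directives from my current setup and adds a self-test schedule could be assembled as in the sketch below. The helper name smartd_conf_line is made up; the -s regular expression follows the pattern documented in man smartd.conf (short test every day in the 02:00 hour, long test on Saturdays in the 03:00 hour), and whether such a schedule is actually worthwhile is exactly what I am asking.

```shell
# Hypothetical helper: prints a candidate /etc/smartd.conf line for one device.
# $1 = device, $2 = user to notify. The schedule is an assumption taken from
# the examples in man smartd.conf, not something I have running.
smartd_conf_line() {
  printf '%s -a -s (S/../.././02|L/../../6/03) -m %s -M exec /usr/share/smartmontools/smartd-runner\n' "$1" "$2"
}

smartd_conf_line /dev/sda myUsername
# prints: /dev/sda -a -s (S/../.././02|L/../../6/03) -m myUsername -M exec /usr/share/smartmontools/smartd-runner
```

If I read the man page correctly, running smartd -q showtests as root should list the self-tests that such a line would schedule, without starting the daemon.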
For reference, here is my smartctl output for /dev/sda:
$ sudo smartctl -a /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.19.0-0.bpo.5-amd64] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: Samsung SSD 850 EVO 2TB
Serial Number: S2RMNB0J801642K
LU WWN Device Id: 5 002538 c407b1fd2
Firmware Version: EMT02B6Q
User Capacity: 2 000 398 934 016 bytes [2,00 TB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Form Factor: 2.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Fri Jul 30 14:15:22 2021 WAT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
(...)
It seems that no self-test has ever been run:
(...)
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x53) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 265) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
(...)
Is this data useful even without self-tests?
(...)
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 27805
12 Power_Cycle_Count 0x0032 098 098 000 Old_age Always - 1055
177 Wear_Leveling_Count 0x0013 099 099 000 Pre-fail Always - 21
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0013 100 099 010 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 067 043 000 Old_age Always - 33
195 Hardware_ECC_Recovered 0x001a 200 200 000 Old_age Always - 0
199 UDMA_CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
235 Unknown_Attribute 0x0012 099 099 000 Old_age Always - 71
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 26330052507
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Short offline Completed without error 00% 14903 -
# 2 Short offline Completed without error 00% 14709 -
# 3 Short offline Aborted by host 70% 2733 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
255 0 65535 Read_scanning was never started
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
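Note that the self-test log in the output above does list three short tests, one of them aborted by the host. A minimal sketch for summarizing that log, assuming the "# N description status ..." line layout shown above (the helper name selftest_summary is made up):

```shell
# Hypothetical helper: reads `smartctl -l selftest` (or `smartctl -a`) output
# on stdin and prints "<total entries> <entries not completed without error>".
selftest_summary() {
  awk '/^# *[0-9]+ / {
    total++
    if ($0 !~ /Completed without error/) bad++
  } END { print total + 0, bad + 0 }'
}

# Example with the three log entries from the output above.
selftest_summary <<'EOF'
# 1 Short offline Completed without error 00% 14903 -
# 2 Short offline Completed without error 00% 14709 -
# 3 Short offline Aborted by host 70% 2733 -
EOF
# prints: 3 1
```

On a real system the input would come from something like sudo smartctl -l selftest /dev/sda | selftest_summary.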
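That is a lot of questions, but basically they all boil down to one: what is the best practice for enabling SMART disk notifications on a Linux workstation? I am a bit surprised that googling this did not turn up anything useful.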
Answer 1
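There is no need to run tests; SMART collects statistics at runtime.
If your device supports it (not all do), look at attribute 5, "Reallocated_Sector_Count" (shown as Reallocated_Sector_Ct in your output), from time to time. If the value is non-zero, it may be worth checking more often to make sure it does not suddenly increase.
The presence of reallocated sectors indicates that your device found some sectors that could not be written safely and fell back to spare sectors. It may be time to think about replacing it. However, a device can keep running for months with a few bad sectors.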
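A minimal sketch of such an occasional check, assuming the ten-column attribute table layout shown in the question (the helper name smart_attr_raw is made up, and only the first field of the raw value is returned):

```shell
# Hypothetical helper: prints the RAW_VALUE column for one attribute ID,
# reading `smartctl -A` output on stdin. $1 = attribute ID, e.g. 5.
smart_attr_raw() {
  awk -v id="$1" '$1 == id && NF >= 10 { print $10 }'
}

# Example with two rows from the attribute table in the question.
smart_attr_raw 5 <<'EOF'
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 27805
EOF
# prints: 0
```

On a real system this would be fed from sudo smartctl -A /dev/sda | smart_attr_raw 5, and anything other than 0 would be the cue to check more often.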
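SMART is only an indicator of possible failure. For SSDs, age or the amount of data written may be a better indicator of impending failure.
For your model, the drive reports Wear_Leveling_Count = 21. According to Samsung, this means you have used 21% of the available cells on the SSD, and Total_LBAs_Written gives the amount of data written to the drive, counted in LBAs of 512 bytes.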
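As a worked example of that unit conversion, here is a small sketch that turns the raw Total_LBAs_Written value from the question into bytes, assuming 512 bytes per LBA as stated above and a shell with 64-bit arithmetic (the helper name lbas_to_bytes is made up):

```shell
# Hypothetical helper: converts a Total_LBAs_Written raw value to bytes,
# assuming one LBA is 512 bytes.
lbas_to_bytes() {
  echo $(( $1 * 512 ))
}

bytes=$(lbas_to_bytes 26330052507)
# Integer division truncates, so the GB figure is rounded down.
echo "$bytes bytes, about $(( bytes / 1000000000 )) GB"
# prints: 13480986883584 bytes, about 13480 GB
```

That is roughly 13.5 TB written to a 2 TB drive.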
"Unknown_Attribute" = 71 is the number of times the system lost power unexpectedly.