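One of our Samsung 2TB NVMe SSDs failed recently, so we swapped it out for a new one and started paying close attention to the SMART data.
Here is the output for the drive, which was installed less than two weeks ago: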
root@~ $ smartctl -a /dev/nvme0n1p1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-53-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung SSD 970 EVO Plus 2TB
Serial Number: S59CNZFNA02015F
Firmware Version: 2B2QEXM7
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 4
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization: 129,469,706,240 [129 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 002538 5a019ed120
Local Time is: Sun Nov 22 22:11:40 2020 EST
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0017): Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 85 Celsius
Critical Comp. Temp. Threshold: 85 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 7.50W - - 0 0 0 0 0 0
1 + 5.90W - - 1 1 1 1 0 0
2 + 3.60W - - 2 2 2 2 0 0
3 - 0.0700W - - 3 3 3 3 210 1200
4 - 0.0050W - - 4 4 4 4 2000 8000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 42 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 0%
Data Units Read: 14,723 [7.53 GB]
Data Units Written: 4,508,008 [2.30 TB]
Host Read Commands: 243,468
Host Write Commands: 176,596,876
Controller Busy Time: 1,060
Power Cycles: 4
Power On Hours: 205
Unsafe Shutdowns: 3
Media and Data Integrity Errors: 0
Error Information Log Entries: 0
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 42 Celsius
Temperature Sensor 2: 46 Celsius
Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged
The part that worries us is:
Data Units Written: 4,508,008 [2.30 TB]
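Its lifetime is supposed to be 250 TB written, so having already used more than 2 TB in under two weeks is crazy and doesn't make any sense.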
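As a sanity check on that figure: NVMe reports data units in thousands of 512-byte blocks, so the bracketed 2.30 TB can be reproduced from the raw counter. Below is a minimal sketch that also averages the writes over the 205 power-on hours shown in the report. The function names are hypothetical helpers, not part of smartctl.

```shell
# One NVMe "data unit" is 1000 blocks of 512 bytes, i.e. 512,000 bytes.
data_units_to_bytes() {
    echo $(( $1 * 512000 ))
}

# Average bytes written per power-on hour (integer division, rounds down).
bytes_per_hour() {
    local units=$1 hours=$2
    echo $(( units * 512000 / hours ))
}

# Values taken from the smartctl report above.
echo "total written: $(data_units_to_bytes 4508008) bytes"      # 2308100096000
echo "average rate:  $(bytes_per_hour 4508008 205) bytes/hour"  # 11259024858
```

On those numbers the drive has averaged roughly 11 GB written per power-on hour.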
How can we figure out why this number is so high?
Thanks!
========================
@heynnema Thanks for taking a look! Here are the answers to your comment (FYI, I turned off Ubuntu swap after installing the new SSD):
root@~ $ free -h
total used free shared buff/cache available
Mem: 251Gi 42Gi 153Gi 3.0Mi 56Gi 208Gi
Swap: 0B 0B 0B
root@~ $ sysctl vm.swappiness
vm.swappiness = 60
root@~ $ grep -i swap /etc/fstab
#/swap.img none swap sw 0 0
========================== Additional information:
I ran iotop as follows:
iotop -ao
After letting it run for a while, I got the following:
Total DISK READ : 0.00 B/s | Total DISK WRITE : 147.34 K/s
Actual DISK READ: 0.00 B/s | Actual DISK WRITE: 357.38 K/s
TID PRIO USER DISK READ DISK WRITE> SWAPIN IO COMMAND
29546 be/4 999 0.00 B 212.62 M 0.00 % 0.01 % mongod --auth --bind_ip_all [WTCheck.tThread]
855 be/3 root 0.00 B 101.82 M 0.00 % 1.65 % [jbd2/nvme1n1p1-]
1841 be/4 root 0.00 B 33.69 M 0.00 % 0.00 % python /opt/conda/bin/supervisord -c /etc/supervisor/supervisord.conf
It looks like the culprits are mongo and jbd2. How can I figure out what jbd2 is doing? Thanks for all the help!
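Answer 1
You can check with iotop. It will not show the total amount written to the drive, but it does let you see whether an application is writing heavily to it. Install it with: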
sudo apt install iotop
Then run it with elevated privileges:
sudo iotop
You should see something like the following:
Total DISK READ: 0.00 B/s | Total DISK WRITE: 248.20 K/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
1780425 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.37 % [kworker~e_power_]
5170 be/4 terrance 0.00 B/s 248.20 K/s 0.00 % 0.00 % firefox ~orage #3]
1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init nosplash
2 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kthreadd]
3 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_gp]
4 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_par_gp]
6 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker~-kblockd]
8 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [mm_percpu_wq]
9 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/0]
10 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [rcu_sched]
11 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/0]
12 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [idle_inject/0]
14 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [cpuhp/0]
15 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [cpuhp/1]
16 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [idle_inject/1]
17 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/1]
18 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/1]
20 be/0 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kworker~-kblockd]
21 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [cpuhp/2]
keys: any: refresh q: quit i: ionice o: active p: procs a: accum
sort: r: asc left: SWAPIN right: COMMAND home: TID end: COMMAND
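To see who writes the most over time rather than at one instant, the accumulated view (`-a`, with `-o` to hide idle threads) is more useful. As a rough sketch, the helper below ranks threads by accumulated DISK WRITE. The name `rank_writers` is hypothetical. It assumes the column layout of the `iotop -ao` sample in the question, and that iotop's B/K/M/G units step by 1024.

```shell
# Rank threads by accumulated DISK WRITE, largest first.
# Expected layout (as in the iotop -ao sample in the question):
#   TID PRIO USER <read> <unit> <write> <unit> <swapin> % <io> % COMMAND...
rank_writers() {
    awk '
        $1 ~ /^[0-9]+$/ && NF >= 12 {
            # Scale the write figure to KiB (assumption: units step by 1024).
            kib = $6
            if ($7 == "B") kib = $6 / 1024
            else if ($7 == "M") kib = $6 * 1024
            else if ($7 == "G") kib = $6 * 1024 * 1024
            # Output: KiB written, TID, first word of the command.
            printf "%.0f\t%s\t%s\n", kib, $1, $12
        }
    ' | sort -rn
}

# Demo on the three lines reported in the question.
# On a live system the input would come from something like:
#   sudo iotop -a -o -b -n 60 | rank_writers
rank_writers <<'EOF'
29546 be/4 999 0.00 B 212.62 M 0.00 % 0.01 % mongod --auth --bind_ip_all [WTCheck.tThread]
855 be/3 root 0.00 B 101.82 M 0.00 % 1.65 % [jbd2/nvme1n1p1-]
1841 be/4 root 0.00 B 33.69 M 0.00 % 0.00 % python /opt/conda/bin/supervisord -c /etc/supervisor/supervisord.conf
EOF
```

In batch mode iotop prints one block per iteration, so the same thread appears several times; with `-a` the later lines carry the larger running totals and sort to the top.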
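Hope this helps!
Answer 2
The important variable is Percentage Used, which is currently 0%.
Once it reaches 1%, multiply the number of months since the drive was new by 100 to estimate its total lifetime.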
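That rule of thumb amounts to dividing the elapsed time by the fraction of life consumed. Here is a rough sketch with a hypothetical helper name, assuming wear continues at the same rate. It cannot produce an estimate while Percentage Used still reads 0%.

```shell
# Estimate total drive lifetime in months from the months elapsed since new
# and the "Percentage Used" value reported by smartctl (integer percent).
estimate_lifetime_months() {
    local months_elapsed=$1 percent_used=$2
    if [ "$percent_used" -le 0 ]; then
        echo "unknown (Percentage Used is still 0%)"
        return 1
    fi
    # lifetime = elapsed / (percent_used / 100), rounded down
    echo $(( months_elapsed * 100 / percent_used ))
}

# Example: 1% used after 3 months suggests about 300 months in total.
estimate_lifetime_months 3 1
```

The current value can be read from the `Percentage Used:` line of `smartctl -a` output, as in the report in the question.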