Excessive writes on an NVMe drive - how to diagnose?

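One of our Samsung 2TB NVMe SSDs failed recently, so we replaced it with a new one and started watching the SMART data closely.

Here is the output for the drive, which was installed less than two weeks ago: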

root@~ $ smartctl -a /dev/nvme0n1p1
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-53-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung SSD 970 EVO Plus 2TB
Serial Number:                      S59CNZFNA02015F
Firmware Version:                   2B2QEXM7
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization:            129,469,706,240 [129 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 5a019ed120
Local Time is:                      Sun Nov 22 22:11:40 2020 EST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     85 Celsius
Critical Comp. Temp. Threshold:     85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.50W       -        -    0  0  0  0        0       0
 1 +     5.90W       -        -    1  1  1  1        0       0
 2 +     3.60W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        42 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    14,723 [7.53 GB]
Data Units Written:                 4,508,008 [2.30 TB]
Host Read Commands:                 243,468
Host Write Commands:                176,596,876
Controller Busy Time:               1,060
Power Cycles:                       4
Power On Hours:                     205
Unsafe Shutdowns:                   3
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               42 Celsius
Temperature Sensor 2:               46 Celsius

Error Information (NVMe Log 0x01, max 64 entries)
No Errors Logged

The part that worries us is:

Data Units Written:                 4,508,008 [2.30 TB]

It has a rated lifetime of 250 TB, so having already written more than 2 TB in under two weeks is insane and makes no sense.
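To put the counter in perspective, here is a rough sketch of the arithmetic. It assumes the NVMe convention that one data unit is 1000 blocks of 512 bytes, and it treats Power On Hours as elapsed time, which is only an approximation. The function names are made up for illustration.

```shell
# Convert an NVMe "Data Units Written" count to terabytes.
# One data unit = 1000 * 512 bytes = 512,000 bytes.
units_to_tb() {
  awk -v u="$1" 'BEGIN { printf "%.3f\n", u * 512000 / 1e12 }'
}

# Average terabytes written per day, given data units and power-on hours.
tb_per_day() {
  awk -v u="$1" -v h="$2" 'BEGIN { printf "%.3f\n", u * 512000 / 1e12 / h * 24 }'
}

units_to_tb 4508008        # prints 2.308
tb_per_day 4508008 205     # prints 0.270
```

With the values from the SMART output above, that works out to roughly 0.27 TB written per day of power-on time.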

How can we figure out why this number is so high?

Thanks!

========================

@heynnema Thanks for looking into this! Here are the answers to your comment (FYI, I turned off Ubuntu swap after installing the new SSD):

root@~ $ free -h
              total        used        free      shared  buff/cache   available
Mem:          251Gi        42Gi       153Gi       3.0Mi        56Gi       208Gi
Swap:            0B          0B          0B

root@~ $ sysctl vm.swappiness
vm.swappiness = 60

root@~ $ grep -i swap /etc/fstab
#/swap.img  none    swap    sw  0   0

========================== Additional information:

I ran iotop as follows:

iotop -ao

After letting it run for a while, I got the following:

Total DISK READ :       0.00 B/s | Total DISK WRITE :     147.34 K/s
Actual DISK READ:       0.00 B/s | Actual DISK WRITE:     357.38 K/s
  TID  PRIO  USER     DISK READ DISK WRITE>  SWAPIN      IO    COMMAND
29546 be/4 999           0.00 B    212.62 M  0.00 %  0.01 % mongod --auth --bind_ip_all [WTCheck.tThread]
  855 be/3 root          0.00 B    101.82 M  0.00 %  1.65 % [jbd2/nvme1n1p1-]
 1841 be/4 root          0.00 B     33.69 M  0.00 %  0.00 % python /opt/conda/bin/supervisord -c /etc/supervisor/supervisord.conf

It looks like the culprits are mongo and jbd2. How can I figure out what jbd2 is doing? Thanks for the help, everyone!

Answer 1

You can check with iotop. It won't show the total amount written to the drive, but it lets you see whether an application is writing heavily to it.

sudo apt install iotop

Then run it with elevated privileges:

sudo iotop

You should see something like the following:

Total DISK READ:         0.00 B/s | Total DISK WRITE:       248.20 K/s
Current DISK READ:       0.00 B/s | Current DISK WRITE:       0.00 B/s
    TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND        
1780425 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.37 % [kworker~e_power_]
   5170 be/4 terrance    0.00 B/s  248.20 K/s  0.00 %  0.00 % firefox ~orage #3]
      1 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % init nosplash
      2 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kthreadd]
      3 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [rcu_gp]
      4 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [rcu_par_gp]
      6 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker~-kblockd]
      8 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [mm_percpu_wq]
      9 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/0]
     10 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [rcu_sched]
     11 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/0]
     12 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [idle_inject/0]
     14 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [cpuhp/0]
     15 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [cpuhp/1]
     16 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [idle_inject/1]
     17 rt/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/1]
     18 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [ksoftirqd/1]
     20 be/0 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [kworker~-kblockd]
     21 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [cpuhp/2]
  keys:  any: refresh  q: quit  i: ionice  o: active  p: procs  a: accum        
  sort:  r: asc  left: SWAPIN  right: COMMAND  home: TID  end: COMMAND          

Hope this helps!

Answer 2
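Since iotop only shows per-process activity, a device-level total has to come from somewhere else. One option is /proc/diskstats, where field 3 is the device name and field 10 is the number of sectors written since boot, always counted in 512-byte units. The sketch below is a minimal helper, with a made-up function name and nvme0n1 as an example device name that you would need to adjust.

```shell
# Print cumulative GB written since boot for one device.
# Reads diskstats-format lines from stdin.
# Field 3 = device name, field 10 = sectors written (512 bytes each).
written_gb() {
  awk -v dev="$1" '$3 == dev { printf "%.2f\n", $10 * 512 / 1e9 }'
}

# Example call on a live system; "nvme0n1" is an assumed device name.
if [ -r /proc/diskstats ]; then
  written_gb nvme0n1 < /proc/diskstats
fi
```

Running it twice a few minutes apart and subtracting the two values gives the write volume for that interval, which can then be compared with what iotop attributes to individual processes.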

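The important value is Percentage Used, which is currently 0%. Once it reaches 1%, multiply the number of months since the drive was new by 100 to estimate its total lifetime.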
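That rule of thumb is a linear extrapolation, and it can be sketched as a small helper. The function name and the 6-month example value are hypothetical, and it assumes wear continues at the same rate.

```shell
# Estimate total drive lifetime in months from the time in service
# and the SMART "Percentage Used" value (linear extrapolation).
estimate_life_months() {
  awk -v m="$1" -v p="$2" 'BEGIN {
    if (p + 0 <= 0) { print "n/a"; exit }   # no estimate possible at 0%
    printf "%.0f\n", m * 100 / p
  }'
}

estimate_life_months 6 1     # prints 600
estimate_life_months 0.5 0   # prints n/a
```

At 0%, as in the output above, there is not enough wear yet to extrapolate from, so the helper reports n/a.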

See: How to check system health?