Very high iowait when using a 4TB disk


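I have an older PC that I use as a host for some VMs and Docker containers. The system ran well until a few days ago, when it became very slow; since then, iowait in Glances has been constantly in the red at 30-40%. The system is Ubuntu 20.04.1 LTS and fully up to date.

The PC has a 500GB SSD, a 4TB regular HDD, and 16GB of RAM. Nothing special, but as I said, it used to run fine. The VirtualBox VMs are all stored on the 4TB HDD. The VMs (all Ubuntu servers) run so slowly that I often get timeouts, and a simple apt update takes 2-3 minutes. Strikingly, the HDD light on the PC is lit all the time.

I moved all the VM VDI files to my NAS and moved a persistent volume of a MySQL Docker container off the 4TB disk. Suddenly the performance of the VMs and Docker containers was back to normal, so I suspect the 4TB disk is causing the problem.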

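First I ran sudo badblocks -v /dev/sda > badsectors.txt, but it did not find any bad sectors. It reported 0 errors.

Then I unmounted the disk and removed the drive from /etc/fstab. After a reboot, I ran sudo fsck -f /dev/sda1

To my surprise, I got this result: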

user@vs01:~$ sudo fsck -f /dev/sda1
[sudo] password for user:
fsck from util-linux 2.34
e2fsck 1.45.5 (07-Jan-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sda1: 298/244195328 files (9.4% non-contiguous), 91221367/976754176 blocks

The disk is clean, with no errors, yet it has an obvious performance problem. SMART monitoring doesn't show any problems either.

user@vs01:~$ sudo smartctl --all /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-51-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     TOSHIBA DT02ABA400
Serial Number:    X991S1DQS75H
LU WWN Device Id: 5 000039 995602197
Firmware Version: KQ000A
User Capacity:    4,000,787,030,016 bytes [4.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Form Factor:      3.5 inches
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-3 T13/2161-D revision 5
SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Oct 17 13:09:11 2020 HKT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (  39) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                (  120) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 510) minutes.
SCT capabilities:              (0x003d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   100   100   050    Pre-fail  Offline      -       0
  3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       6161
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   050    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   100   100   050    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       2584
 10 Spin_Retry_Count        0x0033   100   100   030    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       5
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       1
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       832
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       37 (Min/Max 23/52)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
220 Disk_Shift              0x0002   100   100   000    Old_age   Always       -       0
222 Loaded_Hours            0x0032   094   094   000    Old_age   Always       -       2529
223 Load_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
224 Load_Friction           0x0022   100   100   000    Old_age   Always       -       0
226 Load-in_Time            0x0026   100   100   000    Old_age   Always       -       878
240 Head_Flying_Hours       0x0001   100   100   001    Pre-fail  Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Interrupted (host reset)      70%      2561         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

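Is there anything else I can try or check to find the cause of this slow performance? For now the VMs run from the NAS and work fine, but I would still like to use the 4TB disk. I would expect a local disk to be faster than a network connection, wouldn't I?

Edit:

I copied one VM back, and this is what I now get from iostat -x 1: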

Device            r/s     rkB/s   rrqm/s  %rrqm r_await rareq-sz     w/s     wkB/s   wrqm/s  %wrqm w_await wareq-sz     d/s     dkB/s   drqm/s  %drqm d_await dareq-sz  aqu-sz  %util
sda            381.00   1740.50     7.00   1.80    1.38     4.57    9.00   2568.00    36.00  80.00    2.44   285.33    0.00      0.00     0.00   0.00    0.00     0.00    0.36  68.80

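My disk spins at 5400 rpm, so it is even slower...

Does this mean I have 381 read operations per second? If not, which columns show the read and write operations?

Answer 1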

While it is slow, run:

iostat -x 1

How many read and write operations per second are you doing, and what is %util?
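To read those numbers off the output: in iostat -x, r/s and w/s are the read and write requests completed per second (after merging), and %util is the last column. The sketch below applies that to the sample row posted in the question. The function name parse_iostat_row is made up for illustration, and the header and row are copied from the question with the whitespace collapsed.

```python
def parse_iostat_row(header: str, row: str) -> dict:
    """Map each `iostat -x` column name to its value for one device row."""
    names = header.split()
    values = row.split()
    # The first column is the device name; every other column is numeric.
    stats = {name: float(value) for name, value in zip(names[1:], values[1:])}
    stats["Device"] = values[0]
    return stats


# Header and sample row as posted in the question.
HEADER = ("Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz "
          "w/s wkB/s wrqm/s %wrqm w_await wareq-sz "
          "d/s dkB/s drqm/s %drqm d_await dareq-sz aqu-sz %util")
ROW = ("sda 381.00 1740.50 7.00 1.80 1.38 4.57 "
       "9.00 2568.00 36.00 80.00 2.44 285.33 "
       "0.00 0.00 0.00 0.00 0.00 0.00 0.36 68.80")

stats = parse_iostat_row(HEADER, ROW)
total_iops = stats["r/s"] + stats["w/s"]  # reads + writes completed per second
print(f'{stats["Device"]}: {stats["r/s"]:.0f} reads/s + {stats["w/s"]:.0f} writes/s '
      f'= {total_iops:.0f} IOPS, {stats["%util"]:.1f}% util')
```

For that sample this gives 381 reads/s plus 9 writes/s, so 390 operations per second at 68.8% utilization. The average read size (rareq-sz) is only about 4.6 kB, which is the kind of small-request pattern a spinning disk handles poorly.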

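A 7200 rpm disk is only good for about 120 IOPS, so if your total of read + write operations per second reaches that or more, this is not a fault: a mechanical disk is simply too slow for a random workload.

You can use iotop to see which process is doing the most disk I/O.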
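The figure of roughly 120 IOPS can be reproduced with back-of-the-envelope arithmetic: each random request costs about one seek plus, on average, half a platter revolution. The sketch below is only an estimate. The seek time is an assumed value chosen so that the 7200 rpm case lands near the 120 IOPS rule of thumb; it is not a specification of the Toshiba drive in the question, and real drives vary.

```python
def avg_rotational_latency_ms(rpm: float) -> float:
    """Average wait for the target sector: half of one revolution."""
    ms_per_revolution = 60_000.0 / rpm
    return ms_per_revolution / 2.0


def estimated_random_iops(rpm: float, avg_seek_ms: float) -> float:
    """Rough ceiling for random I/O: one seek plus half a rotation per request."""
    service_time_ms = avg_seek_ms + avg_rotational_latency_ms(rpm)
    return 1000.0 / service_time_ms


# Assumption for illustration only, not a measured or published value.
ASSUMED_SEEK_MS = 4.2

for rpm in (7200, 5400):
    iops = estimated_random_iops(rpm, ASSUMED_SEEK_MS)
    print(f"{rpm} rpm: about {iops:.0f} random IOPS (estimate)")
```

Whatever seek time is assumed, the 5400 rpm disk comes out lower than the 7200 rpm one, because its average rotational latency is about 5.6 ms instead of about 4.2 ms.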
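If iotop is not installed, a rough substitute is to read the per-process counters that Linux exposes in /proc/<pid>/io. This is a minimal sketch, not how iotop itself works: the function names are made up for illustration, the counters are cumulative totals since each process started rather than live rates, and reading other users' processes normally requires root.

```python
import os


def parse_proc_io(text: str) -> dict:
    """Parse the 'key: value' lines of /proc/<pid>/io into integers."""
    counters = {}
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if value.strip():
            counters[key.strip()] = int(value)
    return counters


def top_disk_users(limit: int = 5) -> list:
    """Return (bytes, pid, name) for the processes with the most disk I/O."""
    if not os.path.isdir("/proc"):
        return []  # not a Linux system
    results = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/io") as f:
                io = parse_proc_io(f.read())
            with open(f"/proc/{pid}/comm") as f:
                name = f.read().strip()
        except OSError:
            continue  # process exited, or we lack permission to read it
        # read_bytes / write_bytes count I/O that actually hit storage.
        total = io.get("read_bytes", 0) + io.get("write_bytes", 0)
        results.append((total, int(pid), name))
    return sorted(results, reverse=True)[:limit]


if __name__ == "__main__":
    for total, pid, name in top_disk_users():
        print(f"{total:>15} bytes  pid {pid:<7} {name}")
```

Running it twice a few seconds apart and comparing the totals gives a crude rate; iotop does that bookkeeping for you and is the better tool when it is available.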
