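I have an older PC that I use as a host for a few VMs and Docker containers. It ran fine until a few days ago, when it became very slow; Glances now shows iowait constantly in the red at 30-40%. The system is Ubuntu 20.04.1 LTS and fully up to date.
The PC has a 500GB SSD, a 4TB conventional HDD, and 16GB of RAM. Nothing fancy, but as I said, it was working well. The VirtualBox VMs are all stored on the 4TB HDD. The VMs (all Ubuntu Server) are so slow that I often get timeouts, and a simple apt update takes 2-3 minutes. Surprisingly, the HDD LED on the PC is lit constantly.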
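I moved all the VM VDI files to my NAS and moved a persistent volume belonging to a MySQL Docker container off the 4TB disk. Suddenly the VMs and Docker containers performed normally again, so I suspect the 4TB disk is causing the problem.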
First I ran sudo badblocks -v /dev/sda > badsectors.txt, but it did not find any bad sectors. It reported 0 errors.
Then I unmounted the disk and removed the drive from /etc/fstab. After rebooting, I ran sudo fsck -f /dev/sda1.
To my surprise, this is what I got:
user@vs01:~$ sudo fsck -f /dev/sda1
[sudo] password for user:
fsck from util-linux 2.34
e2fsck 1.45.5 (07-Jan-2020)
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
/dev/sda1: 298/244195328 files (9.4% non-contiguous), 91221367/976754176 blocks
The disk is clean, with no errors, yet there is an obvious performance problem. SMART monitoring does not show any problems either:
user@vs01:~$ sudo smartctl --all /dev/sda
smartctl 7.1 2019-12-30 r5022 [x86_64-linux-5.4.0-51-generic] (local build)
Copyright (C) 2002-19, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Device Model: TOSHIBA DT02ABA400
Serial Number: X991S1DQS75H
LU WWN Device Id: 5 000039 995602197
Firmware Version: KQ000A
User Capacity: 4,000,787,030,016 bytes [4.00 TB]
Sector Sizes: 512 bytes logical, 4096 bytes physical
Rotation Rate: 5400 rpm
Form Factor: 3.5 inches
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: ACS-3 T13/2161-D revision 5
SATA Version is: SATA 3.3, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is: Sat Oct 17 13:09:11 2020 HKT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 39) The self-test routine was interrupted
by the host with a hard or soft reset.
Total time to complete Offline
data collection: ( 120) seconds.
Offline data collection
capabilities: (0x5b) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 510) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000b 100 100 050 Pre-fail Always - 0
2 Throughput_Performance 0x0005 100 100 050 Pre-fail Offline - 0
3 Spin_Up_Time 0x0027 100 100 001 Pre-fail Always - 6161
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 7
5 Reallocated_Sector_Ct 0x0033 100 100 050 Pre-fail Always - 0
7 Seek_Error_Rate 0x000b 100 100 050 Pre-fail Always - 0
8 Seek_Time_Performance 0x0005 100 100 050 Pre-fail Offline - 0
9 Power_On_Hours 0x0032 094 094 000 Old_age Always - 2584
10 Spin_Retry_Count 0x0033 100 100 030 Pre-fail Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 7
191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always - 5
192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always - 1
193 Load_Cycle_Count 0x0032 100 100 000 Old_age Always - 832
194 Temperature_Celsius 0x0022 100 100 000 Old_age Always - 37 (Min/Max 23/52)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
220 Disk_Shift 0x0002 100 100 000 Old_age Always - 0
222 Loaded_Hours 0x0032 094 094 000 Old_age Always - 2529
223 Load_Retry_Count 0x0032 100 100 000 Old_age Always - 0
224 Load_Friction 0x0022 100 100 000 Old_age Always - 0
226 Load-in_Time 0x0026 100 100 000 Old_age Always - 878
240 Head_Flying_Hours 0x0001 100 100 001 Pre-fail Offline - 0
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA_of_first_error
# 1 Extended offline Interrupted (host reset) 70% 2561 -
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
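Is there anything else I can try or check to find the cause of this slowness? For now the VMs run from the NAS and work fine, but I would still like to use the 4TB disk. I would expect a local disk to be faster than a network connection, wouldn't I?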
Edit:
I copied one VM back, and this is what I now get from iostat -x 1:
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz aqu-sz %util
sda 381.00 1740.50 7.00 1.80 1.38 4.57 9.00 2568.00 36.00 80.00 2.44 285.33 0.00 0.00 0.00 0.00 0.00 0.00 0.36 68.80
My disk is a 5400 rpm drive, so it is even slower...
Am I doing 381 read operations per second? Or which columns show the read/write operations?
Answer 1
While it is slow, run:
iostat -x 1
How many reads and writes per second are you doing, and what is %util?
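To read those numbers off the output: in iostat -x, r/s is the number of read requests completed per second, w/s is the same for writes, and %util is the last column. A minimal Python sketch that pairs a header line with a device row follows. The helper name parse_iostat_row is made up for illustration, and the sample text is the sda row posted in the question.

```python
def parse_iostat_row(header: str, row: str) -> dict:
    """Pair the column names of an `iostat -x` header with one device row."""
    names = header.split()
    values = row.split()
    if len(names) != len(values):
        raise ValueError("header and row have different column counts")
    # The first column is the device name; every other column is numeric.
    stats = {"Device": values[0]}
    for name, value in zip(names[1:], values[1:]):
        stats[name] = float(value)
    return stats


# Header and sda row copied from the question.
HEADER = ("Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s "
          "%wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz "
          "aqu-sz %util")
ROW = ("sda 381.00 1740.50 7.00 1.80 1.38 4.57 9.00 2568.00 36.00 80.00 "
       "2.44 285.33 0.00 0.00 0.00 0.00 0.00 0.00 0.36 68.80")

stats = parse_iostat_row(HEADER, ROW)
total_iops = stats["r/s"] + stats["w/s"]
print(f"{stats['Device']}: {stats['r/s']:.0f} reads/s + "
      f"{stats['w/s']:.0f} writes/s = {total_iops:.0f} IOPS "
      f"at {stats['%util']:.1f}% util")
```

For the row in the question this gives 381 reads/s plus 9 writes/s, 390 operations per second in total, at 68.8% util.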
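A 7200 rpm disk is only good for about 120 IOPS, so if your reads plus writes per second add up to that much or more, this is not a fault: a mechanical disk is simply too slow for a random workload.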
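That ceiling follows from the mechanics: each random request costs roughly one average seek plus, on average, half a platter rotation. Below is a back-of-the-envelope sketch of that arithmetic in Python. The seek times are illustrative assumptions, not datasheet figures for any particular drive, and the model ignores caching, command queuing and sequential access.

```python
def rotational_latency_ms(rpm: float) -> float:
    """Average rotational latency: the time for half a revolution, in ms."""
    return 0.5 * 60_000.0 / rpm


def estimate_random_iops(rpm: float, avg_seek_ms: float) -> float:
    """Rough random IOPS: one request per (average seek + half rotation)."""
    return 1000.0 / (avg_seek_ms + rotational_latency_ms(rpm))


# The avg_seek_ms values below are assumed for illustration only.
for rpm in (7200, 5400):
    for seek in (4.0, 8.0):
        iops = estimate_random_iops(rpm, seek)
        print(f"{rpm} rpm, {seek} ms seek: ~{iops:.0f} IOPS")
```

With these assumed seek times a 7200 rpm disk lands at roughly 80 to 120 random IOPS and a 5400 rpm disk somewhat lower, which is the same order of magnitude as the rule of thumb above.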
You can use iotop to see which process is doing the most disk I/O.
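If iotop is not installed, the kernel's per-process counters in /proc/<pid>/io give a rough cumulative picture. The Python sketch below is only an approximation and is not how iotop itself works: it sums the read_bytes and write_bytes fields, typically needs root to see other users' processes, and reports totals since each process started rather than a live rate. The function names are made up for illustration.

```python
import os


def parse_proc_io(text: str) -> dict:
    """Parse the 'key: value' lines of /proc/<pid>/io into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            counters[key.strip()] = int(value)
    return counters


def top_io_processes(limit: int = 5) -> list:
    """Return (total_bytes, pid, name) tuples, largest disk users first."""
    if not os.path.isdir("/proc"):
        return []  # not a Linux-style /proc; nothing to report
    results = []
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/io") as f:
                io = parse_proc_io(f.read())
            with open(f"/proc/{pid}/comm") as f:
                name = f.read().strip()
        except OSError:
            continue  # process exited, or permission was denied
        total = io.get("read_bytes", 0) + io.get("write_bytes", 0)
        results.append((total, int(pid), name))
    return sorted(results, reverse=True)[:limit]


if __name__ == "__main__":
    for total, pid, name in top_io_processes():
        print(f"{pid:>7} {name:<20} {total / 1e6:10.1f} MB")
```

Running it twice a few seconds apart and comparing the totals gives a crude rate, which is closer to what iotop displays.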