High iowait and high load average on a monitoring server

2024-5-29

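I have a nagios server that was running fine until a few days ago. I stopped and restarted it to increase its RAM, and since then iowait on the server has risen sharply (above 20%, where it used to be below 1%). I tried putting the RAM back to its original size, but the problem persists.
I have read many similar iowait questions on serverfault, but never found an explanation that fits my case.
In iotop, most of the IO belongs to pdflush, which writes dirty pages back from the page cache, and to kjournald, the journaling thread of the ext3 filesystem. I don't know whether this is normal. Following other serverfault questions, I tried adding noatime in fstab. The ext3 filesystem is mounted in ordered data mode.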

Total DISK READ: 0.00 B/s | Total DISK WRITE: 210.44 K/s
  TID  PRIO  USER     DISK READ  DISK WRITE  SWAPIN     IO>    COMMAND
  650 be/3 root        0.00 B/s    0.00 B/s  0.00 % 99.99 % [kjournald]
11482 be/4 root        0.00 B/s    0.00 B/s  0.00 % 98.42 % [pdflush]
12167 be/4 nagios      0.00 B/s    0.00 B/s  0.00 %  0.12 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg
   11 rt/3 root        0.00 B/s    0.00 B/s  0.00 %  0.10 % [migration/3]
12168 be/4 nagios      0.00 B/s    0.00 B/s  0.02 %  0.08 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg
12165 be/4 nagios      0.00 B/s    0.00 B/s 98.42 %  0.02 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg
 2600 be/3 root        0.00 B/s    0.00 B/s  0.00 %  0.02 % auditd
12164 be/4 nagios      0.00 B/s    0.00 B/s  0.00 %  0.00 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg
    8 rt/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/2]
   20 rt/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/6]
   26 be/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [events/0]
   23 rt/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/7]
 3047 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % snmpd -Ln -Lf /dev/null -p /var/run/snmpd.pid -a
12169 be/4 nagios      0.00 B/s    0.00 B/s  0.12 %  0.00 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg
   14 rt/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/4]
 2601 be/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % auditd
    5 rt/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/1]
   17 rt/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [migration/5]
 5228 be/4 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % bash
   10 rt/3 root        0.00 B/s    0.00 B/s  0.00 %  0.00 % [watchdog/2]
   13 rt/3 root        0.00 B/s    0.00 B/s  0.10 %  0.00 % [watchdog/3]

The following line

 12165 be/4 nagios      0.00 B/s    0.00 B/s 98.42 %  0.02 % nagios -d /srv/eyesofnetwork/nagios-3.4.1/etc/nagios.cfg

seems quite surprising: I am using almost no swap, so how can it show 98.42% swap-in?

free -o
             total       used       free     shared    buffers     cached
Mem:       4046468    3163796     882672          0     103548    2193604
Swap:      4192956       1572    4191384
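
To put a number on "almost no swap", here is a minimal sketch that computes the share of swap in use from the `free` output above. The function name `swap_used_percent` is made up for illustration. As far as iotop's documentation describes it, the SWAPIN column is the percentage of time a thread spent waiting on swap-in during the sampling interval, not the amount of swap in use, so the two figures are not directly comparable.

```python
def swap_used_percent(free_output: str) -> float:
    """Return used swap as a percentage of total swap, from `free` output."""
    for line in free_output.splitlines():
        fields = line.split()
        if fields and fields[0] == "Swap:":
            total, used = int(fields[1]), int(fields[2])
            return 100.0 * used / total if total else 0.0
    raise ValueError("no 'Swap:' line found")


# The output pasted in the question (values are in KiB).
sample = """\
             total       used       free     shared    buffers     cached
Mem:       4046468    3163796     882672          0     103548    2193604
Swap:      4192956       1572    4191384
"""

print(f"{swap_used_percent(sample):.3f}% of swap in use")
```

With the figures from the question this comes out to well under a tenth of a percent.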

top shows nothing specific apart from the high load and the high iowait:

top - 10:07:56 up 12 days, 23:42,  4 users,  load average: 8.60, 9.29, 9.85
Tasks: 177 total,   1 running, 176 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.1%us,  0.0%sy,  0.0%ni, 77.2%id, 22.6%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   4046468k total,  3165500k used,   880968k free,   104204k buffers
Swap:  4192956k total,     1572k used,  4191384k free,  2201500k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND            
 5246 root      15   0 14252 2632  836 R  0.3  0.1   0:03.94 top                
    1 root      15   0 10372  696  584 S  0.0  0.0   0:03.61 init               
    2 root      RT  -5     0    0    0 S  0.0  0.0   0:14.80 migration/0        
    3 root      34  19     0    0    0 S  0.0  0.0   0:00.73 ksoftirqd/0        
    4 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/0         
    5 root      RT  -5     0    0    0 S  0.0  0.0   0:13.93 migration/1        
    6 root      34  19     0    0    0 S  0.0  0.0   0:01.75 ksoftirqd/1        
    7 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/1         
    8 root      RT  -5     0    0    0 S  0.0  0.0   0:09.51 migration/2        
    9 root      34  19     0    0    0 S  0.0  0.0   0:01.09 ksoftirqd/2        
   10 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/2         
   11 root      RT  -5     0    0    0 S  0.0  0.0   0:08.98 migration/3        
   12 root      34  19     0    0    0 S  0.0  0.0   0:01.46 ksoftirqd/3        
   13 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/3         
   14 root      RT  -5     0    0    0 S  0.0  0.0   0:20.36 migration/4        
   15 root      34  19     0    0    0 S  0.0  0.0   0:01.15 ksoftirqd/4        
   16 root      RT  -5     0    0    0 S  0.0  0.0   0:00.00 watchdog/4
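
For reference, the `wa` figure in top can be reproduced from two samples of the aggregate `cpu` line in `/proc/stat`. The sketch below assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal); the helper names and the sample counter values are invented for illustration.

```python
def parse_cpu_line(stat_text: str) -> list[int]:
    """Return the counters of the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(value) for value in line.split()[1:]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two /proc/stat samples."""
    first, second = parse_cpu_line(before), parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    # Sum user..steal only; guest time is already included in user.
    total = sum(deltas[:8])
    return 100.0 * deltas[4] / total if total else 0.0


# Invented samples; on a live system read /proc/stat twice, a few seconds apart.
before = "cpu 0 0 0 0 0 0 0 0 0 0"
after = "cpu 10 0 0 770 220 0 0 0 0 0"
print(f"iowait: {iowait_percent(before, after):.1f}%")
```

Sampling over an interval matters here because the counters in `/proc/stat` are cumulative since boot.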

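Stopping the nagios processes brings the system load back to normal (i.e. < 1), but iowait stays high.

In atop, DSK is 100% busy even with no nagios process running. Could something be wrong with my hard drive? (It is a Western Digital Green drive, which should not be running in a server like this.) I see no unusual messages in dmesg or syslog.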
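
A disk busy percentage can be cross-checked by hand from `/proc/diskstats`: the tenth statistics field of each device line counts the milliseconds the device spent doing I/O, which is the counter behind iostat's `%util`. The sketch below assumes atop's busy figure is comparable. The device name `sda`, the helper names, and the counter values are hypothetical.

```python
def io_ticks_ms(diskstats_text: str, device: str) -> int:
    """Return the 'time spent doing I/Os (ms)' counter for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # fields: major, minor, name, then the per-device statistics
        if len(fields) >= 13 and fields[2] == device:
            return int(fields[12])
    raise ValueError(f"device {device!r} not found")


def busy_percent(before: str, after: str, device: str, interval_s: float) -> float:
    """Percentage of the interval during which the device had I/O in flight."""
    delta_ms = io_ticks_ms(after, device) - io_ticks_ms(before, device)
    return min(100.0, 100.0 * delta_ms / (interval_s * 1000.0))


# Invented samples taken 5 seconds apart for a hypothetical device 'sda'.
before = "   8       0 sda 1000 0 8000 500 2000 0 16000 900 0 1200 1400"
after = "   8       0 sda 1000 0 8000 500 2100 0 16800 3300 0 3700 3900"
print(f"sda busy: {busy_percent(before, after, 'sda', 5.0):.0f}%")
```

A value that stays near 100% while only a couple of hundred KB/s are being written would point at the device itself rather than at the workload.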

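Answer 1

Oh dear. You're using WD Green disks in something other than a desktop PC?

Don't.

They are slow, unreliable (they go to sleep and drop out of RAID arrays), and totally unsuitable for what you are trying to do.

If you are seeing high IOWait, it means your disk subsystem can't cope with the amount of disk IO being requested.

The easy way to fix this is to add more disks (ideally a whole bunch of them in a RAID6 array).

You should also check general disk health with smartctl, and take backups (you should be doing this regularly anyway, but if you're abusing WD Greens, I'd be extra careful).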
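
The smartctl check can be scripted along these lines. This is only a sketch: it assumes smartmontools is installed, that the script runs as root, and that the disk reports the ATA-style "overall-health" line (SCSI/SAS devices word it differently). The function names and the device path in the usage comment are hypothetical.

```python
import subprocess
from typing import Optional


def smart_health_passed(smartctl_output: str) -> Optional[bool]:
    """Interpret `smartctl -H` output: True/False, or None if no verdict line."""
    for line in smartctl_output.splitlines():
        if "overall-health" in line:
            return "PASSED" in line
    return None


def check_disk(device: str) -> Optional[bool]:
    """Run `smartctl -H` on a device and return its health verdict."""
    # smartctl encodes findings in its exit status, so do not raise on non-zero.
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return smart_health_passed(result.stdout)


# Usage (hypothetical device path, needs root):
#   print(check_disk("/dev/sda"))
```

A passing health verdict does not rule out a slow or failing drive, so it is worth also reading the full attribute table with `smartctl -a`.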

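Answer 2

Clear the swap with the swapoff and swapon commands. After that, stop nagios, check with ps -ef|grep nagios whether any of its processes are still running, and then start nagios again.

The first command below shows which partition holds the swap area; the other two turn swap off and back on for that partition: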

swapon -s

swapoff /dev/sdaN

swapon /dev/sdaN
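
The information shown by `swapon -s` is also available in `/proc/swaps`, so the partition to pass to swapoff can be looked up programmatically. A minimal sketch follows; the function name and the device `/dev/sda2` in the sample are hypothetical, and paths containing spaces are not handled.

```python
def parse_swaps(swaps_text: str) -> list[dict]:
    """Parse /proc/swaps content into one dict per active swap area."""
    entries = []
    for line in swaps_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) >= 5:
            entries.append({
                "device": fields[0],
                "type": fields[1],
                "size_kib": int(fields[2]),
                "used_kib": int(fields[3]),
                "priority": int(fields[4]),
            })
    return entries


# Sample shaped like /proc/swaps; the device name is made up.
sample = (
    "Filename                Type            Size    Used    Priority\n"
    "/dev/sda2               partition       4192956 1572    -1\n"
)
for area in parse_swaps(sample):
    print(area["device"], area["used_kib"], "KiB used")
```

Note that swapoff has to move every swapped-out page back into RAM, so it should only be run when free memory comfortably exceeds the amount of swap in use.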
