How to monitor the uptime and downtime of a Linux service

Question 1

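If you want to monitor many applications on different servers, go with Nagios; if you want to monitor a specific application, file ownership, or anything else along those lines, go with Monit.

You can use Monit to monitor daemon processes or similar programs running on localhost. Monit is particularly useful for monitoring daemons such as those started at system boot from /etc/init.d/, for example sendmail, sshd, apache and mysql.

Unlike many monitoring systems, Monit can act when an error situation occurs. For example, if sendmail is not running, Monit can start it again automatically, or if apache is using too many resources (e.g. if a DoS attack is in progress), Monit can stop or restart apache and send you an alert message. Monit can also monitor process characteristics, such as how much memory or how many CPU cycles a process is using.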

Update: configuration section

Monit is easiest to install with aptitude or apt-get:

sudo aptitude install monit

Once monit is installed, you can add programs and processes to the configuration file:

vim /etc/monit/monitrc

set daemon 3                    # check services at 3-second intervals
set logfile /var/log/monit.log  # you can see what monit is doing
set alert [email protected]        # receive all alerts
include /etc/monit.d/*          # add monit script path

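Then create monit scripts for your applications. You just need to put each script in /etc/monit.d/ (for example /etc/monit.d/httpd.monit), then reload the monit service and check the monit log with tail -f /var/log/monit.log.

Here are some example scripts: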

For Apache

check process apache with pidfile /usr/local/apache/logs/httpd.pid
   start program = "/etc/init.d/httpd start" with timeout 60 seconds
   stop program  = "/etc/init.d/httpd stop" 
   if cpu > 60% for 2 cycles then alert
   if cpu > 80% for 5 cycles then restart
   if totalmem > 200.0 MB for 5 cycles then restart
   if children > 250 then restart
   if loadavg(5min) greater than 10 for 8 cycles then stop
   if failed host www.tildeslash.com port 80 protocol http
      and request "/monit/doc/next.php"
      then restart
   if failed port 443 type tcpssl protocol http
      with timeout 15 seconds
      then restart
   if 3 restarts within 5 cycles then timeout
   depends on apache_bin
   group server

For the Safesquid proxy

# Check if the safesquid process is running by monitoring the PID recorded in /opt/safesquid/safesquid/run/safesquid.pid
check process safesquid with pidfile /opt/safesquid/safesquid/run/safesquid.pid
group root
start program = "/etc/init.d/safesquid start"
stop program = "/etc/init.d/safesquid stop"
mode active
# If safesquid process is active it must be updating the performance log at
# /opt/safesquid/safesquid/logs/performance/performance.log every 2 seconds.
# If the file is more than 3 seconds old we definitely have a problem

check file "safesquid-PERFORMANCELOG" with path /opt/safesquid/safesquid/logs/performance/performance.log
  if timestamp > 3 SECOND then alert
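
The create-then-reload workflow described above can be sketched as a small shell helper. This is only a sketch under assumptions: the function name write_monit_check, its arguments, and the myapp paths in the usage note are hypothetical and not part of monit. The stanza it writes uses only statements that already appear in the examples above.

```shell
#!/bin/bash
# Sketch: write a minimal monit process check into a directory of monit scripts.
# write_monit_check is a hypothetical helper, not something monit ships.
# Usage: write_monit_check NAME PIDFILE INITSCRIPT [DIR]
write_monit_check() {
  local name=$1 pidfile=$2 initscript=$3 dir=${4:-/etc/monit.d}
  local target="$dir/$name.monit"

  # Refuse to continue if a required argument is missing.
  if [[ -z $name || -z $pidfile || -z $initscript ]]; then
    echo "usage: write_monit_check NAME PIDFILE INITSCRIPT [DIR]" >&2
    return 1
  fi

  # Same statements as the Apache and Safesquid examples, reduced to the basics.
  cat > "$target" <<EOF
check process $name with pidfile $pidfile
   start program = "$initscript start"
   stop program  = "$initscript stop"
   if 3 restarts within 5 cycles then timeout
EOF

  # Print the path of the file that was written.
  echo "$target"
}
```

For example, write_monit_check myapp /var/run/myapp.pid /etc/init.d/myapp would write /etc/monit.d/myapp.monit. After that, monit -t checks the control file syntax and monit reload makes the running daemon pick up the new file.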

Question 2

If you know the pid of the service you want to monitor, I wrote this a while ago to track the resource usage of specific things on a server:

http://cognitivedissonance.ca/cogware/plog

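It is completely stable, very low-profile, and fairly simple to use. It reports a slightly more detailed version of what you would see in top, but less often, and writes to a log file. So, for example, you can set it to check a process every minute or every five minutes. That may not give you many clues about the cause, but it will give you a window for when the process stopped.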
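
The sampling idea described above can be approximated with standard Linux tools. The following is a minimal sketch and not plog itself: the function name log_pid_sample and the log line format are assumptions made for illustration, and it reads /proc/PID/status, so it is Linux-only.

```shell
#!/bin/bash
# Sketch of periodic per-process logging; this is NOT plog, just the same idea.
# log_pid_sample is a hypothetical helper.
# Usage: log_pid_sample PID LOGFILE
# Appends one timestamped line; returns 1 once the process no longer exists.
log_pid_sample() {
  local pid=$1 logfile=$2 now rss state
  now=$(date '+%Y-%m-%d %H:%M:%S')

  if [[ -r /proc/$pid/status ]]; then
    # VmRSS is resident memory in kB; State is a one-letter code such as S or R.
    rss=$(awk '/^VmRSS:/ {print $2}' "/proc/$pid/status")
    state=$(awk '/^State:/ {print $2}' "/proc/$pid/status")
    echo "$now pid=$pid state=$state rss_kb=${rss:-unknown}" >> "$logfile"
  else
    # The first line of this kind marks the start of the downtime window.
    echo "$now pid=$pid NOT RUNNING" >> "$logfile"
    return 1
  fi
}
```

Running it in a loop such as "while log_pid_sample 1234 /var/log/myservice-usage.log; do sleep 60; done" (pid and path are placeholders) gives one sample per minute and stops at the first sample where the process is gone.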

Question 3

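In the comments, you mentioned that you are trying to monitor a JBoss web server.

You asked how to monitor your service, not your process. It doesn't matter that JBoss is still running if the process has become wedged and no longer answers queries. You want to know whether the service is not working, not merely whether the process has died.

If you don't want to run a large-scale service monitoring package such as Nagios, Icinga, Zabbix, OpenNMS, Shinken or Zenoss, you can always put something together with curl or wget.

Create a script, which we'll call /root/bin/check_web, and run it from crontab: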

*/5 * * * * /root/bin/check_web http://www.example.com [email protected]

The script might look something like this:

#!/bin/bash

# bash has no "!~" operator, so negate a "=~" match instead
if [[ ! $1 =~ ^https?://[a-z][a-z.]+ ]]; then
  echo "ERROR: that doesn't look like a URL ($1)" >&2
  exit 1
elif [[ ! $2 =~ .+@[a-z0-9.-]+ ]]; then
  echo "ERROR: that doesn't look like an email address ($2)" >&2
  exit 1
fi

flag="/tmp/m-${1//[^[:alnum:]:.-]/_}"

wget -O /dev/null -q "$1"
result=$?

if [[ $result -eq 0 ]]; then
  if [ -f "$flag" ]; then
    date | Mail -s "Clear: $1" "$2"
    rm -f "$flag"
  fi
else
  if [ ! -f "$flag" ]; then
    # report wget's saved exit status; $? here would be the status of the [ test above
    echo "error: $result" | Mail -s "OFFLINE: $1" "$2"
    touch "$flag"
  fi
fi

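The nesting of the ifs helps reduce email noise when something goes wrong. While you are working on a problem, you don't need another notification every 5 minutes to distract you. But it's nice to be notified that things have recovered, in case the problem was caused by a spontaneous restart or a brief network outage.

With a slightly generic script like this, you can monitor multiple sites and set up multiple email recipients for notifications.

Create a few more scripts like this, perhaps add functionality to raise a WARNING for a slow response as distinct from a CRITICAL when the service is completely offline, then provide a web front end for browsing and managing the status of the various hosts, and create a dedicated daemon that runs these instead of cron, and you've got Nagios. :-)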
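
The flag-file logic amounts to sending mail only when the state changes. As an illustration, the decision can be isolated in a small function. The name notify_action and the strings it prints are hypothetical and exist only for this sketch; the function does not touch the flag file itself, so the caller would still create or remove it as check_web does.

```shell
#!/bin/bash
# Sketch: decide whether to send mail, mirroring the nested ifs in check_web.
# notify_action is a hypothetical helper.
# Usage: notify_action RESULT FLAGFILE
#   RESULT   exit status of the check (0 means the service answered)
#   FLAGFILE exists while the service is already known to be down
notify_action() {
  local result=$1 flag=$2

  if [[ $result -eq 0 ]]; then
    if [[ -f $flag ]]; then
      echo "clear"      # was down, now up: send the recovery mail once
    else
      echo "none"       # still up: stay quiet
    fi
  else
    if [[ -f $flag ]]; then
      echo "none"       # still down and already reported: stay quiet
    else
      echo "offline"    # was up, now down: send the outage mail once
    fi
  fi
}
```

With a check every 5 minutes, an outage lasting an hour therefore produces two mails (one "offline", one "clear") rather than twelve.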
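
The WARNING versus CRITICAL distinction mentioned above could be sketched roughly as follows. The function names classify_check and check_url, the 2000 ms default threshold, and the 30-second timeout are all assumptions for illustration. check_url relies on curl's --write-out variable time_total, while classify_check is plain integer comparison.

```shell
#!/bin/bash
# Sketch: map a check result to OK / WARNING / CRITICAL.
# Both function names are hypothetical.
# Usage: classify_check STATUS ELAPSED_MS [WARN_MS]
classify_check() {
  local status=$1 elapsed_ms=$2 warn_ms=${3:-2000}

  if [[ $status -ne 0 ]]; then
    echo "CRITICAL"     # the request failed: service is offline
  elif [[ $elapsed_ms -gt $warn_ms ]]; then
    echo "WARNING"      # the service answered, but slowly
  else
    echo "OK"
  fi
}

# Usage: check_url URL [WARN_MS]
check_url() {
  local url=$1 warn_ms=${2:-2000} seconds status elapsed_ms

  # -f makes curl return non-zero on HTTP errors; -w prints total time in seconds.
  seconds=$(curl -s -f -o /dev/null --max-time 30 -w '%{time_total}' "$url")
  status=$?

  # Convert to whole milliseconds so bash can compare integers.
  elapsed_ms=$(awk -v s="${seconds:-0}" 'BEGIN { printf "%d", s * 1000 }')
  classify_check "$status" "$elapsed_ms" "$warn_ms"
}
```

Calling check_url http://www.example.com 1500 would print one of the three levels, which a cron wrapper could then use to pick the mail subject.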
