monit: How do I restart multiple Tomcats without overloading the server?

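My server runs several independent Apache Tomcat instances, each of which needs a lot of time and CPU to start. Starting them all at once is not an option: it generates so much I/O that every service takes even longer to start, and some may fail to start at all because of internal timeouts.

Here is some pseudocode describing what I want to do. How can I achieve this with a monitrc file?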

check process service01 with pidfile /var/run/service01.pid
    start program = "/usr/sbin/service service01 start" with timeout 60 seconds
    stop program  = "/usr/sbin/service service01 stop"
    if does not exist then
        wait a random number of seconds (between 2 and 5 minutes)
        if the cpu load is < 100% then
            start program
        else 
            do nothing (check again in the next cycle)

check process service02 with pidfile /var/run/service02.pid
....

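This block is repeated for each of the 10 services.

The random wait is the crucial step. Without it, if the server is idle and no service is running (for example after "killall -9 java"), monit would check all services, see that the CPU load is low at that moment, and start them all at once.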

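Answer 1

You did not say much about your operating system, so I can only assume it is Linux (going by the kill -9 ... part). I do not know monit well either, but I assume it is flexible enough to let you retry starting a service when the start fails.

I assume the Tomcat instances are started by shell init scripts. Add the following somewhere near the beginning of those scripts: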

# edit the 3 lines to set your limits
LOAD_THRESHOLD=0.75
LOCK_TIME=30
TIME_LIMIT=120

LOCK_FILE='/var/lock/tomcat-delay.lock'

if [ -z "${TOMCAT_NOLOCK}" ]; then
    # simple locking mechanism to avoid simultaneous start of instances
    if [ -f "${LOCK_FILE}" ] && [ $(cat "${LOCK_FILE}") -gt $(date '+%s') ]; then
        exit 1
    else
        expr $(date '+%s') + ${LOCK_TIME} 1>"${LOCK_FILE}"
    fi
fi

T_TIME=0
while true; do
    # check for non-empty TOMCAT_NOWAIT
    if [ -n "${TOMCAT_NOWAIT}" ]; then
        break 1
    fi
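    read T_LOAD60 T_REST </proc/loadavg
    # check the current 1-minute system load average; awk compares the
    # values as numbers (expr would compare decimals as strings)
    if awk -v cur="${T_LOAD60}" -v max="${LOAD_THRESHOLD}" 'BEGIN { exit !(cur < max) }'; then
        break 1
    fi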
    # check for timeout
    if [ ${T_TIME} -ge ${TIME_LIMIT} ]; then
        # change to 'exit 1' to fail on timeout instead of proceeding
        break 1
    fi
    sleep 1s
    echo -n '.'
    T_TIME=$((${T_TIME} + 1))
done

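The code above does not actually check CPU load alone; it checks the system load average, which by design reflects all the factors that may degrade performance. The time limit is in seconds. If the load does not drop below the threshold within the time limit, the script goes ahead and starts your service anyway; that final break 1 can be changed to exit 1 to abort the start and let the monitoring daemon retry.

If you start the service manually (not from monit), it waits as well, which I consider an advantage. You can export the environment variable TOMCAT_NOWAIT with a non-empty value to avoid that.

Edit #1: Added a simple locking mechanism as a workaround for instances being started simultaneously. A non-empty TOMCAT_NOLOCK environment variable disables the locking. Set LOCK_TIME to the warm-up time of an instance so that the high load is detected correctly.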
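
To see how the timestamp lock behaves on its own, here is a minimal sketch that wraps the same logic in a function. The name try_lock and the temporary lock path are hypothetical and only serve the illustration; the real script uses /var/lock/tomcat-delay.lock and LOCK_TIME.

```shell
#!/bin/sh
# try_lock FILE SECONDS (hypothetical helper mirroring the lock logic above):
# fails if FILE holds an expiry timestamp that is still in the future,
# otherwise writes a new expiry (now + SECONDS) and succeeds.
try_lock() {
    lock_file=$1
    lock_time=$2
    now=$(date '+%s')
    if [ -f "${lock_file}" ] && [ "$(cat "${lock_file}")" -gt "${now}" ]; then
        return 1
    fi
    echo $((now + lock_time)) >"${lock_file}"
}

# Demonstration with a throwaway lock file in a temporary directory
demo_dir=$(mktemp -d)
demo_lock="${demo_dir}/tomcat-delay.lock"
try_lock "${demo_lock}" 30 && echo "first caller: lock taken, starting"
try_lock "${demo_lock}" 30 || echo "second caller: locked, retry later"
rm -rf "${demo_dir}"
```

Like the original, this check-then-write sequence is not atomic, so two scripts started in the same instant could both pass; it is a workaround, not a strict mutex.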

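Answer 2

I have now come up with a setup that does the job. After a reboot, or after several processes have failed, the CPU load is checked, and each service is started only once the load is below 1 or after a long delay. The following scripts work well in my environment:

Edit /etc/monit/monitrc: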

...
## Start Monit in the background (run as a daemon):
#
set daemon 120              # check services at 2-minute intervals
    with start delay 240    # optional: delay the first check by 4-minutes (by
#                           # default Monit check immediately after Monit start)

For each service, add this to /etc/monit/conf.d:

check process myname with pidfile /var/run/app0000.pid
    start program = "/usr/sbin/service app0000 start" with timeout 60 seconds
    stop program  = "/usr/sbin/service app0000 stop"
    if does not exist then exec "/root/bin/service_with_delay app0000 start"

Create the script /root/bin/service_with_delay:

#!/bin/bash
(
  # Try to lock /var/lock/service_with_delay.lock (fd 9) without blocking; exit if it is already held
  flock -n 9 || exit 1

  for i in `seq 1 10`; do

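    # start the service if the load is < 1.0, or on the 10th pass (after 9 waits of 30 seconds)

    read load ignore </proc/loadavg
    # awk compares numerically; expr would compare decimal values as strings
    flag=`awk -v cur="${load}" 'BEGIN { print ((cur < 1) ? 1 : 0) }'`
    if [ ${flag} -eq 1 ] || [ ${i} -eq 10 ]; then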

        echo `date` service_with_delay $1: pid $$ load ${load} i ${i} - starting >> /var/log/service_with_delay.log
        /usr/sbin/service $1 start

        # make sure next script getting the lock sees some load
        sleep 60
        break
    fi

    # wait
    echo `date` service_with_delay $1: pid $$ load ${load} i ${i} >> /var/log/service_with_delay.log
    sleep 30
  done
) 9> /var/lock/service_with_delay.lock
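
The flock -n pattern in this script means that a second invocation exits right away while the first one still holds the lock, so monit simply tries again in a later cycle. A minimal sketch of that behavior, assuming flock from util-linux is installed; the function name run_exclusive and the temporary lock file are hypothetical:

```shell
#!/bin/bash
# run_exclusive LOCKFILE COMMAND... (hypothetical helper):
# runs COMMAND only if the lock on LOCKFILE is free, otherwise returns 1 at once.
run_exclusive() {
    local lock_file=$1
    shift
    (
        flock -n 9 || exit 1   # non-blocking: give up if the lock is held
        "$@"
    ) 9>"${lock_file}"
}

demo_lock=$(mktemp)
run_exclusive "${demo_lock}" sleep 2 &    # first caller holds the lock for 2 seconds
sleep 1
run_exclusive "${demo_lock}" true || echo "second caller: lock busy, exiting"
wait
run_exclusive "${demo_lock}" true && echo "third caller: lock is free again"
rm -f "${demo_lock}"
```

The lock is released automatically when the subshell ends and file descriptor 9 is closed, which is why the script above keeps sleeping inside the subshell after starting a service.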
