- Problem: the parent script randomly spawns too many worker scripts.
- Suspicion: a bash bug, but I can't identify it.
- OS: Ubuntu 16.04.03 LTS
- GNU bash: version 4.3.48(1)-release (x86_64-pc-linux-gnu)
- MySQL: 5.7.21 (MySQL repository)
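The parent script's job is to fetch data from MySQL and run worker scripts in the background using that data. The parent is responsible for keeping no more than seven worker scripts running until all the data has been processed. This approach worked flawlessly for years until about a month ago. I suspect my problem was caused by a recent update. The logic is as follows: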
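- The parent script fetches data from the MySQL server.
- The parent script loops and launches worker scripts in the background, passing each one the data fetched from MySQL.
- Each worker script creates and writes a "lock file".
- The parent script keeps at most seven worker scripts running by monitoring the number of worker scripts it has spawned and the number of worker lock files.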
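I know the parent script can loop and spawn several children before a child has had time to set up and write its lock file. That is why I maintain a child spawn count (SPWNCNT) to help avoid this. As I said, it used to work fine, but it no longer does.
Here is the relevant part of the parent script: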
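The lock-file count described above can also be taken without parsing `ls` output. What follows is a minimal sketch, assuming bash; `count_locks` is a hypothetical helper that is not part of the original script, and it takes the lock directory as its only argument.

```shell
#!/bin/bash
# count_locks DIR: print the number of *.Worker.lock files in DIR.
# Hypothetical helper for illustration; the name is not from the original script.
count_locks() {
    local dir=$1
    local restore
    local -a files
    # Remember the current nullglob setting so it can be restored afterwards.
    restore=$(shopt -p nullglob || true)
    # With nullglob set, a pattern that matches nothing expands to an
    # empty list instead of the literal pattern string.
    shopt -s nullglob
    files=("$dir"/*.Worker.lock)
    # Put nullglob back the way the caller had it.
    $restore
    echo "${#files[@]}"
}
```

Usage would look like `PCNT=$(count_locks "$TEMPDIR")`. The count comes from the length of a glob-expanded array, so no external `ls` or `wc` process is involved.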
#!/bin/bash
......
###########################
# Loop through all unique codes and process them.
echo "$0: NOTICE: Started processing codes; `date`."
COMCNT=0
SPWNCNT=0
TPCNT=0
while read COMNUM
do
    # Only permit a certain number of child processes
    # so to not overload the machine or chew up to
    # many MySQL connections.
    PCNT=`ls -1 $TEMPDIR/*.Worker.lock 2>/dev/null | wc -l`
    (( TPCNT = PCNT + SPWNCNT ))
    TPCNT=`echo ${TPCNT#-}`
    while [[ $TPCNT -gt 6 ]]
    do
        # Too many child processes.
        sleep 1
        PCNT=`ls -1 $TEMPDIR/*.Worker.lock 2>/dev/null | wc -l`
        (( TPCNT = PCNT + SPWNCNT ))
        TPCNT=`echo ${TPCNT#-}`
        if [[ $SPWNCNT -gt 0 ]]
        then
            (( SPWNCNT = SPWNCNT - PCNT ))
            if [[ $SPWNCNT -lt 0 ]]
            then
                SPWNCNT=0
            fi
        fi
    done # while -gt 6
    # Spawn a worker process
    ./Worker.sh $COMNUM &
    (( SPWNCNT = SPWNCNT + 1 ))
    (( COMCNT = COMCNT + 1 ))
    if [ "$VERBOSE" = "true" ]
    then
        echo "$0: NOTICE: Spawned $COMNUM, count $COMCNT ($SPWNCNT); `date`"
    fi
done << COMNUM_EOF
`cat $GEN_RATES_COMNUM_FILE | $MyCMD -B -N $MyDB`
COMNUM_EOF
......
Here is the debug output showing the problem (using #!/bin/bash -x):
...... many lines showing same logic working correctly ......
++ wc -l
++ ls -1 '/tmp/*.Worker.lock'
+ PCNT=0
+ (( TPCNT = 0 + 7 ))
++ echo 7
+ TPCNT=7
+ [[ 7 -gt 0 ]]
+ (( SPWNCNT = 7 - 0 ))
+ [[ 7 -lt 0 ]]
+ [[ 7 -gt 6 ]]
+ sleep 1
++ wc -l
++ ls -1 /tmp/032500.Worker.lock /tmp/032800.Worker.lock
/tmp/033300.Worker.lock /tmp/033900.Worker.lock /tmp/034700.Worker.lock
/tmp/035400.Worker.lock /tmp/035600.Worker.lock /tmp/036000.Worker.lock
/tmp/036200.Worker.lock /tmp/036400.Worker.lock /tmp/036600.Worker.lock
/tmp/037000.Worker.lock /tmp/039100.Worker.lock /tmp/039600.Worker.lock
/tmp/040200.Worker.lock /tmp/040400.Worker.lock /tmp/041000.Worker.lock
/tmp/041200.Worker.lock /tmp/041600.Worker.lock /tmp/041800.Worker.lock
/tmp/042000.Worker.lock /tmp/043600.Worker.lock /tmp/046200.Worker.lock
/tmp/048600.Worker.lock /tmp/049600.Worker.lock /tmp/052000.Worker.lock
/tmp/052300.Worker.lock /tmp/054300.Worker.lock /tmp/054500.Worker.lock
/tmp/054900.Worker.lock /tmp/055300.Worker.lock /tmp/056000.Worker.lock
/tmp/056200.Worker.lock
/tmp/056600.Worker.lock /tmp/056900.Worker.lock /tmp/057800.Worker.lock
/tmp/058600.Worker.lock /tmp/060400.Worker.lock /tmp/060700.Worker.lock
+ PCNT=39
+ (( TPCNT = 39 + 7 ))
++ echo 46
+ TPCNT=46
+ [[ 7 -gt 0 ]]
+ (( SPWNCNT = 7 - 39 ))
+ [[ -32 -lt 0 ]]
+ SPWNCNT=0
+ [[ 46 -gt 6 ]]
+ sleep 1
++ wc -l
++ ls -1 /tmp/032500.Worker.lock /tmp/032800.Worker.lock
/tmp/033300.Worker.lock /tmp/033900.Worker.lock /tmp/034700.Worker.lock
/tmp/035400.Worker.lock /tmp/035600.Worker.lock /tmp/036000.Worker.lock
/tmp/036200.Worker.lock /tmp/036400.Worker.lock /tmp/036600.Worker.lock
/tmp/037000.Worker.lock /tmp/039100.Worker.lock /tmp/039600.Worker.lock
/tmp/040200.Worker.lock /tmp/040400.Worker.lock /tmp/041000.Worker.lock
/tmp/041200.Worker.lock /tmp/041600.Worker.lock /tmp/041800.Worker.lock
/tmp/042000.Worker.lock /tmp/043600.Worker.lock /tmp/046200.Worker.lock
/tmp/048600.Worker.lock /tmp/049600.Worker.lock /tmp/052000.Worker.lock
/tmp/052300.Worker.lock /tmp/054300.Worker.lock /tmp/054500.Worker.lock
/tmp/054900.Worker.lock /tmp/055300.Worker.lock /tmp/056000.Worker.lock
/tmp/056200.Worker.lock
/tmp/056600.Worker.lock /tmp/056900.Worker.lock /tmp/057800.Worker.lock
/tmp/058600.Worker.lock /tmp/060400.Worker.lock /tmp/060700.Worker.lock
+ PCNT=39
+ (( TPCNT = 39 + 0 ))
++ echo 39
+ TPCNT=39
+ [[ 0 -gt 0 ]]
+ [[ 39 -gt 6 ]]
+ sleep 1
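So how did the script go from 7 running processes (TPCNT), which is the correct target, to 39 and then to 46? I can't see how the logic would allow this, and the debug output doesn't seem to shed any light on it, other than the bash shell simply flaking out.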
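Replaying the three polls from the trace through the same arithmetic shows where each number comes from. This is only a sketch: `poll` is a hypothetical stand-in for the body of the inner `while` loop, fed with the PCNT values copied from the debug output rather than with real lock-file counts.

```shell
#!/bin/bash
# Replay of the inner-loop arithmetic using the PCNT values seen in the trace.
SPWNCNT=7   # value on entry to the first poll shown in the trace
TPCNT=0

# poll PCNT: hypothetical stand-in for one pass of the inner while loop.
poll() {
    local PCNT=$1
    (( TPCNT = PCNT + SPWNCNT ))
    if [[ $SPWNCNT -gt 0 ]]; then
        (( SPWNCNT = SPWNCNT - PCNT ))
        if [[ $SPWNCNT -lt 0 ]]; then
            SPWNCNT=0
        fi
    fi
    echo "PCNT=$PCNT TPCNT=$TPCNT SPWNCNT=$SPWNCNT"
}

poll 0    # prints: PCNT=0 TPCNT=7 SPWNCNT=7
poll 39   # prints: PCNT=39 TPCNT=46 SPWNCNT=0
poll 39   # prints: PCNT=39 TPCNT=39 SPWNCNT=0
```

Read this way, 46 is the 39 lock files plus the 7 spawns that had not yet been subtracted, so the real jump is in the lock-file count, which goes from 0 to 39 between two consecutive polls. In the excerpt shown, the parent does not reach the spawn line between those two polls.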
Answer 1
For those who find this post later in life... here is my solution, based on the response from Modular above. Modular actually deserves the credit.
#!/bin/bash
.....
# Initialize the Child Process List (array)
CPLIST=()
# Throttle threshold: the parent waits while more than MAXCP children
# are tracked, so up to MAXCP + 1 (seven) may run at once.
MAXCP=6
# Worker loop to spawn and monitor child (worker) processes
while [[ SOME-CONDITION ]]
do
    # Monitor Child Process List (array)
    # Ensure that we don't exceed the threshold above
    while [[ ${#CPLIST[@]} -gt $MAXCP ]]
    do
        sleep 1
        # Check each child process to see if it's still running.
        for idx in "${!CPLIST[@]}"
        do
            # Is child process still alive?
            if ! kill -0 "${CPLIST[$idx]}" 2>/dev/null
            then
                # Child process is no longer running.
                # Remove it from the child process list (array).
                unset "CPLIST[$idx]"
            fi # if kill -0
        done # for idx
    done # while MAXCP
    # Spawn a child process
    ./MyProgram &
    # Append Child Process PID to Child Process List
    CPLIST+=("$!")
done # while
.....
# (end of file)
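As an alternative to polling the PID list with `kill -0` and `sleep 1`, bash 4.3 and later (the version named in the question qualifies) offers `wait -n`, which blocks until any one background job exits. The sketch below swaps in that technique and is not the approach from the answer above; `run_throttled` and its arguments are hypothetical names chosen for illustration.

```shell
#!/bin/bash
# run_throttled MAX CMD ITEM...: run "CMD ITEM" in the background for every
# ITEM, keeping at most MAX jobs running at once.
# Hypothetical helper; requires bash 4.3 or later for `wait -n`.
run_throttled() {
    local max=$1 cmd=$2
    shift 2
    local item
    for item in "$@"; do
        # Count the running background jobs; while the limit is reached,
        # block until any one of them exits.
        while (( $(jobs -rp | wc -l) >= max )); do
            # `|| true` discards the exit status of the job that finished.
            wait -n || true
        done
        "$cmd" "$item" &
    done
    # Wait for the remaining jobs to finish.
    wait
}
```

Under these assumptions, `run_throttled 7 ./Worker.sh "${codes[@]}"` would keep at most seven workers alive, where `codes` is a hypothetical array holding the values read from MySQL. The shell tracks its own children here, so the limit does not depend on lock files appearing on time.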