I'm trying to write a script for managing jobs on a supercomputer. The details don't matter much; the key point is that the script starts a `tail -f` on a file as soon as that file appears. Left to itself this runs forever, but I want to stop it cleanly and exit the script once it detects that the job has finished.
Unfortunately, I'm stuck. I have tried several solutions, but none of them exits the script: it keeps running even after the job's end has been detected. The version below seems to me the most logical one, but it too runs forever.
How should I solve this? I'm familiar with bash, but not very advanced.
#!/bin/bash
# get the path to the job script, print help if not passed
jobscr="$1"
[[ -z "$jobscr" ]] && echo "Usage: submit-and-follow [script to submit]" && exit 2
# submit the job via SLURM (the job scheduler), and get the
# job ID (4-5-digit number) from its output, exit if failed
jobmsg=$(sbatch "$jobscr")
ret=$?
echo "$jobmsg"
if [ ! $ret -eq 0 ]; then exit $ret; fi
jobid=$(echo "$jobmsg" | cut -d " " -f 4)
# get the stdout and stderr file the job is using, we will log them in another
# file while we `tail -f` them (this is necessary due to a file corruption
# bug in the supercomputer, just assume it makes sense)
outf="$(scontrol show job $jobid | awk -F= '/StdOut=/{print $2}')"
errf="$(scontrol show job $jobid | awk -F= '/StdErr=/{print $2}')"
logf="${outf}.bkp"
# wait for job to start
echo "### Waiting for job $jobid to start..."
until [ -f "$outf" ]; do sleep 5; done
# ~~~~ HERE COMES THE PART IN QUESTION ~~~~ #
# Once it started, start printing the content of stdout and stderr
# and copy them into the log file
echo "### Job $jobid started, stdout and stderr:"
tail -f -n 100000 "$outf" "$errf" | tee "$logf" &
tail_pid=$! # catch the pid of the child process
# watch for job end (the job id disappears from the queue; consider this
# detection working), and kill the tail process
while : ; do
sleep 5
if [[ -z "$(squeue | grep $jobid)" ]]; then
echo "### Job $jobid finished!"
kill -2 $tail_pid
break
fi
done
I also tried another version with `tail` in the main process and the `while` loop running in a subprocess that kills the main process once the job ends, but that didn't work either. Either way, the script never terminates.
Answer 1
Thanks to @Paul_Pedant's comment, I managed to find the problem. Since I piped `tail` into `tee` in my original script, `$!` contained the PID of `tee`, not `tail`, so only `tee` was killed. The latter got a SIGPIPE, but that alone is not enough to stop it: `tail -f` only receives the SIGPIPE when it next writes to the broken pipe, which never happens once the job stops producing output.
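The pitfall can be reproduced in isolation, independent of SLURM. A minimal sketch (`sleep` and `cat` are just stand-ins for `tail` and `tee`):

```shell
#!/bin/bash
# After backgrounding a pipeline, $! holds the PID of the *last*
# command in the pipeline (cat here), not the first (sleep).
sleep 100 | cat &
pid=$!
ps -o comm= -p "$pid"   # prints "cat", not "sleep"
kill "$pid"             # kills only cat; sleep keeps running
```

This is exactly why `kill -2 $tail_pid` in my script only ever reached `tee`.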
The solution is in the following answer: https://stackoverflow.com/a/8048493/5099168
Implemented in my script, the relevant lines take the following form:
tail -f -n 100000 "$outf" "$errf" > >(tee "$logf") &
tail_pid=$!
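With process substitution the background job is `tail` itself, so `$!` now holds the right PID. A quick self-contained check (following `/dev/null` and writing to a throwaway temp file instead of the SLURM output files):

```shell
#!/bin/bash
# With > >(tee ...) the background job is tail itself,
# so $! is tail's PID and kill stops the whole construct.
tmp=$(mktemp)
tail -f /dev/null > >(tee "$tmp") &
tail_pid=$!
ps -o comm= -p "$tail_pid"   # prints "tail"
kill "$tail_pid"
rm -f "$tmp"
```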