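I wrote a network daemon that forks child processes to handle TCP connections. On SIGINT the main process sends a kill to each child so that the child can clean up and collect some final statistics.
This works in almost all cases, and the children terminate very quickly. However, sometimes a child refuses to terminate within a short timeout (such as 5 seconds).
I had no idea what was going on, so I added some verbose output to diagnose the situation. I found that opening a connection with netcat and then suspending that netcat process sometimes triggers the effect.
When I was able to reproduce the effect, the debug output was: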
REST-server(cleanup_queue): deleting children
REST-server(cleanup_queue): deleting PID 23344 handling localhost:48114
child_delete: Killing child 23344
child_delete: killed child with PID 23344
(r1, r2) = (1, Interrupted system call)
_limited_wait(PID 23344 terminated): waiting up to 5 seconds for condition
_limited_wait(PID 23344 terminated): waiting 0.02 (of 5 remaining) seconds
(r1, r2) = (1, Interrupted system call)
_limited_wait(PID 23344 terminated): waiting 0.04 (of 4.98 remaining) seconds
(r1, r2) = (1, Interrupted system call)
_limited_wait(PID 23344 terminated): waiting 0.08 (of 4.94 remaining) seconds
(r1, r2) = (1, Interrupted system call)
_limited_wait(PID 23344 terminated): waiting 0.16 (of 4.86 remaining) seconds
(r1, r2) = (1, Interrupted system call)
_limited_wait(PID 23344 terminated): waiting 0.32 (of 4.7 remaining) seconds
(r1, r2) = (1, Interrupted system call)
_limited_wait(PID 23344 terminated): waiting 0.64 (of 4.38 remaining) seconds
(r1, r2) = (1, Interrupted system call)
_limited_wait(PID 23344 terminated): waiting 1.28 (of 3.74 remaining) seconds
(r1, r2) = (1, Interrupted system call)
_limited_wait(PID 23344 terminated): waiting 2.46 (of 2.46 remaining) seconds
(r1, r2) = (1, Interrupted system call)
child_delete: PID 23344 refused to terminate within 5s
failed to delete child PID 23344
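For readers unfamiliar with the pattern visible in the log, here is a minimal sketch of a bounded wait that polls a condition with doubling delays, capped by the time remaining. The name limited_wait_sketch and its parameters are hypothetical; this is not the actual _limited_wait code, only an approximation inferred from the log output.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Hypothetical helper: poll $cond until it returns true or $timeout seconds
# have passed, doubling the delay each round (capped at the time remaining).
sub limited_wait_sketch {
    my ($cond, $timeout, $first_delay) = @_;
    my $deadline = time() + $timeout;
    my $delay    = $first_delay;
    while (1) {
        return 1 if $cond->();              # condition met
        my $remaining = $deadline - time();
        return 0 if $remaining <= 0;        # timed out
        $delay = $remaining if $delay > $remaining;
        sleep($delay);
        $delay *= 2;
    }
}

# Usage: one condition that becomes true on the third poll, one that never does.
my $polls   = 0;
my $ok      = limited_wait_sketch(sub { ++$polls >= 3 }, 1, 0.01);
my $expired = limited_wait_sketch(sub { 0 }, 0.1, 0.01);
print "ok=$ok after $polls polls, expired=$expired\n";
```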
The "condition" being waited for in this case is the result of this closure:
sub {
    my $r1 = kill(0, $child_pid);
    my $r2 = $!;
    print "(r1, r2) = ($r1, $r2)\n";
    $r1 != 1 && $r2 == Errno::ESRCH;
}
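So the expected outcome is that the main process fails to "kill" the PID because it no longer exists (rather than because of "permission denied").
However, for some reason I repeatedly get "Interrupted system call".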
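For reference, a minimal sketch of how the return value of kill(0, ...) and $! relate. kill returns the number of processes it could signal, and $! is only meaningful right after a failed call, so a value printed after a successful kill is left over from some earlier call. A child that has exited but has not been reaped yet still counts as an existing PID. The helper name pid_is_gone is hypothetical.

```perl
use strict;
use warnings;
use Errno qw(ESRCH);
use POSIX ();

# Hypothetical helper: true only when kill(0, ...) fails with ESRCH.
sub pid_is_gone {
    my ($pid) = @_;
    return 0 if kill(0, $pid);          # signal 0 deliverable: PID still exists
    return $! == ESRCH ? 1 : 0;         # $! is only meaningful after a failure
}

$SIG{CHLD} = 'DEFAULT';                 # exited children stay around until reaped
my $pid = fork();
die "fork: $!" unless defined $pid;
POSIX::_exit(0) if $pid == 0;           # child: exit immediately

my $before = pid_is_gone($pid);         # running or zombie: PID still exists
waitpid($pid, 0);                       # reap the child
my $after  = pid_is_gone($pid);         # now kill fails with ESRCH
print "before reap: $before, after reap: $after\n";
```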
The main process uses a signal handler like this:
$SIG{'INT'} = $SIG{'TERM'} = sub ($) {
    my $signal = 'SIG' . $_[0];
    my $me = "signal handler[$$, $signal]";
    print "$me: cleaning up\n"
        if ($verbose > 0);
    cleanup();
    print "$me: executing default action\n"
        if ($verbose > 1);
    $SIG{$_[0]} = 'DEFAULT';
    kill($_[0], $$); # execute default action
};
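To isolate what the "reset to DEFAULT and re-raise" step in that handler does, here is a small self-contained sketch. It assumes a throwaway forked child that signals itself; the parent then inspects the wait status to confirm the child was terminated by the signal rather than by a normal exit. Nothing here is taken from the daemon itself.

```perl
use strict;
use warnings;
use POSIX ();

$SIG{CHLD} = 'DEFAULT';
my $pid = fork();
die "fork: $!" unless defined $pid;
if ($pid == 0) {                        # child
    $SIG{INT} = sub {
        my ($name) = @_;                # signal name without "SIG", e.g. "INT"
        # cleanup work would go here
        $SIG{$name} = 'DEFAULT';        # restore the default action
        kill($name, $$);                # re-raise: the process dies from the signal
    };
    kill('INT', $$);                    # trigger the handler
    sleep(5);                           # give the pending signal time to be delivered
    POSIX::_exit(1);                    # only reached if the re-raise did not terminate us
}
waitpid($pid, 0);
my $signal = $? & 127;                  # low 7 bits: terminating signal number
print "child terminated by signal $signal\n";
```

Because the child dies from the signal itself, the parent sees the signal number in the low bits of $? instead of an exit code.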
When forking a child process, I reset the signal handlers like this:
sub child_create($)
{
    my ($child) = @_;
    my $pid;
    reaper(0); # disable for the child
    if ($pid = fork()) { # parent
        reaper(1); # enable for the parent
    } elsif (defined($pid)) { # child
        my ($child_fun, @child_param) = @$child;
        my $ret;
        # prevent double-cleanup
        $SIG{'INT'} = $SIG{'TERM'} = $SIG{'__DIE__'} = 'DEFAULT';
        $ret = $child_fun->(@child_param);
        exit($ret); # avoid returning from function call
    } else { # error
        print STDERR "child_create: fork(): $!\n";
    }
    return $pid;
}
reaper() just handles SIGCHLD.
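Since the body of reaper() is not shown, the following is only a sketch of what a typical SIGCHLD reaper with an enable/disable switch might look like. The name reaper_sketch and the %exit_status hash are hypothetical and not taken from the daemon.

```perl
use strict;
use warnings;
use POSIX ":sys_wait_h";
use Time::HiRes ();

my %exit_status;                        # PID => raw wait status ($?)

# Hypothetical stand-in for reaper(): install or remove a SIGCHLD handler.
sub reaper_sketch {
    my ($enable) = @_;
    $SIG{CHLD} = $enable ? sub {
        local ($!, $?);                 # do not clobber the interrupted code's state
        # Several children may exit for a single signal, so loop until none are left.
        while ((my $pid = waitpid(-1, WNOHANG)) > 0) {
            $exit_status{$pid} = $?;
        }
    } : 'DEFAULT';
    return;
}

reaper_sketch(1);
my $pid = fork();
die "fork: $!" unless defined $pid;
POSIX::_exit(3) if $pid == 0;           # child: exit with status 3

# Wait (up to about 5 seconds) for the handler to record the child.
for (1 .. 100) {
    last if exists $exit_status{$pid};
    Time::HiRes::sleep(0.05);
}
my $code = exists $exit_status{$pid} ? $exit_status{$pid} >> 8 : -1;
print "child $pid exit code: $code\n";
```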
What could cause the observed effect? The child processes basically execute while (defined(my $req = $conn->get_request)) {...} (using HTTP::Daemon), so they should be waiting for input from netcat.
Additional information
In case it matters, the OS is SLES12 SP5 running on VMware (using Perl 5.18.2).
The code in the main server loop looks like this:
while (defined(my $conn = $daemon->accept) || $! == Errno::EINTR) {
    my $errno = $!;
    if ($quit_flag != 0) {
        last;
    }
    if ($errno == Errno::EINTR) {
        next;
    }
    #... handle $req->uri->path()
}