The strangest Docker failure I have ever seen

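I run Docker with docker-mailserver on one of my servers. After migrating some services from a legacy Debian Jessie server to an Ubuntu 16.04 LTS server, a very strange problem appeared. Server parameters:

Legacy server: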

someuser@legacyserver:~$ uname -r
3.16.0-4-amd64
someuser@legacyserver:~$ dpkg -l | grep systemd
...215-17+deb8u7...
someuser@legacyserver:~$ cat /proc/cmdline
root=ZFS=rpool/ROOT/debian-1 ro boot=zfs quiet

New server:

someuser@newserver:~$ uname -r
4.4.0-21-generic
someuser@newserver:~$ dpkg -l | grep systemd
...229-4ubuntu4...
someuser@newserver:~$ cat /proc/cmdline
root=ZFS=rpool/ROOT/debian-1 apparmor=0 ro

I run docker-mailserver on Docker inside a systemd-nspawn Debian Jessie container. The first problem I hit was that the cgroups are mounted read-only under the newer systemd. This worked around it:

mount | grep cgroup | tail -n +2 | while read line
do
    umount -l $(echo $line | cut -f3 -d" ")
    mount -t $(echo $line | cut -f5 -d" ") -o $(echo $line | cut -f6 -d" " | rev | cut -c2- | rev | cut -c2- | sed -e 's/ro,/rw,/g') $(echo $line | cut -f1 -d" ") $(echo $line | cut -f3 -d" ")
done

It simply unmounts every cgroup hierarchy and mounts it again read-write (-o remount could not be used).
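For readers who want to see what the loop does before running it, here is a minimal sketch that only prints the umount/mount pair for one line of mount output. The function names to_rw_opts and remount_cmds are hypothetical, and the sketch assumes the usual "dev on /mnt type fstype (opts)" layout printed by mount. Unlike the sed expression above, it rewrites "ro" only when it is a whole option, so a value such as errors=remount-ro is left alone.

```shell
# Hypothetical helper: turn the option field printed by mount(8),
# e.g. "(ro,nosuid,cpuset)", into a read-write option string.
to_rw_opts() {
    # Strip the surrounding parentheses, then rewrite only a whole "ro" option.
    printf '%s\n' "$1" | sed -e 's/^(//' -e 's/)$//' \
        -e 's/^ro,/rw,/' -e 's/,ro,/,rw,/' -e 's/,ro$/,rw/' -e 's/^ro$/rw/'
}

# Hypothetical helper: print (not run) the commands for one line of mount output.
remount_cmds() {
    # Split "dev on /mnt type fstype (opts)" into positional parameters:
    # 1=device 3=mount point 5=filesystem type 6=options
    set -- $1
    printf 'umount -l %s\n' "$3"
    printf 'mount -t %s -o %s %s %s\n' "$5" "$(to_rw_opts "$6")" "$1" "$3"
}
```

Feeding it `mount | grep cgroup | tail -n +2` line by line gives a dry run of the loop, which can be reviewed before anything is actually unmounted.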

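However, I first rsh into the systemd-nspawn container, and from there into the Docker container. When I then reload Postfix, for example (or do anything else), both containers (the nested Docker one and the systemd-nspawn one) exit silently, like this: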

someuser@newserver:~# rsh somesystemdcontainer
Last login: Sun Jun 25 15:27:24 CEST 2017 from host0 on pts/0
Linux somesystemdcontainer 4.4.0-21-generic #37-Ubuntu SMP Mon Apr 18 18:33:37 UTC 2016 x86_64

The programs included with the Debian GNU/Linux system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Debian GNU/Linux comes with ABSOLUTELY NO WARRANTY, to the extent
permitted by applicable law.
root@somesystemdcontainer:~# rsh mail #this is the docker container
Last login: Sun Jun 25 13:28:18 UTC 2017 from 172.18.0.1 on pts/0
Welcome to Ubuntu 14.04.5 LTS (GNU/Linux 4.4.0-21-generic x86_64)

 * Documentation:  https://help.ubuntu.com/
root@mail:~# service postfix reload
 * Reloading Postfix configuration...
   ...done.
root@mail:~# rlogin: connection closed.
root@newserver:~#

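There is nothing in dmesg, nothing in the kernel log, nothing anywhere. As you can see from the cmdline, disabling AppArmor on both the kernel side and the userspace side did not help. After I tried to stop the systemd-nspawn container, this showed up: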

jun 25 15:32:26 newserver kernel: INFO: task sh:10962 blocked for more than 120 seconds.
jun 25 15:32:26 newserver kernel:       Tainted: P           O    4.4.0-21-generic #37-Ubuntu
jun 25 15:32:26 newserver kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jun 25 15:32:26 newserver kernel: sh              D ffff88009ebb3c88     0 10962   9487 0x00000102
jun 25 15:32:26 newserver kernel:  ffff88009ebb3c88 0000000000000000 ffff88040dab3700 ffff8800c9450dc0
jun 25 15:32:26 newserver kernel:  ffff88009ebb4000 ffff8800c08008b0 0000000000000001 ffff8800c9450dc0
jun 25 15:32:26 newserver kernel:  ffff8800c2fe87e8 ffff88009ebb3ca0 ffffffff818203f5 ffff8800c9450dc0
jun 25 15:32:26 newserver kernel: Call Trace:
jun 25 15:32:26 newserver kernel:  [<ffffffff818203f5>] schedule+0x35/0x80
jun 25 15:32:26 newserver kernel:  [<ffffffff8111fd4f>] zap_pid_ns_processes+0x13f/0x1a0
jun 25 15:32:26 newserver kernel:  [<ffffffff8108432b>] do_exit+0xa6b/0xae0
jun 25 15:32:26 newserver kernel:  [<ffffffff8122383f>] ? dput+0x2f/0x220
jun 25 15:32:26 newserver kernel:  [<ffffffff81084423>] do_group_exit+0x43/0xb0
jun 25 15:32:26 newserver kernel:  [<ffffffff810904d2>] get_signal+0x292/0x600
jun 25 15:32:26 newserver kernel:  [<ffffffff8102e517>] do_signal+0x37/0x6f0
jun 25 15:32:26 newserver kernel:  [<ffffffff8181fd36>] ? __schedule+0x386/0xa10
jun 25 15:32:26 newserver kernel:  [<ffffffff81083526>] ? do_wait+0x116/0x240
jun 25 15:32:26 newserver kernel:  [<ffffffff8100320c>] exit_to_usermode_loop+0x8c/0xd0
jun 25 15:32:26 newserver kernel:  [<ffffffff81003c5e>] syscall_return_slowpath+0x4e/0x60
jun 25 15:32:26 newserver kernel:  [<ffffffff81824650>] int_ret_from_sys_call+0x25/0x8f
jun 25 15:32:53 newserver systemd[1]: [email protected]: State 'stop-sigterm' timed out. Killing.
jun 25 15:32:53 newserver systemd-nspawn[9483]: somesystemdcontainer login:
jun 25 15:32:53 newserver systemd[1]: [email protected]: Main process exited, code=killed, status=9/KILL
jun 25 15:32:53 newserver systemd[1]: Stopped Container somesystemdcontainer.
jun 25 15:32:53 newserver systemd[1]: [email protected]: Unit entered failed state.
jun 25 15:32:53 newserver systemd[1]: [email protected]: Failed with result 'signal'.
jun 25 15:32:53 newserver systemd[1]: Stopped Container somesystemdcontainer.
jun 25 15:32:53 newserver systemd-machined[2890]: Machine somesystemdcontainer terminated.
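The trace shows task 10962 in state D (uninterruptible sleep) inside zap_pid_ns_processes. As a rough sketch of how to spot such tasks before the 120-second hung-task warning fires, the filter below picks D-state entries out of ps output. The function name list_d_state is hypothetical, and it assumes the three-column layout produced by `ps -eo pid,stat,comm`.

```shell
# Hypothetical helper: read "PID STAT COMMAND" lines on stdin (with a header
# line) and print the PID and command of every task whose state starts with D.
list_d_state() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }'
}

# Typical use on the host:
#   ps -eo pid,stat,comm | list_d_state
```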

10962 is... the bash inside the DOCKER container, which has "jumped out of its namespace" in pstree...

What should I do now?
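One way to check the namespace observation without relying on pstree is to compare the /proc/PID/ns/pid links of two processes. This is only a sketch: the function names pidns_of and same_pidns are hypothetical, the optional last argument (an alternative proc root) exists purely so the helpers can be exercised against a fake directory, and reading the links of other users' processes normally requires root.

```shell
# Hypothetical helper: print the PID-namespace identifier of a process,
# e.g. "pid:[4026531836]". $1 = PID, $2 = proc root (defaults to /proc).
pidns_of() {
    readlink "${2:-/proc}/$1/ns/pid"
}

# Hypothetical helper: succeed if two PIDs are in the same PID namespace.
# $1, $2 = PIDs, $3 = optional proc root.
same_pidns() {
    [ "$(pidns_of "$1" "$3")" = "$(pidns_of "$2" "$3")" ]
}

# Typical use on the host (as root):
#   same_pidns 1 10962 && echo "10962 is in the host PID namespace"
```

If 10962 reports the same identifier as PID 1 on the host, it really is outside the container's PID namespace rather than merely being drawn oddly by pstree.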