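I want to start etcd (single node) in docker from systemd, but something seems to go wrong: it is terminated about 30 seconds after starting.
It looks like the service starts in the status "activating" but is terminated after about 30 seconds without ever reaching the status "active". Perhaps some signalling is missing between the docker container and systemd?
Update (see the bottom of the post): when I remove the Restart=on-failure directive, the systemd service status reaches failed (Result: timeout).
When I check the status of the etcd service after boot, I get this result: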
$ sudo systemctl status etcd
● etcd.service - etcd
Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: activating (auto-restart) (Result: exit-code) since Wed 2021-08-18 20:13:30 UTC; 4s ago
Process: 2971 ExecStart=/usr/bin/docker run -p 2380:2380 -p 2379:2379 --volume=etcd-data:/etcd-data --name etcd my-aws-account.dkr.ecr.eu-north-1.amazonaws.com/etcd:v3.5.0 /usr/local/bin/etcd --data-dir=/etcd-data --name etcd0 --advertise-client-urls http://10.0.0.11:2379 --listen-client-urls http://0.0.0.0:2379 --initial-advertise-peer-urls http://10.0.0.11:2380 --listen-peer-urls http://0.0.0.0:2380 --initial-cluster etcd0=http://10.0.0.11:2380 (code=exited, status=125)
Main PID: 2971 (code=exited, status=125)
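I run this on an Amazon Linux 2 machine, with a user-data script that runs at launch. I have confirmed that docker.service and docker_ecr_login.service have run successfully.
Shortly after booting the machine, I can see that etcd is running: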
sudo systemctl status etcd
● etcd.service - etcd
Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: activating (start) since Wed 2021-08-18 20:30:07 UTC; 1min 20s ago
Main PID: 1573 (docker)
Tasks: 9
Memory: 24.3M
CGroup: /system.slice/etcd.service
└─1573 /usr/bin/docker run -p 2380:2380 -p 2379:2379 --volume=etcd-data:/etcd-data --name etcd my-aws-account.dkr.ecr.eu-north-1.amazonaws.com...
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.690Z","logger":"raft","caller":"...rm 2"}
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.691Z","caller":"etcdserver/serve..."3.5"}
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.693Z","caller":"membership/clust..."3.5"}
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.693Z","caller":"etcdserver/server.go:2...
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.693Z","caller":"api/capability.g..."3.5"}
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.693Z","caller":"etcdserver/serve..."3.5"}
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.693Z","caller":"embed/serve.go:9...ests"}
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.695Z","caller":"etcdmain/main.go...emon"}
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.695Z","caller":"etcdmain/main.go...emon"}
Aug 18 20:30:17 ip-10-0-0-11.eu-north-1.compute.internal docker[1573]: {"level":"info","ts":"2021-08-18T20:30:17.702Z","caller":"embed/serve.go:1...2379"}
Hint: Some lines were ellipsized, use -l to show in full.
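I get the same behavior whether etcd listens on the node IP (10.0.0.11) or on 127.0.0.1.
I can run etcd locally, started from the command line (where it is not terminated after 30 seconds), with: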
sudo docker run -p 2380:2380 -p 2379:2379 --volume=etcd-data:/etcd-data --name etcd-local \
my-aws-account.dkr.ecr.eu-north-1.amazonaws.com/etcd:v3.5.0 \
/usr/local/bin/etcd --data-dir=/etcd-data \
--name etcd0 \
--advertise-client-urls http://127.0.0.1:2379 \
--listen-client-urls http://0.0.0.0:2379 \
--initial-advertise-peer-urls http://127.0.0.1:2380 \
--listen-peer-urls http://0.0.0.0:2380 \
--initial-cluster etcd0=http://127.0.0.1:2380
The etcd arguments are similar to those in "Running a single node etcd" in the etcd 3.5 docs.
This is the relevant part of the startup script that starts etcd:
sudo docker volume create --name etcd-data
cat <<EOF | sudo tee /etc/systemd/system/etcd.service
[Unit]
Description=etcd
After=docker_ecr_login.service
[Service]
Type=notify
ExecStart=/usr/bin/docker run -p 2380:2380 -p 2379:2379 --volume=etcd-data:/etcd-data \
--name etcd my-aws-account.dkr.ecr.eu-north-1.amazonaws.com/etcd:v3.5.0 \
/usr/local/bin/etcd --data-dir=/etcd-data \
--name etcd0 \
--advertise-client-urls http://10.0.0.11:2379 \
--listen-client-urls http://0.0.0.0:2379 \
--initial-advertise-peer-urls http://10.0.0.11:2380 \
--listen-peer-urls http://0.0.0.0:2380 \
--initial-cluster etcd0=http://10.0.0.11:2380
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl enable etcd
sudo systemctl start etcd
When I list all containers on the machine, I can see that it has been running:
sudo docker ps -a
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a744aed0beb1 my-aws-account.dkr.ecr.eu-north-1.amazonaws.com/etcd:v3.5.0 "/usr/local/bin/etcd…" 25 minutes ago Exited (0) 24 minutes ago etcd
But I suspect that it cannot be restarted because the container name already exists.
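One way to check this suspicion is to ask docker for an exact name match before starting the unit. This is a minimal sketch, assuming the container name etcd from the unit file above; the helper name_in_use is hypothetical, and the docker call is guarded so the script does nothing where the docker CLI is absent. For context, docker run exits with status 125 when the docker command itself fails (a name conflict is one such case) rather than the program inside the container.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the exact name "$1" appears on stdin,
# one container name per line.
name_in_use() {
    grep -Fxq -- "$1"
}

# Only talk to docker if the CLI is installed.
if command -v docker >/dev/null 2>&1; then
    # List the names of all containers, running or stopped.
    if docker ps -a --format '{{.Names}}' 2>/dev/null | name_in_use etcd; then
        echo "container 'etcd' already exists; 'docker run --name etcd' would fail"
    fi
fi
```

The match is on the whole line, so a container named etcd-local does not count as a conflict for etcd.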
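Why is the etcd container terminated about 30 seconds after being started from systemd? It looks like it starts successfully, but systemd only ever shows it in the "activating" state, never "active", and it appears to be terminated after about 30 seconds. Is some signalling from the etcd docker container to systemd missing? If so, how can I get that signalling right?
Update:
After removing the Restart=on-failure directive from the service unit file, I now get the status failed (Result: timeout):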
$ sudo systemctl status etcd
● etcd.service - etcd
Loaded: loaded (/etc/systemd/system/etcd.service; enabled; vendor preset: disabled)
Active: failed (Result: timeout) since Wed 2021-08-18 21:35:54 UTC; 5min ago
Process: 1567 ExecStart=/usr/bin/docker run -p 2380:2380 -p 2379:2379 --volume=etcd-data:/etcd-data --name etcd my-aws-account.dkr.ecr.eu-north-1.amazonaws.com/etcd:v3.5.0 /usr/local/bin/etcd --data-dir=/etcd-data --name etcd0 --advertise-client-urls http://127.0.0.1:2379 --listen-client-urls http://0.0.0.0:2379 --initial-advertise-peer-urls http://127.0.0.1:2380 --listen-peer-urls http://0.0.0.0:2380 --initial-cluster etcd0=http://127.0.0.1:2380 (code=exited, status=0/SUCCESS)
Main PID: 1567 (code=exited, status=0/SUCCESS)
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal docker[1567]: {"level":"info","ts":"2021-08-18T21:35:54.332Z","caller":"osutil/interrupt...ated"}
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal docker[1567]: {"level":"info","ts":"2021-08-18T21:35:54.333Z","caller":"embed/etcd.go:36...379"]}
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal docker[1567]: WARNING: 2021/08/18 21:35:54 [core] grpc: addrConn.createTransport failed ...ing...
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal docker[1567]: {"level":"info","ts":"2021-08-18T21:35:54.335Z","caller":"etcdserver/serve...6a6c"}
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal docker[1567]: {"level":"info","ts":"2021-08-18T21:35:54.337Z","caller":"embed/etcd.go:56...2380"}
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal docker[1567]: {"level":"info","ts":"2021-08-18T21:35:54.338Z","caller":"embed/etcd.go:56...2380"}
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal docker[1567]: {"level":"info","ts":"2021-08-18T21:35:54.339Z","caller":"embed/etcd.go:36...379"]}
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal systemd[1]: Failed to start etcd.
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal systemd[1]: Unit etcd.service entered failed state.
Aug 18 21:35:54 ip-10-0-0-11.eu-north-1.compute.internal systemd[1]: etcd.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
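Answer 1
Update: posting test data and integrating updates based on the comments received. Contrary to what I first thought, docker -d is not required for systemd integration. In my experience, the Type= setting that Michael pointed out seems to matter more than offloading the daemon state of the service to docker. At first glance, the OP's problem looked like a side effect of not running in the background, as I originally explained. After further testing, backgrounding appears to be irrelevant.
Note that the Amazon AWS image used in the OP is not something I can test or troubleshoot directly. A comparable example of etcd under systemd is shown here to help configure an end system similar to mine. System details: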
- Ubuntu 20.04 LTS
- docker 20.10.7
- etcd 3.5.0
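systemd configuration
I ended up with the following systemd service file. Note Type=simple; I set it explicitly because Michael suggested clarifying this point in the answer (and, evidently, to sort out my own understanding of this piece of the puzzle). You can read more about systemd service types here: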
https://www.freedesktop.org/software/systemd/man/systemd.service.html
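The type matters. More to the point, my initial understanding of the simple type focused short-sightedly on its lack of communication with systemd, which led me to overlook how the applicable Type= setting makes systemd react to the behavior of the invoked application (docker, in this case).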
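To make the role of Type= concrete: with Type=notify, systemd keeps the unit in "activating" until the main process reports READY=1 over the notification socket, and fails the start with Result: timeout if that never arrives. The sketch below does not try to establish why the notification is missing in the OP's setup; it only interprets properties in the KEY=VALUE format printed by systemctl show. The helper explain_unit_state and its messages are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads KEY=VALUE lines (the format printed by
# `systemctl show`) on stdin and prints a one-line interpretation.
explain_unit_state() {
    unit_type="" active="" sub="" result=""
    while IFS='=' read -r key value; do
        case "$key" in
            Type) unit_type=$value ;;
            ActiveState) active=$value ;;
            SubState) sub=$value ;;
            Result) result=$value ;;
        esac
    done
    if [ "$unit_type" = notify ] && [ "$sub" = start ]; then
        echo "still waiting for READY=1 from the main process"
    elif [ "$unit_type" = notify ] && [ "$result" = timeout ]; then
        echo "start timed out before READY=1 was received"
    else
        echo "Type=$unit_type ActiveState=$active SubState=$sub Result=$result"
    fi
}

# Example input resembling the failed state shown in the question.
printf 'Type=notify\nActiveState=failed\nSubState=failed\nResult=timeout\n' \
    | explain_unit_state
```

On a machine with systemd, the same helper could be fed with: systemctl show -p Type -p ActiveState -p SubState -p Result etcd.service | explain_unit_state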
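Removing the type or setting it to simple results in the same behavior (simple is the default when ExecStart= is set and neither Type= nor BusName= is specified). In my tests, the following configuration ran reliably, with or without -d in the docker run command: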
[Unit]
Description=Docker container-etcd.service
Documentation=man:docker
Requires=docker.service
Wants=network.target
After=network-online.target
[Service]
ExecStartPre=-/usr/bin/docker stop etcd
ExecStartPre=-/usr/bin/docker rm etcd
ExecStart=/usr/bin/docker run --rm -d -p 2379:2379 -p 2380:2380 --volume=/home/user/etcd-data:/etcd-data --name etcd quay.io/coreos/etcd:v3.5.0 /usr/local/bin/etcd --data-dir=/etcd-data --name etcd --initial-advertise-peer-urls http://10.4.4.132:2380 --listen-peer-urls http://0.0.0.0:2380 --advertise-client-urls http://10.4.4.132:2379 --listen-client-urls http://0.0.0.0:2379 --initial-cluster etcd=http://10.4.4.132:2380
ExecStop=/usr/bin/docker stop etcd -t 10
# ExecRestart= is not a systemd directive; systemd logs a warning and ignores this line.
ExecRestart=/usr/bin/docker restart etcd
KillMode=none
RemainAfterExit=1
Restart=on-failure
Type=simple
[Install]
WantedBy=multi-user.target default.target
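After saving a unit like the one above, the usual follow-up is a daemon reload and a restart. This is a minimal sketch, assuming the unit is installed as container-etcd.service (the name in the Description line); run and apply_unit are hypothetical helpers, and by default the script only prints the commands instead of running them.

```shell
#!/bin/sh
# Print each command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Reload unit files, enable the unit at boot, restart it, and show its status.
apply_unit() {
    unit=$1
    run systemctl daemon-reload
    run systemctl enable "$unit"
    run systemctl restart "$unit"
    run systemctl --no-pager status "$unit"
}

apply_unit container-etcd.service
```

Running it with DRY_RUN=0 as root would execute the four systemctl commands for real.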
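Notes
- Added RemainAfterExit because, without it, systemd considers the service exited once the start command returns (with -d, docker run returns as soon as the container has been started). Leaving this boolean out produces a seemingly wrong situation where docker ps shows the container running but systemctl status container-etcd shows the unit as exited and inactive.
- The systemd unit file is not entirely correct syntactically. %n is normally used in Exec lines to refer to the unit name (as in ...docker restart %n); I did not want to cause further confusion while trying to address the OP's problem. On top of that, I use etcd as the docker container name rather than container-etcd, the unit's service name.
- ExecStart is compressed into a single-line command. The standard \ continuation syntax did not work for me, and neither did quoting the etcd command passed to the container. My tests yesterday seemed to work fine, but today's configuration did not behave the same way. So I re-tested and reconfigured to find the configuration that was most stable for me.
- Obviously, if you are going to use docker rm at any point, you must, or at least very strongly should, use a bind mount, as noted in the OP, done here with --volume. Personally, I use full path locations, all stored under /srv, and bind mount them into the container. That way I have a single folder to back up, and the state of the container (whether it exists or not) does not matter.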
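The bind-mount note can be sketched as follows. This is only an illustration: the helper etcd_volume_arg is hypothetical, /srv/etcd-data is an assumed location, and restricting the directory to its owner is a precaution chosen here rather than something the configuration above requires. The demo uses a throwaway directory so it can be run without touching /srv.

```shell
#!/bin/sh
# Hypothetical helper: create a host directory for etcd data and print the
# matching --volume argument for `docker run`.
etcd_volume_arg() {
    host_dir=$1
    mkdir -p "$host_dir" || return 1
    # Assumption: owner-only access; adjust to local policy.
    chmod 700 "$host_dir" || return 1
    printf '%s\n' "--volume=$host_dir:/etcd-data"
}

# Demo in a temporary directory; in real use the argument would be a full
# path such as /srv/etcd-data (an assumed location).
demo_root=$(mktemp -d)
etcd_volume_arg "$demo_root/srv/etcd-data"
```

Because the data lives on the host path, removing and recreating the container leaves it in place.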
Verification
After updating the systemd service file, running a daemon-reload, and so on, I entered the container and ran a test command against etcd:
docker exec -it etcd sh
etcdctl --endpoints=http://10.4.4.132:2379 member list
Result
9a552f9b95628384, started, etcd, http://10.4.4.132:2380, http://10.4.4.132:2379, false
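For a scripted version of this check, the member list output can be parsed instead of read by eye. This is a minimal sketch that assumes the comma-separated format shown above, with the member status in the second field; the helper all_members_started is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if stdin holds at least one member line and
# every line reports the status "started" in its second field.
all_members_started() {
    awk -F', ' '
        NF > 0 { total++; if ($2 == "started") ok++ }
        END { exit !(total > 0 && ok == total) }
    '
}

# The line reported above:
echo '9a552f9b95628384, started, etcd, http://10.4.4.132:2380, http://10.4.4.132:2379, false' \
    | all_members_started && echo "all members started"
```

Inside the container, the helper could be fed from: etcdctl --endpoints=http://10.4.4.132:2379 member list | all_members_started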