Problem adding nodes from non-Hetzner providers to a Rancher cluster: stuck at "Waiting for agent to check in and apply initial plan"

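I have successfully deployed a Rancher server on one of five servers provided by Hetzner. For the cluster itself, I assigned one of the remaining servers the control plane and etcd roles and set up the other three as worker nodes. This gave me a working cluster running Kubernetes v1.27.11+k3s1. However, I run into a problem when I try to add worker nodes from providers other than Hetzner: the process gets stuck at the "Waiting for agent to check in and apply initial plan" stage.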

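To rule out connectivity problems, I ran network tests with telnet and dig from the server I was trying to add, and found nothing obviously wrong. The command I used to add the worker node was: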

curl -fL https://rancher.mydomain.org/system-agent-install.sh | sudo sh -s - --server https://rancher.mydomain.org --label 'cattle.io/os=linux' --token ******* --worker

Checking the system log with sudo journalctl -u k3s-agent --no-pager, I saw that the agent starts up but then fails with an error about retrieving the CA certificates:

root@ubuntu:~# sudo journalctl -u k3s-agent --no-pager
Mar 16 12:12:48 ubuntu systemd[1]: k3s-agent.service: Succeeded.
Mar 16 12:12:48 ubuntu systemd[1]: Stopped Lightweight Kubernetes.
Mar 16 12:12:48 ubuntu systemd[1]: Starting Lightweight Kubernetes...
Mar 16 12:12:48 ubuntu sh[885689]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Mar 16 12:12:48 ubuntu k3s[885707]: time="2024-03-16T12:12:48Z" level=info msg="Starting k3s agent v1.27.11+k3s1 (06d6bc80)"
Mar 16 12:12:48 ubuntu k3s[885707]: time="2024-03-16T12:12:48Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: [2a01:4f8:c012:3140::1]:6443"
Mar 16 12:12:48 ubuntu k3s[885707]: time="2024-03-16T12:12:48Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [[2a01:4f8:c012:3140::1]:6443] [default: [2a01:4f8:c012:3140::1]:6443]"
Mar 16 12:12:48 ubuntu k3s[885707]: time="2024-03-16T12:12:48Z" level=error msg="failed to get CA certs: Get \"https://127.0.0.1:6444/cacerts\": read tcp 127.0.0.1:45838->127.0.0.1:6444: read: connection reset by peer"

By contrast, adding a Hetzner server the same way completes without any such error. For comparison, here is a log excerpt from a Hetzner server that joined successfully:

root@dev-k3s-worker-3:~# sudo journalctl -u k3s-agent --no-pager
Mar 15 14:57:53 dev-k3s-worker-3 systemd[1]: Starting Lightweight Kubernetes...
Mar 15 14:57:53 dev-k3s-worker-3 sh[1391]: + /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service
Mar 15 14:57:53 dev-k3s-worker-3 k3s[1396]: time="2024-03-15T14:57:53Z" level=info msg="Acquiring lock file /var/lib/rancher/k3s/data/.lock"
Mar 15 14:57:53 dev-k3s-worker-3 k3s[1396]: time="2024-03-15T14:57:53Z" level=info msg="Preparing data dir /var/lib/rancher/k3s/data/*****"
Mar 15 14:57:54 dev-k3s-worker-3 k3s[1396]: time="2024-03-15T14:57:54Z" level=info msg="Starting k3s agent v1.27.11+k3s1 (06d6bc80)"
Mar 15 14:57:54 dev-k3s-worker-3 k3s[1396]: time="2024-03-15T14:57:54Z" level=info msg="Adding server to load balancer k3s-agent-load-balancer: [2a01:4f8:c012:3140::1]:6443"
Mar 15 14:57:54 dev-k3s-worker-3 k3s[1396]: time="2024-03-15T14:57:54Z" level=info msg="Running load balancer k3s-agent-load-balancer 127.0.0.1:6444 -> [[2a01:4f8:c012:3140::1]:6443] [default: [2a01:4f8:c012:3140::1]:6443]"
Mar 15 14:57:54 dev-k3s-worker-3 k3s[1396]: time="2024-03-15T14:57:54Z" level=warning msg="Cluster CA certificate is not trusted by the host CA bundle, but the token does not include a CA hash. Use the full token from the server's node-token file to enable Cluster CA validation."
Mar 15 14:57:55 dev-k3s-worker-3 k3s[1396]: time="2024-03-15T14:57:55Z" level=info msg="Using private registry config file at /etc/rancher/k3s/registries.yaml"
Mar 15 14:57:55 dev-k3s-worker-3 k3s[1396]: time="2024-03-15T14:57:55Z" level=info msg="Module overlay was already loaded"
Mar 15 14:57:55 dev-k3s-worker-3 k3s[1396]: time="2024-03-15T14:57:55Z" level=info msg="Module br_netfilter was already loaded"
Mar 15 14:57:55 dev-k3s-worker-3 k3s[1396]: time="2024-03-15T14:57:55Z" level=info msg="Set sysctl 'net/ipv4/conf/all/forwarding' to 1"
Mar 15 14:57:55 dev-k3s-worker-3 k3s[1396]: time="2024-03-15T14:57:55Z" level=info msg="Set sysctl 'net/netfilter/nf_conntrack_max' to 131072"
Mar 15 14:57:55 dev-k3s-worker-3 k3s[1396]: time="2024-03-15T14:57:55Z" level=info msg="Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_established' to 86400"
Mar 15 14:57:55 dev-k3s-worker-3 k3s[1396]: time="2024-03-15T14:57:55Z" level=info msg="Set sysctl 'net/netfilter/nf_conntrack_tcp_timeout_close_wait' to 3600"
Mar 15 14:57:55 dev-k3s-worker-3 k3s[1396]: time="2024-03-15T14:57:55Z" level=info msg="Logging containerd to /var/lib/rancher/k3s/agent/containerd/containerd.log"
Mar 15 14:57:55 dev-k3s-worker-3 k3s[1396]: time="2024-03-15T14:57:55Z" level=info msg="Running containerd -c /var/lib/rancher/k3s/agent/etc/containerd/config.toml -a /run/k3s/containerd/containerd.sock --state /run/k3s/containerd --root /var/lib/rancher/k3s/agent/containerd"
Mar 15 14:57:56 dev-k3s-worker-3 k3s[1396]: time="2024-03-15T14:57:56Z" level=info msg="containerd is now running"

... (remaining log output omitted)

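This successful join contrasts sharply with the failures on servers from other providers, which suggests the problem only occurs outside the Hetzner environment.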

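I expected non-Hetzner servers to join the cluster as worker nodes without trouble, just like the Hetzner ones. I added them with the standard registration command that Rancher provides for scaling the cluster and, based on my tests, assumed that network connectivity was not the issue. Instead, I get a persistent "Waiting for agent to check in and apply initial plan" message and see errors in the log about retrieving the CA certificates.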

Answer 1

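The Rancher machine had been assigned an IPv6 address, so the machines outside Hetzner could not reach it: they were unable to resolve the IPv6 DNS entry.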
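
One way to confirm this kind of problem from the node that fails to join is to test the server address shown in the agent log directly. The following is only a sketch under a few assumptions: the function names are made up for illustration, the address and port are the ones printed in the log in the question ([2a01:4f8:c012:3140::1]:6443), and the /cacerts path is the one the agent itself requests. It checks for a default IPv6 route before trying the HTTPS request.

```shell
#!/bin/sh
# Hypothetical helpers for checking whether a joining node can reach the
# k3s server address printed in the agent log. Names are illustrative only.

# Succeeds if the argument looks like an IPv6 literal (it contains a colon).
is_ipv6_literal() {
    case "$1" in
        *:*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Prints the /cacerts URL for a host and port, bracketing IPv6 literals.
cacerts_url() {
    if is_ipv6_literal "$1"; then
        printf 'https://[%s]:%s/cacerts\n' "$1" "$2"
    else
        printf 'https://%s:%s/cacerts\n' "$1" "$2"
    fi
}

# Tries to fetch /cacerts from the server; returns non-zero on failure.
check_server() {
    url=$(cacerts_url "$1" "$2")
    # A node without a default IPv6 route cannot reach a public IPv6 address.
    if is_ipv6_literal "$1" && [ -z "$(ip -6 route show default 2>/dev/null)" ]; then
        echo "no default IPv6 route on this node; cannot reach $1" >&2
        return 1
    fi
    # -g: do not treat [] as glob characters
    # -k: skip CA verification, since the cluster CA is not in the host bundle
    curl -g -k -sS --max-time 10 -o /dev/null "$url"
}
```

Running check_server 2a01:4f8:c012:3140::1 6443 on a working Hetzner node and on the failing node should show whether the failure is specific to IPv6 reachability. If it is, the likely options are to give the joining node working IPv6 connectivity or to make the cluster reachable over IPv4.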
