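I have never installed Rancher before, but I am trying to set up a Rancher environment on an on-premises HA RKE2 cluster. I have an F5 as the load balancer, configured to handle ports 80, 443, 6443, and 9345. A DNS record named rancher-demo.localdomain.local points to the load balancer's IP address. I want to supply my own certificate files, and I have created such a certificate through our internal CA.
The cluster itself is up and running normally. When I ran the install on the nodes other than the first one, they joined using the DNS name that points to the LB IP, so I know that part of the LB works.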
kubectl get nodes
NAME                            STATUS   ROLES                       AGE   VERSION
rancher0001.localdomain.local   Ready    control-plane,etcd,master   25h   v1.26.12+rke2r1
rancher0002.localdomain.local   Ready    control-plane,etcd,master   25h   v1.26.12+rke2r1
rancher0003.localdomain.local   Ready    control-plane,etcd,master   25h   v1.26.12+rke2r1
Before installing Rancher, I ran the following commands:
kubectl create namespace cattle-system
kubectl -n cattle-system create secret tls tls-rancher-ingress --cert=~/tls.crt --key=~/tls.key
kubectl -n cattle-system create secret generic tls-ca --from-file=cacerts.pem=~/cacerts.pem
Finally, I installed Rancher:
helm install rancher rancher-stable/rancher --namespace cattle-system --set hostname=rancher-demo.localdomain.local --set bootstrapPassword=passwordgoeshere --set ingress.tls.source=secret --set privateCA=true
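I don't remember exactly what the error was, but I did see a timeout error shortly after running the install. It did perform a *partial* install: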
kubectl -n cattle-system rollout status deploy/rancher
deployment "rancher" successfully rolled out
kubectl get ns
NAME                                     STATUS   AGE
cattle-fleet-clusters-system             Active   5h18m
cattle-fleet-system                      Active   5h24m
cattle-global-data                       Active   5h25m
cattle-global-nt                         Active   5h25m
cattle-impersonation-system              Active   5h24m
cattle-provisioning-capi-system          Active   5h6m
cattle-system                            Active   5h29m
cluster-fleet-local-local-1a3d67d0a899   Active   5h18m
default                                  Active   25h
fleet-default                            Active   5h25m
fleet-local                              Active   5h26m
kube-node-lease                          Active   25h
kube-public                              Active   25h
kube-system                              Active   25h
local                                    Active   5h25m
p-c94zp                                  Active   5h24m
p-m64sb                                  Active   5h24m
kubectl get pods --all-namespaces
NAMESPACE             NAME                                                     READY   STATUS    RESTARTS        AGE
cattle-fleet-system   fleet-controller-56968b86b6-6xdng                        1/1     Running   0               5h19m
cattle-fleet-system   gitjob-7d68454468-tvcrt                                  1/1     Running   0               5h19m
cattle-system         rancher-64bdc898c7-56fpm                                 1/1     Running   0               5h27m
cattle-system         rancher-64bdc898c7-dl4cz                                 1/1     Running   0               5h27m
cattle-system         rancher-64bdc898c7-z55lh                                 1/1     Running   1 (5h25m ago)   5h27m
cattle-system         rancher-webhook-58d68fb97d-zpg2p                         1/1     Running   0               5h17m
kube-system           cloud-controller-manager-rancher0001.localdomain.local   1/1     Running   1 (22h ago)     25h
kube-system           cloud-controller-manager-rancher0002.localdomain.local   1/1     Running   1 (22h ago)     25h
kube-system           cloud-controller-manager-rancher0003.localdomain.local   1/1     Running   1 (22h ago)     25h
kube-system           etcd-rancher0001.localdomain.local                       1/1     Running   0               25h
kube-system           etcd-rancher0002.localdomain.local                       1/1     Running   3 (22h ago)     25h
kube-system           etcd-rancher0003.localdomain.local                       1/1     Running   3 (22h ago)     25h
kube-system           kube-apiserver-rancher0001.localdomain.local             1/1     Running   0               25h
kube-system           kube-apiserver-rancher0002.localdomain.local             1/1     Running   0               25h
kube-system           kube-apiserver-rancher0003.localdomain.local             1/1     Running   0               25h
kube-system           kube-controller-manager-rancher0001.localdomain.local    1/1     Running   1 (22h ago)     25h
kube-system           kube-controller-manager-rancher0002.localdomain.local    1/1     Running   1 (22h ago)     25h
kube-system           kube-controller-manager-rancher0003.localdomain.local    1/1     Running   0               25h
kube-system           kube-proxy-rancher0001.localdomain.local                 1/1     Running   0               25h
kube-system           kube-proxy-rancher0002.localdomain.local                 1/1     Running   0               25h
kube-system           kube-proxy-rancher0003.localdomain.local                 1/1     Running   0               25h
kube-system           kube-scheduler-rancher0001.localdomain.local             1/1     Running   1 (22h ago)     25h
kube-system           kube-scheduler-rancher0002.localdomain.local             1/1     Running   0               25h
kube-system           kube-scheduler-rancher0003.localdomain.local             1/1     Running   0               25h
kube-system           rke2-canal-2jngw                                         2/2     Running   0               25h
kube-system           rke2-canal-6qrc4                                         2/2     Running   0               25h
kube-system           rke2-canal-bk2f8                                         2/2     Running   0               25h
kube-system           rke2-coredns-rke2-coredns-565dfc7d75-87pjr               1/1     Running   0               25h
kube-system           rke2-coredns-rke2-coredns-565dfc7d75-wh64f               1/1     Running   0               25h
kube-system           rke2-coredns-rke2-coredns-autoscaler-6c48c95bf9-mlcln    1/1     Running   0               25h
kube-system           rke2-ingress-nginx-controller-6p8ll                      1/1     Running   0               22h
kube-system           rke2-ingress-nginx-controller-7pm5c                      1/1     Running   0               5h22m
kube-system           rke2-ingress-nginx-controller-brfwh                      1/1     Running   0               22h
kube-system           rke2-metrics-server-c9c78bd66-f5vrb                      1/1     Running   0               25h
kube-system           rke2-snapshot-controller-6f7bbb497d-vqg9s                1/1     Running   0               22h
kube-system           rke2-snapshot-validation-webhook-65b5675d5c-dt22h        1/1     Running   0               22h
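However, things clearly did not go well: when I visit https://rancher-demo.localdomain.local I get a 404 Not Found page.
I have never set this up before, so I don't know how to troubleshoot it. I spent several hours going through various posts, but I could not find anything that matches this particular problem.
Here are some things I found: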
kubectl -n cattle-system logs -f rancher-64bdc898c7-56fpm
2024/01/17 21:13:23 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:13:38 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:13:53 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
(repeats every 15 seconds)
kubectl get ingress --all-namespaces
No resources found
(I *know* there was an ingress at some point, I believe in cattle-system; now it's gone. I didn't remove it.)
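One way to tell whether the chart ever rendered an Ingress is to list the resource kinds in the manifest that Helm stored for the release and compare that with what is actually in the cluster. A minimal sketch, assuming the release name rancher and the namespace cattle-system from the install command above; manifest_kinds is a hypothetical helper name, not part of helm or kubectl.

```shell
# manifest_kinds: hypothetical helper (not part of helm or kubectl).
# Reads a multi-document YAML manifest on stdin and prints one
# "<kind> <count>" line per top-level resource kind, sorted by kind.
manifest_kinds() {
  # Match "kind:" only at column 0, so nested fields such as
  # roleRef.kind are not counted.
  awk '/^kind:/ && NF >= 2 { count[$2]++ }
       END { for (k in count) print k, count[k] }' | LC_ALL=C sort
}

# Usage against the stored release manifest (needs cluster access):
#   helm -n cattle-system get manifest rancher | manifest_kinds
```

If Ingress shows up in that list while kubectl get ingress --all-namespaces still returns nothing, the object was probably removed or never applied after the release was recorded. If it is missing from the list, the chart values would be the next place to look.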
kubectl -n cattle-system describe service rancher
Name:              rancher
Namespace:         cattle-system
Labels:            app=rancher
                   app.kubernetes.io/managed-by=Helm
                   chart=rancher-2.7.9
                   heritage=Helm
                   release=rancher
Annotations:       meta.helm.sh/release-name: rancher
                   meta.helm.sh/release-namespace: cattle-system
Selector:          app=rancher
Type:              ClusterIP
IP Family Policy:  SingleStack
IP Families:       IPv4
IP:                10.43.199.3
IPs:               10.43.199.3
Port:              http  80/TCP
TargetPort:        80/TCP
Endpoints:         10.42.0.26:80,10.42.1.22:80,10.42.1.23:80
Port:              https-internal  443/TCP
TargetPort:        444/TCP
Endpoints:         10.42.0.26:444,10.42.1.22:444,10.42.1.23:444
Session Affinity:  None
Events:            <none>
kubectl -n cattle-system logs -l app=rancher
2024/01/17 21:17:38 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:17:53 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:18:08 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:18:23 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:18:38 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:18:53 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:19:08 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:19:23 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:19:38 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:19:53 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.23]: dial tcp 10.42.0.26:443: i/o timeout
2024/01/17 21:19:40 [ERROR] Failed to connect to peer wss://10.42.1.22/v3/connect [local ID=10.42.0.26]: dial tcp 10.42.1.22:443: i/o timeout
E0117 21:19:45.551484 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0117 21:19:45.646038 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
2024/01/17 21:19:45 [ERROR] Failed to read API for groups map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
2024/01/17 21:19:49 [ERROR] [updateClusterHealth] Failed to update cluster [local]: Internal error occurred: failed calling webhook "rancher.cattle.io.clusters.management.cattle.io": failed to call webhook: Post "https://rancher-webhook.cattle-system.svc:443/v1/webhook/mutation/clusters.management.cattle.io?timeout=10s": context deadline exceeded
E0117 21:19:52.882877 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0117 21:19:53.061671 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
2024/01/17 21:19:53 [ERROR] Failed to read API for groups map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
2024/01/17 21:19:55 [ERROR] Failed to connect to peer wss://10.42.1.23/v3/connect [local ID=10.42.0.26]: dial tcp 10.42.1.23:443: i/o timeout
2024/01/17 21:19:55 [ERROR] Failed to connect to peer wss://10.42.1.22/v3/connect [local ID=10.42.0.26]: dial tcp 10.42.1.22:443: i/o timeout
E0117 21:19:37.826713 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0117 21:19:37.918579 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
2024/01/17 21:19:37 [ERROR] Failed to read API for groups map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
E0117 21:19:45.604537 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0117 21:19:45.713901 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
2024/01/17 21:19:45 [ERROR] Failed to read API for groups map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
2024/01/17 21:19:49 [ERROR] Failed to connect to peer wss://10.42.0.26/v3/connect [local ID=10.42.1.22]: dial tcp 10.42.0.26:443: i/o timeout
E0117 21:19:52.899035 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
E0117 21:19:52.968048 34 gvks.go:69] failed to sync schemas: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
2024/01/17 21:19:52 [ERROR] Failed to read API for groups map[metrics.k8s.io/v1beta1:the server is currently unable to handle the request]
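The peer errors above are easier to read when collapsed into one line per source and destination pair. A minimal sketch that does this for log text on stdin; summarize_peer_failures is a hypothetical helper name, and the same-/24 versus cross-/24 label assumes the RKE2 default of one /24 pod subnet per node, so it is only a rough proxy for same-node versus cross-node traffic.

```shell
# summarize_peer_failures: hypothetical helper (not part of kubectl or Rancher).
# Reads Rancher log lines on stdin and prints, for each distinct
# "Failed to connect to peer" pair:
#   <local pod IP> -> <peer pod IP>:<port> count=<n> <same-/24|cross-/24>
summarize_peer_failures() {
  # Keep only the peer-failure lines, reduced to "local peer port".
  sed -n 's|.*Failed to connect to peer wss://\([0-9.]*\)/v3/connect \[local ID=\([0-9.]*\)]: dial tcp [0-9.]*:\([0-9]*\): .*|\2 \1 \3|p' \
    | LC_ALL=C sort | uniq -c \
    | awk '{
        # Assumption: the first three octets identify the node pod subnet.
        split($2, a, "."); split($3, b, ".")
        local_net = a[1] "." a[2] "." a[3]
        peer_net  = b[1] "." b[2] "." b[3]
        scope = (local_net == peer_net) ? "same-/24" : "cross-/24"
        print $2 " -> " $3 ":" $4, "count=" $1, scope
      }'
}

# Usage (needs cluster access). With a label selector, kubectl logs
# returns only the last 10 lines per pod unless --tail is given:
#   kubectl -n cattle-system logs -l app=rancher --tail=-1 | summarize_peer_failures
```

In the excerpts above, every failing pair crosses between the 10.42.0.x and 10.42.1.x ranges, so pod-to-pod connectivity between nodes may be worth checking separately.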
I'm sure I did something wrong, but I don't know what, or how to troubleshoot further.