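I'm having trouble getting metrics-server to start. The initial deployment was in 2021 (version v0.6.1) and it had been running for a couple of years. After a crash recovery, metrics-server would not start, and there are TLS errors in the logs. I tried redeploying both the current version (v0.6.3) and the old version (v0.6.1), but hit the same problem.
Deployment status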
kube-state-metrics-898575cdb-rwrsq 1/1 Running 0 23d 10.233.92.183 node3 <none> <none>
metrics-server-68c5fc6c44-676zj 0/1 Running 0 7m37s 10.233.96.18 node2 <none> <none>
I think the problem is with metric storage. After going through everything below, I found this when probing the readyz condition:
[-]metric-storage-ready failed: reason withheld
Looking at the logs shows a TLS error, but I think that is only a symptom, not the cause:
$ kubectl logs metrics-server-68c5fc6c44-676zj -nkube-system
Error from server: Get "https://10.0.92.31:10250/containerLogs/kube-system/metrics-server-68c5fc6c44-676zj/metrics-server": remote error: tls: internal error
Searching turned up "Kubernetes metrics server having SSL issues" from a few years ago. I verified that we were, and still are, using the --kubelet-insecure-tls flag.
Deployment description
$ kubectl describe deployment metrics-server -nkube-system
Name: metrics-server
Namespace: kube-system
CreationTimestamp: Tue, 28 Mar 2023 11:37:53 -0400
Labels: k8s-app=metrics-server
Annotations: deployment.kubernetes.io/revision: 2
Selector: k8s-app=metrics-server
Replicas: 1 desired | 1 updated | 1 total | 0 available | 1 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 0 max unavailable, 25% max surge
Pod Template:
Labels: k8s-app=metrics-server
Service Account: metrics-server
Containers:
metrics-server:
Image: k8s.gcr.io/metrics-server/metrics-server:v0.6.1
Port: 4443/TCP
Host Port: 0/TCP
Args:
--cert-dir=/tmp
--secure-port=4443
--kubelet-preferred-address-types=InternalIP
--kubelet-use-node-status-port
--metric-resolution=15s
--kubelet-insecure-tls
Requests:
cpu: 100m
memory: 200Mi
Liveness: http-get https://:https/livez delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get https://:https/readyz delay=20s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/tmp from tmp-dir (rw)
Volumes:
tmp-dir:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium:
SizeLimit: <unset>
Priority Class Name: system-cluster-critical
Conditions:
Type Status Reason
---- ------ ------
Progressing True NewReplicaSetAvailable
Available False MinimumReplicasUnavailable
OldReplicaSets: <none>
NewReplicaSet: metrics-server-68c5fc6c44 (1/1 replicas created)
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ScalingReplicaSet 34m deployment-controller Scaled up replica set metrics-server-6594d67d48 to 1
Normal ScalingReplicaSet 13m deployment-controller Scaled down replica set metrics-server-6594d67d48 to 0
Normal ScalingReplicaSet 13m deployment-controller Scaled down replica set metrics-server-68c5fc6c44 to 0
Normal ScalingReplicaSet 13m (x2 over 15m) deployment-controller Scaled up replica set metrics-server-68c5fc6c44 to 1
Further searching led to "Metrics-server is in CrashLoopBackOff with a new install by rke", which suggests checking the responses from livez and readyz.
Here is what I get:
$ time curl -k https://10.233.96.18:4443/livez
ok
real 0m0.019s
user 0m0.000s
sys 0m0.010s
$ time curl -k https://10.233.96.18:4443/readyz
[+]ping ok
[+]log ok
[+]poststarthook/generic-apiserver-start-informers ok
[+]informer-sync ok
[+]poststarthook/max-in-flight-filter ok
[-]metric-storage-ready failed: reason withheld
[+]metadata-informer-sync ok
[+]shutdown ok
readyz check failed
real 0m0.013s
user 0m0.009s
sys 0m0.000s
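To make the failing check stand out from the passing ones, the readyz text can be filtered down to just the lines marked [-]. This is only a small sketch: failed_checks is a helper name I made up, and it assumes the output keeps the "[+]name ok" / "[-]name failed: ..." layout shown above.

```shell
# Hypothetical helper: read readyz/livez text on stdin and print
# only the names of the checks that are marked as failing.
failed_checks() {
  # Failing lines start with "[-]"; the check name runs up to the first space.
  sed -n 's/^\[-\]\([^ ]*\).*/\1/p'
}

# Example use against the pod IP from this post
# (-s hides curl's progress output, -k skips certificate verification):
# curl -sk https://10.233.96.18:4443/readyz | failed_checks
```

It only narrows the output down to the failing check name; it does not explain why the check fails.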
So now the question is about this line: [-]metric-storage-ready failed: reason withheld
What is that? Is it the reason the deployment is failing?