I have a multi-master K8S cluster. I had to remove some of the masters, so I ran kubeadm delete m2
and did the same on the third node (m3),
so that only one master would remain and I could re-join the others later. However, this somehow messed up the remaining master (m1),
which now reports the following errors:
Jan 12 08:56:29 k8s-m1 kubelet[14734]: E0112 08:56:29.314499 14734
eviction_manager.go:256] "Eviction manager: failed to get summary
stats" err="failed to get node info: node \"k8s-m1\" not found"
Jan 12 08:53:15 k8s-m1 kubelet[14734]: E0112 08:53:15.552154 14734
kubelet.go:2448] "Error getting node" err="node \"k8s-m1\" not
found"
Jan 12 08:56:29 k8s-m1 kubelet[14734]: E0112 08:56:29.571175 14734
event.go:276] Unable to write event:
'&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""},
ObjectMeta:v1.ObjectMeta{Name:"k8s-m1.1739835c5d7370d9",
GenerateName:"", Namespace:"default", SelfLink:"", UID:"",
ResourceVersion:"", Generation:0, CreationTimestamp:time.Date(1,
time.January, 1, 0, 0, 0, 0, time.UTC), DeletionTimestamp:<nil>,
DeletionGracePeriodSeconds:(*int64)(nil),
Labels:map[string]string(nil), Annotations:map[string]string(nil),
OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil),
ManagedFields:[]v1.ManagedFieldsEntry(nil)},
InvolvedObject:v1.ObjectReference{Kind:"Node", Namespace:"",
Name:"k8s-m1", UID:"k8s-m1", APIVersion:"", ResourceVersion:"",
FieldPath:""}, Reason:"NodeAllocatableEnforced", Message:"Updated
Node Allocatable limit across pods",
Source:v1.EventSource{Component:"kubelet", Host:"k8s-m1"},
FirstTimestamp:time.Date(2023, time.January, 12, 8, 46, 9,
272926425, time.Local), LastTimestamp:time.Date(2023, time.January,
12, 8, 46, 9, 272926425, time.Local), Count:1, Type:"Normal",
EventTime:time.Date(1, time.January, 1, 0, 0, 0, 0, time.UTC),
Series:(*v1.EventSeries)(nil), Action:"",
Related:(*v1.ObjectReference)(nil), ReportingController:"",
ReportingInstance:""}': 'Post
"https://10.10.40.30:6443/api/v1/namespaces/default/events":
EOF'(may retry after sleeping)
That IP address belongs to the load balancer.
Is there any way to bring this master node back to life, so that I don't have to recreate the whole cluster?
Answer 1
You can try the following troubleshooting steps:
Try deleting
/var/lib/kubelet
and reinstalling the kubelet, then restart the kubelet and docker services:
sudo service docker restart
sudo systemctl restart kubelet
You can also refer to the documentation on kubeadm troubleshooting.
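Taken together, the steps above can be sketched as the following command sequence (a sketch only; it assumes a systemd host with Docker as the container runtime and a Debian/Ubuntu package manager, so adjust for containerd or yum/dnf as needed):

```
# Stop the kubelet before touching its state directory
sudo systemctl stop kubelet

# Remove the kubelet's local state (destructive -- back it up first if unsure)
sudo rm -rf /var/lib/kubelet

# Reinstall the kubelet package (Debian/Ubuntu shown)
sudo apt-get install --reinstall kubelet

# Restart the container runtime and the kubelet
sudo service docker restart
sudo systemctl restart kubelet
```

After the restart, check journalctl -u kubelet to see whether the "node not found" errors persist; if the node object was deleted from the API server, a restart alone may not recover it.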
Answer 2
Since this is an HA setup, I had no choice but to recreate the cluster.
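For reference, recreating a cluster like this one typically looks like the sequence below. This is a sketch, not a verified recipe: the endpoint 10.10.40.30:6443 comes from the logs above, and the token, hash, and certificate key are placeholders printed by kubeadm init on your own cluster.

```
# On every former master: wipe the old cluster state
sudo kubeadm reset -f

# On m1: initialize a fresh control plane behind the load balancer
sudo kubeadm init --control-plane-endpoint "10.10.40.30:6443" --upload-certs

# On m2 and m3: re-join as control-plane nodes using the join command
# that kubeadm init printed (placeholders shown here)
sudo kubeadm join 10.10.40.30:6443 --control-plane \
    --token <token> \
    --discovery-token-ca-cert-hash <hash> \
    --certificate-key <key>
```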