
I'm running a local three-node Kubernetes cluster on my LAN. The three nodes are my router, my media server, and my PC. The PC runs Pop!_OS, and I'm having a hard time getting it set up.
I used kubeadm to bootstrap the cluster and join the nodes together, and I'm using the Calico CNI.
The first problem is that I can't seem to permanently disable swap on Pop!_OS: I can run sudo swapoff -a, but swap comes back after a reboot. Normally this would be as simple as removing the swap entry from /etc/fstab; in this case, however, there is no swap entry to remove.
My first question is: how do I permanently disable swap on Pop!_OS?
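For reference, these are the commands I'd expect to show where swap is actually being activated from, and how I'd try to disable that source persistently (a sketch; the unit name passed to systemctl mask is just a placeholder for whatever shows up on my machine):
# Show active swap devices/files and their sources
swapon --show
cat /proc/swaps
# Pop!_OS may be activating swap through a systemd .swap unit
# (e.g. one generated by systemd-gpt-auto-generator) rather than /etc/fstab
systemctl --type swap --all
systemctl list-unit-files --type swap
# If a swap unit shows up, masking it should survive reboots;
# "dev-nvme0n1p3.swap" is a placeholder for the unit reported above
sudo systemctl mask dev-nvme0n1p3.swap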
Right now, kubelet is in a crash loop until I disable swap and restart it:
sudo swapoff -a
sudo systemctl restart kubelet
That lets the system, Calico, and Nvidia pods (among others) start running:
-> % kubectl get pods -A -o wide | grep pop-os
calico-system calico-node-spxz4 1/1 Running 4 (2d12h ago) 5d9h 10.0.0.235 pop-os <none> <none>
calico-system csi-node-driver-cvw7l 2/2 Running 8 (2d12h ago) 5d9h 192.168.179.103 pop-os <none> <none>
default gpu-feature-discovery-8rx9w 1/1 Running 4 (2d12h ago) 5d9h 192.168.179.105 pop-os <none> <none>
default gpu-operator-1711318735-node-feature-discovery-worker-mc5xv 1/1 Running 5 (2d12h ago) 5d9h 192.168.179.99 pop-os <none> <none>
default nvidia-container-toolkit-daemonset-tmjt9 1/1 Running 4 (2d12h ago) 5d9h 192.168.179.106 pop-os <none> <none>
default nvidia-cuda-validator-ndcr4 0/1 Completed 0 19m <none> pop-os <none> <none>
default nvidia-dcgm-exporter-th8w4 1/1 Running 4 (25h ago) 5d9h 192.168.179.101 pop-os <none> <none>
default nvidia-device-plugin-daemonset-66576 1/1 Running 4 (2d12h ago) 5d9h 192.168.179.102 pop-os <none> <none>
default nvidia-operator-validator-sl5kc 1/1 Running 4 (2d12h ago) 5d9h 192.168.179.100 pop-os <none> <none>
kube-system kube-proxy-mjncv 1/1 Running 5 (2d12h ago) 5d9h 10.0.0.235 pop-os <none> <none>
...however, the GPU metrics exporter I'm trying to schedule on that node will not run:
-> % kubectl describe pod nvidia-exporter-pc-5b78bdcd6d-vmtq4
Name: nvidia-exporter-pc-5b78bdcd6d-vmtq4
Namespace: default
Priority: 0
Runtime Class Name: nvidia
Service Account: default
Node: <none>
Labels: app=nvidia-exporter-pc
pod-template-hash=5b78bdcd6d
Annotations: <none>
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/nvidia-exporter-pc-5b78bdcd6d
Containers:
nvidia-exporter-pc:
Image: utkuozdemir/nvidia_gpu_exporter:1.2.0
Port: 9835/TCP
Host Port: 0/TCP
Environment:
NVIDIA_VISIBLE_DEVICES: all
Mounts:
/dev/nvidia0 from nvidia0 (rw)
/dev/nvidiactl from nvidiactl (rw)
/usr/bin/nvidia-smi from nvidia-smi (rw)
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so from nvidia-lib (rw)
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1 from nvidia-lib-1 (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-khzrv (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
nvidiactl:
Type: HostPath (bare host directory volume)
Path: /dev/nvidiactl
HostPathType:
nvidia0:
Type: HostPath (bare host directory volume)
Path: /dev/nvidia0
HostPathType:
nvidia-smi:
Type: HostPath (bare host directory volume)
Path: /usr/bin/nvidia-smi
HostPathType:
nvidia-lib:
Type: HostPath (bare host directory volume)
Path: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so
HostPathType:
nvidia-lib-1:
Type: HostPath (bare host directory volume)
Path: /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
HostPathType:
kube-api-access-khzrv:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
QoS Class: BestEffort
Node-Selectors: kubernetes.io/hostname=pop-os
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 30s default-scheduler 0/3 nodes are available: 1 node(s) were unschedulable, 2 node(s) didn't match Pod's node affinity/selector. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
I suspect this is caused by a warning that appears in the node description:
Warning InvalidDiskCapacity 24m kubelet invalid capacity 0 on image filesystem
...however, from what I've read, this is usually a harmless warning that should resolve itself quickly and doesn't block any scheduling.
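In case it's relevant, this is how I believe I can check what the kubelet and containerd actually report for the image filesystem (a sketch; I'm assuming jq is installed and that .node.runtime.imageFs is the right path in the stats summary):
# Ask the kubelet (via the API server proxy) for its stats summary
# and pull out the image filesystem section
kubectl get --raw /api/v1/nodes/pop-os/proxy/stats/summary | jq '.node.runtime.imageFs'
# Ask containerd directly on the node what it reports for the image filesystem
sudo crictl imagefsinfo
The full node description follows: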
Name: pop-os
Roles: <none>
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
feature.node.kubernetes.io/cpu-cpuid.ADX=true
feature.node.kubernetes.io/cpu-cpuid.AESNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX=true
feature.node.kubernetes.io/cpu-cpuid.AVX2=true
feature.node.kubernetes.io/cpu-cpuid.AVX512BITALG=true
feature.node.kubernetes.io/cpu-cpuid.AVX512BW=true
feature.node.kubernetes.io/cpu-cpuid.AVX512CD=true
feature.node.kubernetes.io/cpu-cpuid.AVX512DQ=true
feature.node.kubernetes.io/cpu-cpuid.AVX512F=true
feature.node.kubernetes.io/cpu-cpuid.AVX512IFMA=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VBMI2=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VL=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VNNI=true
feature.node.kubernetes.io/cpu-cpuid.AVX512VPOPCNTDQ=true
feature.node.kubernetes.io/cpu-cpuid.CMPXCHG8=true
feature.node.kubernetes.io/cpu-cpuid.FLUSH_L1D=true
feature.node.kubernetes.io/cpu-cpuid.FMA3=true
feature.node.kubernetes.io/cpu-cpuid.FSRM=true
feature.node.kubernetes.io/cpu-cpuid.FXSR=true
feature.node.kubernetes.io/cpu-cpuid.FXSROPT=true
feature.node.kubernetes.io/cpu-cpuid.GFNI=true
feature.node.kubernetes.io/cpu-cpuid.IA32_ARCH_CAP=true
feature.node.kubernetes.io/cpu-cpuid.IBPB=true
feature.node.kubernetes.io/cpu-cpuid.LAHF=true
feature.node.kubernetes.io/cpu-cpuid.MD_CLEAR=true
feature.node.kubernetes.io/cpu-cpuid.MOVBE=true
feature.node.kubernetes.io/cpu-cpuid.MPX=true
feature.node.kubernetes.io/cpu-cpuid.OSXSAVE=true
feature.node.kubernetes.io/cpu-cpuid.PSFD=true
feature.node.kubernetes.io/cpu-cpuid.SHA=true
feature.node.kubernetes.io/cpu-cpuid.SPEC_CTRL_SSBD=true
feature.node.kubernetes.io/cpu-cpuid.STIBP=true
feature.node.kubernetes.io/cpu-cpuid.SYSCALL=true
feature.node.kubernetes.io/cpu-cpuid.SYSEE=true
feature.node.kubernetes.io/cpu-cpuid.VAES=true
feature.node.kubernetes.io/cpu-cpuid.VMX=true
feature.node.kubernetes.io/cpu-cpuid.VPCLMULQDQ=true
feature.node.kubernetes.io/cpu-cpuid.X87=true
feature.node.kubernetes.io/cpu-cpuid.XGETBV1=true
feature.node.kubernetes.io/cpu-cpuid.XSAVE=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEC=true
feature.node.kubernetes.io/cpu-cpuid.XSAVEOPT=true
feature.node.kubernetes.io/cpu-cpuid.XSAVES=true
feature.node.kubernetes.io/cpu-cstate.enabled=true
feature.node.kubernetes.io/cpu-hardware_multithreading=true
feature.node.kubernetes.io/cpu-model.family=6
feature.node.kubernetes.io/cpu-model.id=167
feature.node.kubernetes.io/cpu-model.vendor_id=Intel
feature.node.kubernetes.io/cpu-pstate.scaling_governor=powersave
feature.node.kubernetes.io/cpu-pstate.status=active
feature.node.kubernetes.io/cpu-pstate.turbo=true
feature.node.kubernetes.io/kernel-config.NO_HZ=true
feature.node.kubernetes.io/kernel-config.NO_HZ_IDLE=true
feature.node.kubernetes.io/kernel-version.full=6.8.0-76060800daily20240311-generic
feature.node.kubernetes.io/kernel-version.major=6
feature.node.kubernetes.io/kernel-version.minor=8
feature.node.kubernetes.io/kernel-version.revision=0
feature.node.kubernetes.io/pci-10de.present=true
feature.node.kubernetes.io/pci-8086.present=true
feature.node.kubernetes.io/storage-nonrotationaldisk=true
feature.node.kubernetes.io/system-os_release.ID=pop
feature.node.kubernetes.io/system-os_release.VERSION_ID=22.04
feature.node.kubernetes.io/system-os_release.VERSION_ID.major=22
feature.node.kubernetes.io/system-os_release.VERSION_ID.minor=04
feature.node.kubernetes.io/usb-ef_043e_9a39.present=true
feature.node.kubernetes.io/usb-ef_046d_081b.present=true
kubernetes.io/arch=amd64
kubernetes.io/hostname=pop-os
kubernetes.io/os=linux
nvidia.com/cuda.driver.major=550
nvidia.com/cuda.driver.minor=67
nvidia.com/cuda.driver.rev=
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=4
nvidia.com/gfd.timestamp=1711817271
nvidia.com/gpu-driver-upgrade-state=pod-restart-required
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.compute.minor=6
nvidia.com/gpu.count=1
nvidia.com/gpu.deploy.container-toolkit=true
nvidia.com/gpu.deploy.dcgm=true
nvidia.com/gpu.deploy.dcgm-exporter=true
nvidia.com/gpu.deploy.device-plugin=true
nvidia.com/gpu.deploy.driver=pre-installed
nvidia.com/gpu.deploy.gpu-feature-discovery=true
nvidia.com/gpu.deploy.node-status-exporter=true
nvidia.com/gpu.deploy.operator-validator=true
nvidia.com/gpu.family=ampere
nvidia.com/gpu.machine=MS-7D09
nvidia.com/gpu.memory=10240
nvidia.com/gpu.present=true
nvidia.com/gpu.product=NVIDIA-GeForce-RTX-3080
nvidia.com/gpu.replicas=1
nvidia.com/mig.capable=false
nvidia.com/mig.strategy=single
Annotations: csi.volume.kubernetes.io/nodeid: {"csi.tigera.io":"pop-os"}
kubeadm.alpha.kubernetes.io/cri-socket: unix:///var/run/containerd/containerd.sock
nfd.node.kubernetes.io/feature-labels:
cpu-cpuid.ADX,cpu-cpuid.AESNI,cpu-cpuid.AVX,cpu-cpuid.AVX2,cpu-cpuid.AVX512BITALG,cpu-cpuid.AVX512BW,cpu-cpuid.AVX512CD,cpu-cpuid.AVX512DQ...
nfd.node.kubernetes.io/worker.version: v0.14.2
node.alpha.kubernetes.io/ttl: 0
nvidia.com/gpu-driver-upgrade-enabled: true
projectcalico.org/IPv4Address: 10.0.0.235/24
projectcalico.org/IPv4VXLANTunnelAddr: 192.168.179.64
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Mon, 25 Mar 2024 00:16:19 -0700
Taints: node.kubernetes.io/unschedulable:NoSchedule
Unschedulable: true
Lease:
HolderIdentity: pop-os
AcquireTime: <unset>
RenewTime: Sat, 30 Mar 2024 10:12:10 -0700
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
NetworkUnavailable False Sat, 30 Mar 2024 09:47:37 -0700 Sat, 30 Mar 2024 09:47:37 -0700 CalicoIsUp Calico is running on this node
MemoryPressure False Sat, 30 Mar 2024 10:11:30 -0700 Sat, 30 Mar 2024 09:47:34 -0700 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Sat, 30 Mar 2024 10:11:30 -0700 Sat, 30 Mar 2024 09:47:34 -0700 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Sat, 30 Mar 2024 10:11:30 -0700 Sat, 30 Mar 2024 09:47:34 -0700 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Sat, 30 Mar 2024 10:11:30 -0700 Sat, 30 Mar 2024 09:47:34 -0700 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 10.0.0.235
Hostname: pop-os
Capacity:
cpu: 16
ephemeral-storage: 238222068Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32777432Ki
nvidia.com/gpu: 1
pods: 110
Allocatable:
cpu: 16
ephemeral-storage: 219545457506
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 32675032Ki
nvidia.com/gpu: 1
pods: 110
System Info:
Machine ID: 709fbcb158d7dd28973351156441d28c
System UUID: 43a79490-8c6e-4b1e-ac81-d8bbc1049bdf
Boot ID: 5af7fdf9-5d8b-44bd-934a-46d2f0c379e1
Kernel Version: 6.8.0-76060800daily20240311-generic
OS Image: Pop!_OS 22.04 LTS
Operating System: linux
Architecture: amd64
Container Runtime Version: containerd://1.7.2
Kubelet Version: v1.28.8
Kube-Proxy Version: v1.28.8
PodCIDR: 192.168.2.0/24
PodCIDRs: 192.168.2.0/24
Non-terminated Pods: (9 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits Age
--------- ---- ------------ ---------- --------------- ------------- ---
calico-system calico-node-spxz4 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
calico-system csi-node-driver-cvw7l 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
default gpu-feature-discovery-8rx9w 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
default gpu-operator-1711318735-node-feature-discovery-worker-mc5xv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
default nvidia-container-toolkit-daemonset-tmjt9 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
default nvidia-dcgm-exporter-th8w4 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
default nvidia-device-plugin-daemonset-66576 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
default nvidia-operator-validator-sl5kc 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
kube-system kube-proxy-mjncv 0 (0%) 0 (0%) 0 (0%) 0 (0%) 5d9h
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 0 (0%) 0 (0%)
memory 0 (0%) 0 (0%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
nvidia.com/gpu 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 24m kube-proxy
Normal NodeNotSchedulable 24m kubelet Node pop-os status is now: NodeNotSchedulable
Normal NodeReady 24m (x2 over 24m) kubelet Node pop-os status is now: NodeReady
Normal NodeHasSufficientMemory 24m (x3 over 24m) kubelet Node pop-os status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 24m (x3 over 24m) kubelet Node pop-os status is now: NodeHasNoDiskPressure
Warning InvalidDiskCapacity 24m kubelet invalid capacity 0 on image filesystem
Warning Rebooted 24m (x2 over 24m) kubelet Node pop-os has been rebooted, boot id: 5af7fdf9-5d8b-44bd-934a-46d2f0c379e1
Normal NodeAllocatableEnforced 24m kubelet Updated Node Allocatable limit across pods
Normal Starting 24m kubelet Starting kubelet.
Normal NodeHasSufficientPID 24m (x3 over 24m) kubelet Node pop-os status is now: NodeHasSufficientPID
Normal Starting 16m kubelet Starting kubelet.
Warning InvalidDiskCapacity 16m kubelet invalid capacity 0 on image filesystem
Normal NodeAllocatableEnforced 16m kubelet Updated Node Allocatable limit across pods
Normal NodeHasSufficientMemory 16m kubelet Node pop-os status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 16m kubelet Node pop-os status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 16m kubelet Node pop-os status is now: NodeHasSufficientPID
Normal NodeNotSchedulable 16m kubelet Node pop-os status is now: NodeNotSchedulable
How do I resolve the unschedulable taint and get the GPU exporter (and other) pods running on the PC? I'm still fairly new to Kubernetes, so this is a bit over my head.
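From the node description, the node is marked Unschedulable: true and carries the node.kubernetes.io/unschedulable:NoSchedule taint, which I gather is what a cordoned node looks like. I assume I could clear it with the commands below, but I don't know whether that's actually the right fix or why the node ended up cordoned in the first place (a sketch, not something I'm confident about):
# Confirm the taint and the Unschedulable flag
kubectl describe node pop-os | grep -iE 'taints|unschedulable'
# Mark the node schedulable again (the inverse of kubectl cordon)
kubectl uncordon pop-os
# Verify the taint is gone before retrying the exporter deployment
kubectl get node pop-os -o jsonpath='{.spec.taints}'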