Why do I lose access to Kubernetes after turning on IPVS mode?

2024-6-2

The problem: after I enable IPVS mode in kube-proxy, everything works fine. But as soon as I install Traefik, I immediately lose connectivity to Kubernetes.

OS: CentOS 7.9

$ uname -rs
Linux 3.10.0-1160.71.1.el7.x86_64

Kubernetes: 1.22.2, CNI: Calico

$ kubectl get nodes
NAME                         STATUS   ROLES                  AGE   VERSION
srv-dev-is-elt-01            Ready    control-plane,master   21h   v1.22.2
srv-dev-is-zabbix-01         Ready    control-plane,master   21h   v1.22.2
srv-nt-bpmtest-postgres-01   Ready    control-plane,master   21h   v1.22.2
srv-nt-bpmtest-postgres-02   Ready    <none>                 21h   v1.22.2
srv-rnt-rrsys-minio          Ready    <none>                 21h   v1.22.2

Below is the sequence of steps after which I lose connectivity to Kubernetes.

IPVS (IP Virtual Server) support is enabled on all nodes:

$ lsmod | grep -e ip_vs -e nf_conntrack_ipv4
nf_conntrack_ipv4      15053  32 
nf_defrag_ipv4         12729  1 nf_conntrack_ipv4
ip_vs_lc               12516  0 
ip_vs_sh               12688  0 
ip_vs_wrr              12697  0 
ip_vs_rr               12600  0 
ip_vs                 145458  8 ip_vs_lc,ip_vs_rr,ip_vs_sh,ip_vs_wrr
nf_conntrack          139264  10 ip_vs,nf_nat,nf_nat_ipv4,nf_nat_ipv6,xt_conntrack,nf_nat_masquerade_ipv4,nf_nat_masquerade_ipv6,nf_conntrack_netlink,nf_conntrack_ipv4,nf_conntrack_ipv6
libcrc32c              12644  3 ip_vs,nf_nat,nf_conntrack
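
A minimal sketch of how the module check above could be scripted. The function name missing_ipvs_modules and the list of required modules are assumptions for illustration (the list simply mirrors what the lsmod output above shows on this 3.10 kernel). It reads lsmod-style text from stdin, so it can be tried without touching a node.

```shell
# Hypothetical helper: prints every module from the assumed "want" list
# that does not appear in the first column of lsmod-style input on stdin.
missing_ipvs_modules() {
  awk -v want="ip_vs ip_vs_rr ip_vs_wrr ip_vs_sh ip_vs_lc nf_conntrack_ipv4" '
    { loaded[$1] = 1 }                      # remember every loaded module name
    END {
      n = split(want, w, " ")
      for (i = 1; i <= n; i++)
        if (!(w[i] in loaded)) print w[i]   # report the ones that are absent
    }'
}

# Usage on a node (not run here):
#   lsmod | missing_ipvs_modules
```

Empty output means every module in the list is loaded.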

I make sure that Traefik is not installed via the Helm chart:

$ helm list -A
NAME    NAMESPACE   REVISION    UPDATED STATUS  CHART   APP VERSION

Then I turn on IPVS mode in kube-proxy:

$ kubectl edit configmap kube-proxy -n kube-system
    iptables:
      masqueradeAll: false
      masqueradeBit: null
      minSyncPeriod: 0s
      syncPeriod: 0s
    ipvs:
      excludeCIDRs: null
      minSyncPeriod: 0s
      scheduler: "lc"
      strictARP: false
      syncPeriod: 0s
      tcpFinTimeout: 0s
      tcpTimeout: 0s
      udpTimeout: 0s
    kind: KubeProxyConfiguration
    metricsBindAddress: ""
    mode: "ipvs"
    nodePortAddresses: null
    oomScoreAdj: null

Here I set mode: "ipvs" instead of mode: "", and scheduler: "lc" instead of scheduler: "". The balancing algorithm is least connection.
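
A small sketch for double-checking the edited value. It assumes a kubeadm-style cluster where the kube-proxy configuration lives under the config.conf key of the ConfigMap. The helper name proxy_mode_from_config is hypothetical. As far as I know, running kube-proxy pods typically pick up the change only after a restart, hence the rollout command in the usage comment.

```shell
# Hypothetical helper: prints the value of the "mode:" key from
# KubeProxyConfiguration YAML on stdin, with the quotes stripped.
proxy_mode_from_config() {
  awk '$1 == "mode:" { gsub(/"/, "", $2); print $2 }'
}

# Usage (not run here; the config.conf key is an assumption about the cluster):
#   kubectl -n kube-system get configmap kube-proxy \
#     -o jsonpath='{.data.config\.conf}' | proxy_mode_from_config
#   kubectl -n kube-system rollout restart daemonset kube-proxy
```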

After looking at the kube-proxy logs, I confirmed that IPVS mode was enabled successfully; see the line Using ipvs Proxier.

I0816 02:53:24.689691       1 node.go:172] Successfully retrieved node IP: 172.24.17.16
I0816 02:53:24.689748       1 server_others.go:140] Detected node IP 172.24.17.16
I0816 02:53:24.744103       1 server_others.go:206] kube-proxy running in dual-stack mode, IPv4-primary
I0816 02:53:24.744172       1 server_others.go:274] Using ipvs Proxier.
I0816 02:53:24.744205       1 server_others.go:276] creating dualStackProxier for ipvs.
W0816 02:53:24.744254       1 server_others.go:495] detect-local-mode set to ClusterCIDR, but no IPv6 cluster CIDR defined, , defaulting to no-op detect-local for IPv6
E0816 02:53:24.744604       1 proxier.go:381] "can't set sysctl net/ipv4/vs/conn_reuse_mode, kernel version must be at least 4.1"
E0816 02:53:24.745339       1 proxier.go:381] "can't set sysctl net/ipv4/vs/conn_reuse_mode, kernel version must be at least 4.1"
W0816 02:53:24.745504       1 ipset.go:113] ipset name truncated; [KUBE-6-LOAD-BALANCER-SOURCE-CIDR] -> [KUBE-6-LOAD-BALANCER-SOURCE-CID]
W0816 02:53:24.745543       1 ipset.go:113] ipset name truncated; [KUBE-6-NODE-PORT-LOCAL-SCTP-HASH] -> [KUBE-6-NODE-PORT-LOCAL-SCTP-HAS]
I0816 02:53:24.745919       1 server.go:649] Version: v1.22.2
I0816 02:53:24.753389       1 conntrack.go:52] Setting nf_conntrack_max to 262144
I0816 02:53:24.753935       1 config.go:315] Starting service config controller
I0816 02:53:24.753967       1 config.go:224] Starting endpoint slice config controller
I0816 02:53:24.753988       1 shared_informer.go:240] Waiting for caches to sync for service config
I0816 02:53:24.754010       1 shared_informer.go:240] Waiting for caches to sync for endpoint slice config
E0816 02:53:24.759251       1 event_broadcaster.go:253] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"srv-nt-bpmtest-postgres-01.170bb3a40a
I0816 02:53:24.854413       1 shared_informer.go:247] Caches are synced for endpoint slice config
I0816 02:53:24.854494       1 shared_informer.go:247] Caches are synced for service config
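
The same log check can be sketched as two small filters. Both function names are hypothetical. The first pulls the proxier name out of the "Using ... Proxier" line, and the second keeps only error-level lines, which in this log format start with E followed by the date.

```shell
# Hypothetical helper: prints the proxier named in a "Using <name> Proxier" line.
proxier_from_log() {
  sed -n 's/.*Using \([a-z]*\) Proxier.*/\1/p'
}

# Hypothetical helper: keeps only error-level lines (E + 4-digit date prefix).
klog_errors() {
  grep -E '^E[0-9]{4} '
}

# Usage (not run here; the pod name is taken from the listing in this post):
#   kubectl -n kube-system logs kube-proxy-pwgnm | proxier_from_log
#   kubectl -n kube-system logs kube-proxy-pwgnm | klog_errors
```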

I also checked that all virtual services now use the lc scheduler:

$ sudo ipvsadm -ln
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.96.0.1:443 lc
  -> 172.24.16.225:6443           Masq    1      0          0         
  -> 172.24.16.226:6443           Masq    1      0          0         
  -> 172.24.17.16:6443            Masq    1      1          0         
TCP  10.96.0.10:53 lc
  -> 192.168.236.7:53             Masq    1      0          0         
  -> 192.168.236.10:53            Masq    1      0          0         
TCP  10.96.0.10:9153 lc
  -> 192.168.236.7:9153           Masq    1      0          0         
  -> 192.168.236.10:9153          Masq    1      0          0         
TCP  10.101.219.53:9094 lc
  -> 192.168.236.8:9094           Masq    1      0          0         
TCP  10.105.217.135:443 lc
  -> 192.168.236.6:5443           Masq    1      0          0         
  -> 192.168.236.9:5443           Masq    1      0          0         
TCP  10.107.33.224:5473 lc
  -> 172.24.16.225:5473           Masq    1      0          0         
  -> 172.24.17.16:5473            Masq    1      0          0         
  -> 172.24.17.17:5473            Masq    1      0          0         
UDP  10.96.0.10:53 lc
  -> 192.168.236.7:53             Masq    1      0          0         
  -> 192.168.236.10:53            Masq    1      0          0
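
Instead of reading the table by eye, the scheduler column can be checked with a short filter. This is only a sketch; the function name services_not_using is hypothetical, and it assumes the ipvsadm -ln layout shown above (protocol, address:port, scheduler on each virtual service line).

```shell
# Hypothetical helper: reads "ipvsadm -ln" output on stdin and prints every
# virtual service whose scheduler differs from the one passed as $1.
services_not_using() {
  awk -v sched="$1" '
    ($1 == "TCP" || $1 == "UDP" || $1 == "SCTP") && $3 != sched {
      print $1, $2, $3
    }'
}

# Usage on a node (not run here):
#   sudo ipvsadm -ln | services_not_using lc
```

Empty output means every listed virtual service uses the expected scheduler.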

I check once more that I still have a connection to Kubernetes:

$ kubectl get all -A
NAMESPACE          NAME                                                     READY   STATUS    RESTARTS        AGE
calico-apiserver   pod/calico-apiserver-6b9c675d9-9kwgs                     1/1     Running   198 (64m ago)   21h
calico-apiserver   pod/calico-apiserver-6b9c675d9-9lkpj                     1/1     Running   198 (64m ago)   21h
calico-system      pod/calico-kube-controllers-6f875db9f6-lkz5q             1/1     Running   198 (65m ago)   21h
calico-system      pod/calico-node-9lwdx                                    1/1     Running   2 (73m ago)     111m
calico-system      pod/calico-node-fwx2q                                    1/1     Running   3 (64m ago)     111m
calico-system      pod/calico-node-jfmpn                                    1/1     Running   2 (73m ago)     112m
calico-system      pod/calico-node-nm2wv                                    1/1     Running   1 (99m ago)     112m
calico-system      pod/calico-node-rfslp                                    1/1     Running   1 (100m ago)    112m
calico-system      pod/calico-typha-694b7cc975-4gwdp                        1/1     Running   2 (100m ago)    21h
calico-system      pod/calico-typha-694b7cc975-9w7rd                        1/1     Running   9 (73m ago)     21h
calico-system      pod/calico-typha-694b7cc975-kchjm                        1/1     Running   21 (64m ago)    21h
kube-system        pod/coredns-78fcd69978-4fnhn                             1/1     Running   6 (64m ago)     21h
kube-system        pod/coredns-78fcd69978-r4wf5                             1/1     Running   6 (64m ago)     21h
kube-system        pod/kube-apiserver-srv-dev-is-elt-01                     1/1     Running   215 (68m ago)   21h
kube-system        pod/kube-apiserver-srv-dev-is-zabbix-01                  1/1     Running   201 (68m ago)   21h
kube-system        pod/kube-apiserver-srv-nt-bpmtest-postgres-01            1/1     Running   217 (64m ago)   21h
kube-system        pod/kube-controller-manager-srv-dev-is-elt-01            1/1     Running   15 (73m ago)    21h
kube-system        pod/kube-controller-manager-srv-dev-is-zabbix-01         1/1     Running   8 (73m ago)     21h
kube-system        pod/kube-controller-manager-srv-nt-bpmtest-postgres-01   1/1     Running   14 (64m ago)    21h
kube-system        pod/kube-proxy-49xzk                                     1/1     Running   2 (99m ago)     21h
kube-system        pod/kube-proxy-ftrdk                                     1/1     Running   2 (73m ago)     21h
kube-system        pod/kube-proxy-jj5zw                                     1/1     Running   2 (73m ago)     21h
kube-system        pod/kube-proxy-pht8d                                     1/1     Running   2 (100m ago)    21h
kube-system        pod/kube-proxy-pwgnm                                     1/1     Running   3 (64m ago)     106m
kube-system        pod/kube-scheduler-srv-dev-is-elt-01                     1/1     Running   16 (73m ago)    21h
kube-system        pod/kube-scheduler-srv-dev-is-zabbix-01                  1/1     Running   8 (73m ago)     21h
kube-system        pod/kube-scheduler-srv-nt-bpmtest-postgres-01            1/1     Running   16 (64m ago)    21h
tigera-operator    pod/tigera-operator-57b5454687-2rfmt                     1/1     Running   15 (64m ago)    21h

NAMESPACE          NAME                                      TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)                  AGE
calico-apiserver   service/calico-api                        ClusterIP   10.105.217.135   <none>        443/TCP                  21h
calico-system      service/calico-kube-controllers-metrics   ClusterIP   10.101.219.53    <none>        9094/TCP                 21h
calico-system      service/calico-typha                      ClusterIP   10.107.33.224    <none>        5473/TCP                 21h
default            service/kubernetes                        ClusterIP   10.96.0.1        <none>        443/TCP                  21h
kube-system        service/kube-dns                          ClusterIP   10.96.0.10       <none>        53/UDP,53/TCP,9153/TCP   21h

NAMESPACE       NAME                         DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
calico-system   daemonset.apps/calico-node   5         5         5       5            5           kubernetes.io/os=linux   21h
kube-system     daemonset.apps/kube-proxy    5         5         5       5            5           kubernetes.io/os=linux   21h

NAMESPACE          NAME                                      READY   UP-TO-DATE   AVAILABLE   AGE
calico-apiserver   deployment.apps/calico-apiserver          2/2     2            2           21h
calico-system      deployment.apps/calico-kube-controllers   1/1     1            1           21h
calico-system      deployment.apps/calico-typha              3/3     3            3           21h
kube-system        deployment.apps/coredns                   2/2     2            2           21h
tigera-operator    deployment.apps/tigera-operator           1/1     1            1           21h

NAMESPACE          NAME                                                 DESIRED   CURRENT   READY   AGE
calico-apiserver   replicaset.apps/calico-apiserver-6b9c675d9           2         2         2       21h
calico-system      replicaset.apps/calico-kube-controllers-6f875db9f6   1         1         1       21h
calico-system      replicaset.apps/calico-typha-694b7cc975              3         3         3       21h
kube-system        replicaset.apps/coredns-78fcd69978                   2         2         2       21h
tigera-operator    replicaset.apps/tigera-operator-57b5454687           1         1         1       21h

Finally, I try to install Traefik. But first, in values.yaml I edit the externalIPs entry and add an array of my cluster's IP addresses there:

  loadBalancerSourceRanges: []
    # - 192.168.0.1/32
    # - 172.16.0.0/16
  externalIPs:
    - 172.24.17.16 # master1
    - 172.24.16.226 # master2
    - 172.24.16.225 # master3
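
One thing that may be worth checking at this step, offered as a sketch rather than a confirmed diagnosis: as far as I understand, in IPVS mode kube-proxy binds Service addresses, externalIPs included, to the kube-ipvs0 dummy interface on every node. If an externalIP is also a node's own address, other nodes may then treat that address as local. The helper name overlapping_ips is hypothetical; it prints the addresses that appear in both space-separated lists.

```shell
# Hypothetical helper: prints each address from list $1 that also occurs
# in list $2 (both are space-separated strings).
overlapping_ips() {
  local ext
  for ext in $1; do
    case " $2 " in
      *" $ext "*) echo "$ext" ;;
    esac
  done
}

# Usage (not run here):
#   nodes="$(kubectl get nodes \
#     -o jsonpath='{.items[*].status.addresses[?(@.type=="InternalIP")].address}')"
#   overlapping_ips "172.24.17.16 172.24.16.226 172.24.16.225" "$nodes"
#   ip -4 addr show dev kube-ipvs0   # addresses kube-proxy has bound locally
```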

I start the Traefik installation:

$ helm install traefik traefik/ -n traefik
Error: failed post-install: warning: Hook post-install traefik/templates/dashboard-hook-ingressroute.yaml failed: rpc error: code = Unavailable desc = error reading from server: read tcp 172.24.17.16:38452->172.24.16.225:2379: read: connection reset by peer

And:

$ kubectl get all -A
Error from server: etcdserver: request timed out
Error from server: etcdserver: request timed out
Error from server: etcdserver: request timed out
Error from server: etcdserver: request timed out
The connection to the server 172.24.18.188:6443 was refused - did you specify the right host or port?
The connection to the server 172.24.18.188:6443 was refused - did you specify the right host or port?
The connection to the server 172.24.18.188:6443 was refused - did you specify the right host or port?
The connection to the server 172.24.18.188:6443 was refused - did you specify the right host or port?
The connection to the server 172.24.18.188:6443 was refused - did you specify the right host or port?
The connection to the server 172.24.18.188:6443 was refused - did you specify the right host or port?

etcd:

[root@srv-nt-bpmtest-postgres-01 m.kostromin]# systemctl status etcd
● etcd.service - Etcd Server
   Loaded: loaded (/usr/lib/systemd/system/etcd.service; enabled; vendor preset: disabled)
   Active: active (running) since Tue 2022-08-16 07:58:43 MSK; 15s ago
 Main PID: 3719 (etcd)
    Tasks: 21
   Memory: 56.0M
   CGroup: /system.slice/etcd.service
           └─3719 /usr/bin/etcd --name=etcd1 --data-dir=/opt/etcd-data/etcd1.etcd --listen-client-urls=https://172.24.17.16:2379,https://127.0.0.1:2379

Aug 16 07:58:43 srv-nt-bpmtest-postgres-01 etcd[3719]: serving client requests on 127.0.0.1:2379
Aug 16 07:58:43 srv-nt-bpmtest-postgres-01 etcd[3719]: serving client requests on 172.24.17.16:2379
Aug 16 07:58:46 srv-nt-bpmtest-postgres-01 bash[3719]: proto: no coders for int
Aug 16 07:58:46 srv-nt-bpmtest-postgres-01 bash[3719]: proto: no encoder for ValueSize int [GetProperties]
Aug 16 07:58:48 srv-nt-bpmtest-postgres-01 etcd[3719]: health check for peer c721fffd85ddc9e0 could not connect: dial tcp 172.24.16.226:2380: connect: no route to host (prober "ROUND_TRIPPER_SNAPSHOT")
Aug 16 07:58:48 srv-nt-bpmtest-postgres-01 etcd[3719]: health check for peer c721fffd85ddc9e0 could not connect: dial tcp 172.24.16.226:2380: connect: no route to host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
Aug 16 07:58:53 srv-nt-bpmtest-postgres-01 etcd[3719]: health check for peer c721fffd85ddc9e0 could not connect: dial tcp 172.24.16.226:2380: i/o timeout (prober "ROUND_TRIPPER_SNAPSHOT")
Aug 16 07:58:53 srv-nt-bpmtest-postgres-01 etcd[3719]: health check for peer c721fffd85ddc9e0 could not connect: dial tcp 172.24.16.226:2380: i/o timeout (prober "ROUND_TRIPPER_RAFT_MESSAGE")
Aug 16 07:58:58 srv-nt-bpmtest-postgres-01 etcd[3719]: health check for peer c721fffd85ddc9e0 could not connect: dial tcp 172.24.16.226:2380: connect: no route to host (prober "ROUND_TRIPPER_SNAPSHOT")
Aug 16 07:58:58 srv-nt-bpmtest-postgres-01 etcd[3719]: health check for peer c721fffd85ddc9e0 could not connect: dial tcp 172.24.16.226:2380: connect: no route to host (prober "ROUND_TRIPPER_RAFT_MESSAGE")
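
To summarize which etcd peers are failing without scrolling through the journal, the health-check lines can be reduced to unique peer/address pairs. This is a sketch that assumes the message format shown above; the function name unreachable_peers is hypothetical.

```shell
# Hypothetical helper: reads etcd journal lines on stdin and prints each
# distinct "<peer-id> <ip:port>" pair from failed peer health checks.
unreachable_peers() {
  sed -n 's/.*health check for peer \([0-9a-f]*\) could not connect: dial tcp \([0-9.:]*\):.*/\1 \2/p' \
    | sort -u
}

# Usage on the etcd host (not run here):
#   journalctl -u etcd --since "10 min ago" | unreachable_peers
#   ip route get 172.24.16.226   # a reply starting with "local" would mean
#                                # this host considers the address its own
```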

Can you tell me what the problem might be?
