Kubernetes performance is very poor on a 100GbE network

2024-6-2

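We use ConnectX-5 100GbE Ethernet cards in our servers, and the cards are connected to each other through a Mellanox switch. Our Kubernetes cluster uses the weavenet CNI plugin. When we test with the iperf tool using the following commands, we get the full 100Gbps connection speed between the hosts.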

# server host
host1 $ iperf -s -P8
# client host
host2 $ iperf -c <host_ip> -P8
Result: 98.8 Gbps transfer speed

Also, when we run the same test with the same tool and commands using two docker containers on the same hosts (host1, host2), we get the same result.

# server host
host1 $ docker run -it -p 5001:5001 ubuntu:latest-with-iperf iperf -s -P8
# client host
host2 $ docker run -it -p 5001:5001 ubuntu:latest-with-iperf iperf -c <host_ip> -P8
Result: 98.8 Gbps transfer speed

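However, when we create two different deployments from the same image on the same hosts (host1, host2) and run the same test through the service IP (we created a k8s service with the yaml below), which redirects traffic to the server pod, we get only 2Gbps. We also ran the same test using the pod's cluster IP and the service's cluster domain, but the results were the same.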

kubectl create deployment iperf-server --image=ubuntu:latest-with-iperf  # after that we add affinity(host1) and container port sections to the yaml
kubectl create deployment iperf-client --image=ubuntu:latest-with-iperf  # after that we add affinity(host2) and container port sections to the yaml
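
The comments above mention adding affinity and container port sections by hand. As a rough sketch of what that could look like for the server deployment, the snippet below writes a patch file. The file name is hypothetical, and the container name `ubuntu` and the use of the `kubernetes.io/hostname` node label with the value `host1` are assumptions, not something stated in the post.

```shell
# Sketch only: a strategic-merge patch that pins the pod to host1 and
# declares the iperf port. The file name is hypothetical.
cat > iperf-server-affinity-patch.yaml <<'EOF'
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname   # assumed node label
                    operator: In
                    values:
                      - host1
      containers:
        - name: ubuntu   # assumed container name; check the real one first
          ports:
            - containerPort: 5001
EOF

# Applying it needs cluster access, so it is wrapped in a function
# instead of being run directly.
apply_server_patch() {
  kubectl patch deployment iperf-server --patch-file iperf-server-affinity-patch.yaml
}
```

The client deployment would use the same shape with `host2` as the value.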

kind: Service
apiVersion: v1
metadata:
  name: iperf-server
  namespace: default
spec:
  ports:
    - name: iperf
      protocol: TCP
      port: 5001
      targetPort: 5001
  selector:
    name: iperf-server
  clusterIP: 10.104.10.230
  type: ClusterIP
  sessionAffinity: None

TL;DR: the scenarios we tested:

Host 1 (ubuntu 20.04, mellanox driver installed) <--------> Host 2 (ubuntu 20.04, mellanox driver installed) = 98.8 Gbps
Container1-on-host1 <--------> Container2-on-host2 = 98.8 Gbps
Pod1-on-host1 <-------> Pod2-on-host2 (using the pod's cluster ip) = 2Gbps
Pod1-on-host1 <-------> Pod2-on-host2 (using the service's cluster ip) = 2Gbps
Pod1-on-host1 <-------> Pod2-on-host2 (using the service's cluster domain) = 2Gbps

We need 100Gbps for pod-to-pod communication. So what could be causing this problem?

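Update 1:

When I check htop inside the pods during the iperf test, I see 112 CPU cores and none of them is struggling with CPU load.
When I add the hostNetwork: true key to the deployments, the pods can reach up to 100Gbps of bandwidth.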
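
A pod with hostNetwork: true runs in the host's network namespace and therefore bypasses the CNI plugin, which is why this result points at the overlay network rather than the NICs. A minimal sketch of how the key could be added is below. The patch file name is hypothetical, and the dnsPolicy line is an addition that is not in the post; it is included only so that cluster DNS names keep resolving from a host-network pod.

```shell
# Sketch only: a patch that switches the pod template to host networking.
# The file name hostnetwork-patch.yaml is hypothetical.
cat > hostnetwork-patch.yaml <<'EOF'
spec:
  template:
    spec:
      hostNetwork: true
      # Not in the original post: keeps service names resolvable
      dnsPolicy: ClusterFirstWithHostNet
EOF

# Applying the patch needs cluster access, so it is wrapped in a function.
# Usage: apply_hostnetwork_patch iperf-server
apply_hostnetwork_patch() {
  kubectl patch deployment "$1" --patch-file hostnetwork-patch.yaml
}
```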

Answer 1

We solved this by disabling encryption on weavenet. Rebooting the servers did the trick, though. Thanks for the article.
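
The answer does not say how encryption was turned off. One plausible way, assuming it had been enabled through the WEAVE_PASSWORD environment variable on the weave-net DaemonSet, is sketched below. The function names are hypothetical, and the namespace, DaemonSet name, pod label, and container name are the defaults from the standard weave-net manifest, so they may differ in other installs.

```shell
# Sketch only: hypothetical helpers. kube-system, weave-net, the
# name=weave-net label and the "weave" container are assumed defaults.

# Extract the "Encryption:" field from `weave status` output on stdin.
parse_encryption() {
  awk -F': *' '/Encryption:/ { print $2 }'
}

# Report the encryption state as seen by one weave-net pod.
weave_encryption_status() {
  local pod
  pod=$(kubectl -n kube-system get pods -l name=weave-net \
        -o jsonpath='{.items[0].metadata.name}')
  kubectl -n kube-system exec "$pod" -c weave -- \
    /home/weave/weave --local status | parse_encryption
}

# Remove WEAVE_PASSWORD from the DaemonSet (a trailing "-" unsets a variable).
disable_weave_encryption() {
  kubectl -n kube-system set env daemonset/weave-net -c weave WEAVE_PASSWORD-
}
```

Calling `weave_encryption_status` before and after `disable_weave_encryption` would show whether the change reached the running pods. The answer reports that a server reboot was also involved.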
