Kubernetes node metrics endpoint returns 401


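I have a GKE cluster that, for simplicity, runs only Prometheus, which monitors each member node. I recently upgraded the API server to 1.6 (which introduces RBAC) without any issues. I then added a new node running version 1.6 of the kubelet. Prometheus cannot access the metrics API of this new node.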

[Screenshot: Prometheus targets page]

So I added a ClusterRole, a ClusterRoleBinding, and a ServiceAccount to my namespace and configured the Deployment to use the new ServiceAccount. I then deleted the pod for good measure:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: default
---

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: default
secrets:
- name: prometheus-token-xxxxx

---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: prometheus-prometheus
    component: server
    release: prometheus
  name: prometheus-server
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-prometheus
      component: server
      release: prometheus
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: prometheus-prometheus
        component: server
        release: prometheus
    spec:
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: prometheus
      serviceAccountName: prometheus
      ...

But nothing changed.

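The metrics endpoint returns HTTP/1.1 401 Unauthorized. When I modify the Deployment to include another container with bash and curl installed and make the request manually, I get: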

# curl -vsSk -H "Authorization: Bearer $(</var/run/secrets/kubernetes.io/serviceaccount/token)" https://$NODE_IP:10250/metrics
*   Trying $NODE_IP...
* Connected to $NODE_IP ($NODE_IP) port 10250 (#0)
* found XXX certificates in /etc/ssl/certs/ca-certificates.crt
* found XXX certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
*    server certificate verification SKIPPED
*    server certificate status verification SKIPPED
*    common name: node-running-kubelet-1-6@000000000 (does not match '$NODE_IP')
*    server certificate expiration date OK
*    server certificate activation date OK
*    certificate public key: RSA
*    certificate version: #3
*    subject: CN=node-running-kubelet-1-6@000000000
*    start date: Fri, 07 Apr 2017 22:00:00 GMT
*    expire date: Sat, 07 Apr 2018 22:00:00 GMT
*    issuer: CN=node-running-kubelet-1-6@000000000
*    compression: NULL
* ALPN, server accepted to use http/1.1
> GET /metrics HTTP/1.1
> Host: $NODE_IP:10250
> User-Agent: curl/7.47.0
> Accept: */*
> Authorization: Bearer **censored**
>
< HTTP/1.1 401 Unauthorized
< Date: Mon, 10 Apr 2017 20:04:20 GMT
< Content-Length: 12
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host $NODE_IP left intact
  • Why does that token not allow me to access that resource?
  • How do I check the access granted to the ServiceAccount?

Answer 1

I ran into the same problem and created ticket https://github.com/prometheus/prometheus/issues/2606 for it, and updated the example configuration through PR https://github.com/prometheus/prometheus/pull/2641

You can see the updated relabeling for the kubernetes-nodes job at https://github.com/prometheus/prometheus/blob/master/documentation/examples/prometheus-kubernetes.yml#L76-L84

Copied here for reference:

  relabel_configs:
  - action: labelmap
    regex: __meta_kubernetes_node_label_(.+)
  - target_label: __address__
    replacement: kubernetes.default.svc:443
  - source_labels: [__meta_kubernetes_node_name]
    regex: (.+)
    target_label: __metrics_path__
    replacement: /api/v1/nodes/${1}/proxy/metrics
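
To see what the relabeling above makes Prometheus request, the same URL can be fetched by hand from inside the Prometheus pod. The following is only a sketch: the helper names node_proxy_url and fetch_node_metrics are made up for illustration, and it assumes curl is available in the container and that the service account token and CA certificate are mounted at the usual path.

```shell
#!/bin/sh
# Sketch: fetch a node's kubelet metrics through the API server proxy,
# mirroring the relabel rules above. Meant to run inside the Prometheus pod.

# Usual mount point for a pod's service account credentials.
SA_DIR=/var/run/secrets/kubernetes.io/serviceaccount

# Build the URL that the relabel rules produce for a given node name.
# (node_proxy_url is a hypothetical helper name.)
node_proxy_url() {
  printf 'https://kubernetes.default.svc:443/api/v1/nodes/%s/proxy/metrics\n' "$1"
}

# Request the metrics, authenticating with the pod's service account token
# and verifying the API server certificate against the cluster CA.
# (fetch_node_metrics is a hypothetical helper name.)
fetch_node_metrics() {
  curl -sS --cacert "$SA_DIR/ca.crt" \
    -H "Authorization: Bearer $(cat "$SA_DIR/token")" \
    "$(node_proxy_url "$1")"
}
```

Calling fetch_node_metrics with a node name as reported by kubectl get nodes sends the request to the API server instead of to the kubelet on port 10250, so the API server's own authentication and authorization decide whether the request is allowed.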

As for RBAC itself, you need to run Prometheus with its own service account that you create:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: default

Make sure to pass that service account into the pod with the following pod spec:

spec:
  serviceAccountName: prometheus  # serviceAccount is a deprecated alias for this field

The Kubernetes manifest that sets up the appropriate RBAC role and binding, giving the prometheus service account access to the required API endpoints, is at https://github.com/prometheus/prometheus/blob/master/documentation/examples/rbac-setup.yml

Copied here for reference:

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: default

Replace the namespace in all manifests with the one Prometheus runs in, then apply the manifest with an account that has cluster-admin rights.

I have not tested this in a cluster without the ABAC fallback, so the RBAC role might still be missing something essential.
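
One way to check what the service account is allowed to do is kubectl auth can-i combined with impersonation. This is a sketch under assumptions: the helper names sa_user and check_prometheus_access are made up for illustration, the caller must itself be permitted to impersonate users, and the --subresource flag requires a reasonably recent kubectl.

```shell
#!/bin/sh
# Sketch: ask the API server whether a service account may perform the
# requests that Prometheus needs.

# A service account authenticates under this user name format.
# (sa_user is a hypothetical helper name.)
sa_user() {
  printf 'system:serviceaccount:%s:%s\n' "$1" "$2"
}

# Usage: check_prometheus_access NAMESPACE NAME
# Each kubectl command prints "yes" or "no".
# (check_prometheus_access is a hypothetical helper name.)
check_prometheus_access() {
  user=$(sa_user "$1" "$2")
  # Node discovery.
  kubectl auth can-i list nodes --as="$user"
  # Scraping through the API server proxy (the nodes/proxy subresource).
  kubectl auth can-i get nodes --subresource=proxy --as="$user"
  # The non-resource /metrics URL.
  kubectl auth can-i get /metrics --as="$user"
}
```

For the manifests above, the call would be check_prometheus_access default prometheus. The answer reflects the decision of all authorizers configured on the API server, so on a cluster that still has the ABAC fallback a "yes" does not by itself show that the RBAC role is sufficient.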

Answer 2

Following the discussion on @JorritSalverda's ticket: https://github.com/prometheus/prometheus/issues/2606#issuecomment-294869099

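Because GKE does not let you obtain the client certificates that would allow you to authenticate against the kubelet, the best solution for GKE users seems to be to use the Kubernetes API server as a proxy for requests to the nodes.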

To do this (quoting @JorritSalverda):

“For my Prometheus server running inside GKE, I now have it running with the following relabeling:

relabel_configs:
- action: labelmap
  regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
  replacement: kubernetes.default.svc.cluster.local:443
- target_label: __scheme__
  replacement: https
- source_labels: [__meta_kubernetes_node_name]
  regex: (.+)
  target_label: __metrics_path__
  replacement: /api/v1/nodes/${1}/proxy/metrics

and the following ClusterRole bound to the service account used by Prometheus:

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]

Because the GKE cluster still has an ABAC fallback in case RBAC fails, I am not 100% sure yet whether this covers all required permissions.”
