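I have a GKE cluster which, for simplicity's sake, runs only Prometheus, monitoring each member node. Recently I upgraded the API server to 1.6 (which introduces RBAC) and had no issues. I then added a new node running a version 1.6 kubelet. Prometheus could not access the metrics API of this new node.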
So I added a ClusterRole, a ClusterRoleBinding and a ServiceAccount to my namespace, and configured the Deployment to use the new ServiceAccount. I then deleted the pod for good measure:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups: [""]
  resources:
  - configmaps
  verbs: ["get"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: default
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: default
secrets:
- name: prometheus-token-xxxxx
---
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  labels:
    app: prometheus-prometheus
    component: server
    release: prometheus
  name: prometheus-server
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: prometheus-prometheus
      component: server
      release: prometheus
  strategy:
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 1
    type: RollingUpdate
  template:
    metadata:
      labels:
        app: prometheus-prometheus
        component: server
        release: prometheus
    spec:
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      serviceAccount: prometheus
      serviceAccountName: prometheus
      ...
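But the situation remains unchanged.
The metrics endpoint returns HTTP/1.1 401 Unauthorized, and when I modify the Deployment to include another container with bash + curl installed and make the request manually, I get: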
# curl -vsSk -H "Authorization: Bearer $(</var/run/secrets/kubernetes.io/serviceaccount/token)" https://$NODE_IP:10250/metrics
* Trying $NODE_IP...
* Connected to $NODE_IP ($NODE_IP) port 10250 (#0)
* found XXX certificates in /etc/ssl/certs/ca-certificates.crt
* found XXX certificates in /etc/ssl/certs
* ALPN, offering http/1.1
* SSL connection using TLS1.2 / ECDHE_RSA_AES_128_GCM_SHA256
* server certificate verification SKIPPED
* server certificate status verification SKIPPED
* common name: node-running-kubelet-1-6@000000000 (does not match '$NODE_IP')
* server certificate expiration date OK
* server certificate activation date OK
* certificate public key: RSA
* certificate version: #3
* subject: CN=node-running-kubelet-1-6@000000000
* start date: Fri, 07 Apr 2017 22:00:00 GMT
* expire date: Sat, 07 Apr 2018 22:00:00 GMT
* issuer: CN=node-running-kubelet-1-6@000000000
* compression: NULL
* ALPN, server accepted to use http/1.1
> GET /metrics HTTP/1.1
> Host: $NODE_IP:10250
> User-Agent: curl/7.47.0
> Accept: */*
> Authorization: Bearer **censored**
>
< HTTP/1.1 401 Unauthorized
< Date: Mon, 10 Apr 2017 20:04:20 GMT
< Content-Length: 12
< Content-Type: text/plain; charset=utf-8
<
* Connection #0 to host $NODE_IP left intact
- Why doesn't that token allow me to access that resource?
- How do you check the access granted to a ServiceAccount?
Answer 1
I ran into the same issue and created ticket https://github.com/prometheus/prometheus/issues/2606 for it, and updated the configuration example via PR https://github.com/prometheus/prometheus/pull/2641.
You can see the updated relabeling for the kubernetes-nodes job at https://github.com/prometheus/prometheus/blob/master/documentation/examples/prometheus-kubernetes.yml#L76-L84
Copied for reference:
relabel_configs:
- action: labelmap
  regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
  replacement: kubernetes.default.svc:443
- source_labels: [__meta_kubernetes_node_name]
  regex: (.+)
  target_label: __metrics_path__
  replacement: /api/v1/nodes/${1}/proxy/metrics
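To make the effect of these three rules concrete, here is a minimal Python sketch of what they do to a single discovered node target. It is only a simplified model of relabeling, not Prometheus's implementation: the function name `relabel_node_target` and the node name and labels used in the example are hypothetical, and the sketch keeps the `__meta_*` labels that Prometheus would drop after relabeling.

```python
import re


def relabel_node_target(labels):
    """Simplified model of the three relabel rules for one node target."""
    out = dict(labels)

    # Rule 1 (labelmap): copy each Kubernetes node label to a plain label name.
    label_re = re.compile(r"__meta_kubernetes_node_label_(.+)")
    for name, value in labels.items():
        match = label_re.fullmatch(name)
        if match:
            out[match.group(1)] = value

    # Rule 2: scrape the API server instead of the kubelet's own address.
    out["__address__"] = "kubernetes.default.svc:443"

    # Rule 3: build the proxy path from the node name (regex "(.+)", ${1}).
    node_name = labels.get("__meta_kubernetes_node_name", "")
    match = re.fullmatch(r"(.+)", node_name)
    if match:
        out["__metrics_path__"] = "/api/v1/nodes/" + match.group(1) + "/proxy/metrics"

    return out


if __name__ == "__main__":
    # Hypothetical target as node service discovery might present it.
    target = relabel_node_target({
        "__address__": "10.0.0.5:10250",
        "__meta_kubernetes_node_name": "node-a",
        "__meta_kubernetes_node_label_zone": "z1",
    })
    print(target["__address__"], target["__metrics_path__"])
```

The point is that every node ends up being scraped at the same address (the API server), and only the metrics path differs per node.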
For the RBAC itself, you need to run Prometheus with its own service account, which you create:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: default
Make sure to pass that service account into the pod with the following pod spec:
spec:
  serviceAccount: prometheus
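Then, the Kubernetes manifests that set up the appropriate RBAC role and binding to give the prometheus service account access to the required API endpoints are at https://github.com/prometheus/prometheus/blob/master/documentation/examples/rbac-setup.yml
Copied for reference: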
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- nonResourceURLs: ["/metrics"]
  verbs: ["get"]
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: prometheus
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: default
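Replace the namespace in all manifests with the one Prometheus runs in, then apply the manifests with an account that has cluster admin rights.
I haven't tested this in a cluster without ABAC fallback, so the RBAC role may still be missing something essential.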
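On why `nodes/proxy` is listed in the ClusterRole: a request to `/api/v1/nodes/<name>/proxy/metrics` is authorized as the `proxy` subresource of `nodes`, and in RBAC a grant on `nodes` does not extend to its subresources. The sketch below is a deliberately simplified model of rule matching (exact string matches plus a bare `*`, no prefix wildcards, no resourceNames), not the real authorizer; the function name `allows` and the two constants are made up for illustration. It compares the rules from the question with the rules above.

```python
def allows(rules, verb, resource=None, api_group="", path=None):
    """Return True if any rule permits the request (simplified RBAC model)."""
    for rule in rules:
        verbs = rule.get("verbs", [])
        if verb not in verbs and "*" not in verbs:
            continue
        if path is not None:
            # Non-resource request such as GET /metrics on the API server.
            if path in rule.get("nonResourceURLs", []):
                return True
            continue
        groups = rule.get("apiGroups", [])
        resources = rule.get("resources", [])
        group_ok = api_group in groups or "*" in groups
        # Subresources are matched as the literal string "resource/subresource".
        resource_ok = resource in resources or "*" in resources
        if group_ok and resource_ok:
            return True
    return False


# Rules from the ClusterRole in the question (no nodes/proxy).
QUESTION_RULES = [
    {"apiGroups": [""], "resources": ["nodes", "services", "endpoints", "pods"],
     "verbs": ["get", "list", "watch"]},
    {"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get"]},
    {"nonResourceURLs": ["/metrics"], "verbs": ["get"]},
]

# Rules from the ClusterRole in this answer (adds nodes/proxy).
ANSWER_RULES = [
    {"apiGroups": [""],
     "resources": ["nodes", "nodes/proxy", "services", "endpoints", "pods"],
     "verbs": ["get", "list", "watch"]},
    {"nonResourceURLs": ["/metrics"], "verbs": ["get"]},
]

if __name__ == "__main__":
    print(allows(QUESTION_RULES, "get", "nodes/proxy"))
    print(allows(ANSWER_RULES, "get", "nodes/proxy"))
```

Under this model the question's role permits `get` on `nodes` but not on `nodes/proxy`, which is the permission the proxied scrape path needs.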
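Answer 2
As per the discussion on @JorritSalverda's ticket; https://github.com/prometheus/prometheus/issues/2606#issuecomment-294869099
Since GKE does not allow you to obtain client certificates that would let you authenticate with the kubelet, the best solution for users on GKE seems to be to use the Kubernetes API server to proxy requests to the nodes.
To do this (quoting @JorritSalverda);
“For my Prometheus server running inside GKE, I now have it running with the following relabeling: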
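For anyone who wants to reproduce the proxied request by hand from inside a pod (the counterpart of the curl test in the question, but aimed at the API server rather than the kubelet), here is a rough Python sketch. It only builds the request; the helper name `build_proxy_request` and the node name in the example are hypothetical, and `kubernetes.default.svc` is assumed to resolve to the API server as it does in the relabeling config.

```python
import urllib.parse
import urllib.request

# Standard in-pod location of the service account token (same path as in
# the curl command from the question).
TOKEN_PATH = "/var/run/secrets/kubernetes.io/serviceaccount/token"
API_SERVER = "https://kubernetes.default.svc:443"


def build_proxy_request(node_name, token):
    """Build a GET request for a node's metrics via the API server proxy."""
    path = "/api/v1/nodes/" + urllib.parse.quote(node_name, safe="") + "/proxy/metrics"
    request = urllib.request.Request(API_SERVER + path)
    # The API server authenticates the service account from this bearer token.
    request.add_header("Authorization", "Bearer " + token)
    return request


if __name__ == "__main__":
    # "node-a" and the token value are placeholders for illustration.
    req = build_proxy_request("node-a", "example-token")
    print(req.full_url)
```

Inside a pod you would read the token from `TOKEN_PATH` and send the request with `urllib.request.urlopen`, passing an SSL context that trusts the cluster CA mounted next to the token (`ca.crt` in the same directory).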
relabel_configs:
- action: labelmap
  regex: __meta_kubernetes_node_label_(.+)
- target_label: __address__
  replacement: kubernetes.default.svc.cluster.local:443
- target_label: __scheme__
  replacement: https
- source_labels: [__meta_kubernetes_node_name]
  regex: (.+)
  target_label: __metrics_path__
  replacement: /api/v1/nodes/${1}/proxy/metrics
And the following ClusterRole bound to the service account used by Prometheus:
apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: prometheus
rules:
- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
Because the GKE cluster still has an ABAC fallback in case RBAC fails, I'm not 100% sure yet that this covers all required permissions.”
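One way to check what a service account is allowed to do is to ask the API server with a SubjectAccessReview. The sketch below only builds the request body in Python; the helper name `access_review` is hypothetical, and it assumes the cluster serves `authorization.k8s.io/v1`. Note that the answer reflects every authorizer the cluster has configured, so with the ABAC fallback still enabled it may report "allowed" even where RBAC alone would not.

```python
import json


def access_review(namespace, service_account, verb, resource, subresource=""):
    """Build a SubjectAccessReview body for a service account (core API group)."""
    return {
        "apiVersion": "authorization.k8s.io/v1",
        "kind": "SubjectAccessReview",
        "spec": {
            # Service accounts authenticate under this username format.
            "user": "system:serviceaccount:%s:%s" % (namespace, service_account),
            "resourceAttributes": {
                "group": "",
                "verb": verb,
                "resource": resource,
                "subresource": subresource,
            },
        },
    }


if __name__ == "__main__":
    # May the prometheus service account in "default" GET nodes/proxy?
    body = access_review("default", "prometheus", "get", "nodes", "proxy")
    print(json.dumps(body, indent=2))
```

POSTing this body to `/apis/authorization.k8s.io/v1/subjectaccessreviews` with sufficiently privileged credentials returns the same object with `status.allowed` filled in.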