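I am trying to troubleshoot a problem on Google Cloud Kubernetes Engine.
In short: when I upload files of 15-20 MB through my PHP application, the nginx ingress controller crashes. Disk I/O climbs rapidly, then CPU follows, and it takes roughly 5-30 minutes before I/O and CPU drop again and everything restarts successfully.
Below are the logs from the nginx-ingress-controller container, with my comments describing what happens at each point:
Here the upload is received by the application (note that nginx logs status 499, which means the client closed the connection before a response was sent):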
INFO 2020-02-14 14:30:55.481 CET 10.102.1.1 - [10.102.1.1] - - [14/Feb/2020:13:30:55 +0000] "POST /api/v1/contracts/38141/file-system/upload HTTP/2.0" 499 0
NGINX then starts producing a large number of log lines like this:
INFO 2020-02-14 14:30:55.819 CET *�I�g�*��\u001AnK67�@?+�(%u052f��O�yqq$+u$,�b�<*�9#\t��\u0003d\u0006+����I�]A�%u0110jv��hAp\"�63�9\u0019Q�{�x|K�\u000BE\u001C��\"-P%u0079�\u001Ed�Tv
After many such lines, there are log entries reporting that the Ingress endpoints are unavailable:
WARN 2020-02-14T13:31:05.505984Z Service "gitlab-managed-apps/ingress-nginx-ingress-default-backend" does not have any active Endpoint
WARN 2020-02-14 14:31:05.526 CET Service "my-app/my-app" does not have any active Endpoint.
WARN 2020-02-14 14:31:05.526 CET Service "my-app/app-staging" does not have any active Endpoint.
...access logs skipped...
WARN 2020-02-14 14:32:34.419 CET failed to renew lease gitlab-managed-apps/ingress-controller-leader-nginx: failed to tryAcquireOrRenew context deadline exceeded
2020-02-14 14:32:42.227 CET attempting to acquire leader lease gitlab-managed-apps/ingress-controller-leader-nginx...
ERROR 2020-02-14 14:32:43.464 CET Failed to update lock: Operation cannot be fulfilled on configmaps "ingress-controller-leader-nginx": the object has been modified; please apply your changes to the latest version and try again
At this point the client uploads another file, which again produces a flood of garbled log lines. After those, the following is logged:
INFO 2020-02-14T13:33:37.525466Z Received SIGTERM, shutting down
INFO 2020-02-14T13:33:55.513100Z Received SIGTERM, shutting down
INFO 2020-02-14T13:33:55.513155Z Shutting down controller queues
INFO 2020-02-14T13:33:55.516017Z updating status of Ingress rules (remove)
ERROR 2020-02-14T13:33:55.570340Z healthcheck error: Get http+unix://nginx-status/healthz: read unix @->/tmp/nginx-status-server.sock: i/o timeout
INFO 2020-02-14T13:33:55.574690Z Shutting down controller queues
INFO 2020-02-14T13:33:55.576049Z updating status of Ingress rules (remove)
ERROR 2020-02-14T13:33:55.610722Z healthcheck error: Get http+unix://nginx-status/healthz: read unix @->/tmp/nginx-status-server.sock: i/o timeout
ERROR 2020-02-14T13:33:55.774881Z healthcheck error: Get http+unix://nginx-status/healthz: read unix @->/tmp/nginx-status-server.sock: i/o timeout
INFO 2020-02-14T13:33:55.776321Z failed to renew lease gitlab-managed-apps/ingress-controller-leader-nginx: failed to tryAcquireOrRenew context deadline exceeded
INFO 2020-02-14T13:33:55.781376Z attempting to acquire leader lease gitlab-managed-apps/ingress-controller-leader-nginx...
INFO 2020-02-14T13:33:56.826124Z successfully acquired lease gitlab-managed-apps/ingress-controller-leader-nginx
INFO 2020-02-14T13:33:56.833827Z new leader elected: ingress-nginx-ingress-controller-756f8d9cbb-86xnh
ERROR 2020-02-14T13:33:56.933107Z queue has been shutdown, failed to enqueue: &ObjectMeta{Name:sync status,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[],Finalizers:[],ClusterName:,ManagedFields:[],}
INFO 2020-02-14T13:33:58.027600Z new leader elected: ingress-nginx-ingress-controller-756f8d9cbb-86xnh
ERROR 2020-02-14T13:33:58.117920Z Failed to update lock: Operation cannot be fulfilled on configmaps "ingress-controller-leader-nginx": the object has been modified; please apply your changes to the latest version and try again
INFO 2020-02-14T13:33:59.709458Z Stopping NGINX process
INFO 2020-02-14T13:33:59.718181Z Stopping NGINX process
ERROR 2020-02-14T13:34:03.010148Z healthcheck error: Get http+unix://nginx-status/is-dynamic-lb-initialized: dial unix /tmp/nginx-status-server.sock: i/o timeout
ERROR 2020-02-14T13:34:12.627155Z healthcheck error: Get http+unix://nginx-status/is-dynamic-lb-initialized: read unix @->/tmp/nginx-status-server.sock: i/o timeout
ERROR 2020-02-14T13:34:12.832624Z healthcheck error: Get http+unix://nginx-status/is-dynamic-lb-initialized: read unix @->/tmp/nginx-status-server.sock: i/o timeout
ERROR 2020-02-14T13:34:13.693853Z healthcheck error: Get http+unix://nginx-status/healthz: read unix @->/tmp/nginx-status-server.sock: i/o timeout
ERROR 2020-02-14T13:34:13.693930Z healthcheck error: Get http+unix://nginx-status/is-dynamic-lb-initialized: read unix @->/tmp/nginx-status-server.sock: i/o timeout
INFO 2020-02-14T13:34:41.620594055Z -------------------------------------------------------------------------------
INFO 2020-02-14T13:34:41.620664183Z NGINX Ingress controller
INFO 2020-02-14T13:34:41.620671154Z Release: 0.25.1
INFO 2020-02-14T13:34:41.620675964Z Build: git-5179893a9
INFO 2020-02-14T13:34:41.620681055Z Repository: https://github.com/kubernetes/ingress-nginx/
INFO 2020-02-14T13:34:41.620686042Z nginx version: openresty/1.15.8.1
INFO 2020-02-14T13:34:41.620691348Z
INFO 2020-02-14T13:34:41.620695778Z -------------------------------------------------------------------------------
INFO 2020-02-14T13:34:41.620701128Z
INFO 2020-02-14T13:34:41.622564Z Watching for Ingress class: nginx
WARN 2020-02-14T13:34:41.622863Z SSL certificate chain completion is disabled (--enable-ssl-chain-completion=false)
INFO 2020-02-14T13:34:41.623360607Z -------------------------------------------------------------------------------
INFO 2020-02-14T13:34:41.623418446Z NGINX Ingress controller
INFO 2020-02-14T13:34:41.623425256Z Release: 0.25.1
INFO 2020-02-14T13:34:41.623426Z Watching for Ingress class: nginx
INFO 2020-02-14T13:34:41.623430244Z Build: git-5179893a9
INFO 2020-02-14T13:34:41.623435128Z Repository: https://github.com/kubernetes/ingress-nginx/
INFO 2020-02-14T13:34:41.623441533Z nginx version: openresty/1.15.8.1
INFO 2020-02-14T13:34:41.623447006Z
INFO 2020-02-14T13:34:41.623451329Z -------------------------------------------------------------------------------
INFO 2020-02-14T13:34:41.623456382Z
WARN 2020-02-14T13:34:41.623731Z SSL certificate chain completion is disabled (--enable-ssl-chain-completion=false)
ERROR 2020-02-14T13:34:41.629507140Z nginx version: openresty/1.15.8.1
WARN 2020-02-14T13:34:41.633116Z Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
INFO 2020-02-14T13:34:41.633644Z Creating API client for https://10.103.0.1:443
ERROR 2020-02-14T13:34:41.640959117Z nginx version: openresty/1.15.8.1
WARN 2020-02-14T13:34:41.642065Z Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work.
INFO 2020-02-14T13:34:41.642376Z Creating API client for https://10.103.0.1:443
INFO 2020-02-14T13:34:41.682018Z Running in Kubernetes cluster version v1.13+ (v1.13.12-gke.25) - git (clean) commit 654de8cac69f1fc5db6f2de0b88d6d027bc15828 - platform linux/amd64
INFO 2020-02-14T13:34:41.700374Z Running in Kubernetes cluster version v1.13+ (v1.13.12-gke.25) - git (clean) commit 654de8cac69f1fc5db6f2de0b88d6d027bc15828 - platform linux/amd64
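As you can see, nginx crashed and restarted (I do not know why).
My questions are:
What could be causing nginx to fail its health checks so that the pod gets killed? Can I configure buffering in nginx-ingress to prevent this? Is it caused by the heavy logging, with the disk failing to keep up? Or is it because nginx buffers the uploaded file and then takes too long to respond to the health checks? How can I avoid it?
These are the nginx-ingress annotations I have already tried; the behavior is the same with or without them: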
nginx.ingress.kubernetes.io/client-body-buffer-size: 5m
nginx.ingress.kubernetes.io/proxy-body-size: 15m
nginx.ingress.kubernetes.io/proxy-buffering: "on"
nginx.org/client-max-body-size: 15m
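For context, here is a minimal sketch of where annotations like these sit in an Ingress manifest. The Ingress name, host, and service port are hypothetical placeholders; only the annotation keys and values, the `nginx` ingress class, and the `my-app/my-app` service name come from this post.

```yaml
# Sketch only: name, host, and servicePort are hypothetical.
apiVersion: extensions/v1beta1   # Ingress API group served by Kubernetes 1.13
kind: Ingress
metadata:
  name: my-app                   # hypothetical
  namespace: my-app              # from the "my-app/my-app" service in the logs
  annotations:
    kubernetes.io/ingress.class: nginx
    nginx.ingress.kubernetes.io/client-body-buffer-size: 5m
    # Sets client_max_body_size; request bodies above 15m would be rejected.
    nginx.ingress.kubernetes.io/proxy-body-size: 15m
    # Annotation values must be strings, so "on" has to stay quoted.
    nginx.ingress.kubernetes.io/proxy-buffering: "on"
    # The nginx.org/ prefix belongs to the separate NGINX Inc. controller;
    # kubernetes/ingress-nginx does not read this key.
    nginx.org/client-max-body-size: 15m
spec:
  rules:
    - host: my-app.example.com   # hypothetical
      http:
        paths:
          - path: /
            backend:
              serviceName: my-app
              servicePort: 80    # hypothetical
```

Note that with `proxy-body-size: 15m`, uploads at the upper end of the 15-20 MB range would be expected to exceed the limit, so that value may need raising regardless of the crash.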
Technologies and versions:
Kubernetes master version 1.13.12-gke.25
Nodes 1.13.11-gke.14
Nginx-ingress-controller 0.25.1
Thank you for any help, because I do not know what else to try.
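Answer
It looks like I have solved the problem. nginx-ingress also includes ModSecurity (a WAF), which had many rules enabled. After disabling ModSecurity, the flood of log lines disappeared, and so far it seems to work fine. I can now upload a 30 MB file 20 times at once with no problems in the logs or disk I/O. If it really keeps working long term without any issues, I will update this answer at the end of this week.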
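The answer does not say how ModSecurity was turned off. One possible way, sketched here under assumptions, is through the controller's ConfigMap. The ConfigMap name below is a guess based on the pod name in the logs; the real one is whatever the controller's `--configmap` flag points at. If the controller is installed and managed by another tool (for example a Helm chart), the setting may need to be changed there instead, or it could be overwritten.

```yaml
# Sketch only: the ConfigMap name is an assumption; check the controller's
# --configmap argument for the actual namespace/name.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-ingress-controller   # assumed name
  namespace: gitlab-managed-apps           # namespace seen in the logs
data:
  # Turn off the ModSecurity module and the OWASP Core Rule Set.
  enable-modsecurity: "false"
  enable-owasp-modsecurity-crs: "false"
```

After the ConfigMap changes, the controller should regenerate its nginx configuration, and the garbled log flood during uploads should stop if ModSecurity was indeed the source.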