At my company, I built a K8s cluster with Terraform and set up a logging system with EFK (Elasticsearch, Fluent-bit, Kibana).
K8s and Elasticsearch run on AWS EKS and OpenSearch Service (ES 7.10), and Fluent-bit is installed separately as a DaemonSet via Helm.
https://artifacthub.io/packages/helm/fluent/fluent-bit
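For context, the chart is installed roughly like this (the release name, namespace, and values file name below are illustrative, not the exact ones from my setup):

helm repo add fluent https://fluent.github.io/helm-charts
helm upgrade --install fluent-bit fluent/fluent-bit \
  --namespace logging --create-namespace \
  -f fluent-bit-values.yaml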
The Fluent-bit config is structured as follows.
config:
  inputs: |
    [INPUT]
        Name tail
        Exclude_Path /var/log/containers/cloudwatch-agent*, /var/log/containers/fluent-bit*, /var/log/containers/aws-node*, /var/log/containers/kube-proxy*, /var/log/containers/aws-load-balancer*, /var/log/containers/cert-manager*, /var/log/containers/coredns*, /var/log/containers/ebs-csi*, /var/log/containers/grafana*, /var/log/containers/karpenter*, /var/log/containers/metrics-server*, /var/log/containers/prometheus*
        Path /var/log/containers/*.log
        Tag kube.*
        Mem_Buf_Limit 50MB
        multiline.parser python-custom
  outputs: |
    [OUTPUT]
        Name es
        Match kube.*
        Host ${es_domain_name}
        Port 443
        TLS On
        AWS_Auth On
        AWS_Region ${region}
        Retry_Limit 6
        Replace_Dots On
        Trace_Error On
  filters: |
    [FILTER]
        Name kubernetes
        Match kube.*
        Merge_Log On
        Merge_Log_Key log_processed
        Keep_Log Off
        K8S-Logging.Parser On
        K8S-Logging.Exclude On
  customParsers: |
    [MULTILINE_PARSER]
        name python-custom
        type regex
        flush_timeout 1000
        rule "start_state" "/stderr F(.*)/" "cont"
        rule "cont" "/stderr F(.*)/" "cont"
Overall it worked well, but one day, after some event, I noticed that every log collected by Fluent-bit pointed to the same timestamp.
Previously, the logs emitted by the Pods were recorded with accurate timestamps, but after the Pods were redeployed, the logs kept being recorded with one fixed @timestamp. Since the logs keep accumulating at that point in the past, the following shows up when viewing them in Kibana.
The logs are shipped to ES in real time, but because they are all indexed as if collected at a fixed moment in the past, normal queries no longer work.
// Unlike the timestamps emitted by the actual application, the @timestamp recorded by Fluent-bit stays the same.
Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:48:55...[2023-10-16 09:48:55]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:48:55...[2023-10-16 09:48:55]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:53:55...[2023-10-16 09:53:55]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:58:03...[2023-10-16 09:58:03]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:58:03...[2023-10-16 09:58:03]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:58:03...[2023-10-16 09:58:03]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:58:55...[2023-10-16 09:58:55]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T09:58:55...[2023-10-16 09:58:55]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T10:03:03...[2023-10-16 10:03:03]...
Oct 16, 2023 @ 16:13:36.142 2023-10-16T10:03:55...[2023-10-16 10:03:55]...
The problem appears whenever the Pods of a service are updated and redeployed. I confirmed that the inotify_fs_add log for the newly created Pod shows up in Fluent-bit, and I could not find any other error logs.
Pod reload
│ backend-api-deployment-5d9fc55496-8mppx ● 1/1 0 Running
│ backend-api-deployment-57d68dd4f5-djn8x ● 1/1 0 Terminating
Fluent-bit logs
│ Fluent Bit v2.1.8
│ * Copyright (C) 2015-2022 The Fluent Bit Authors
│ * Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
│ * https://fluentbit.io
│
│ [2023/10/16 16:05:00] [ info] [fluent bit] version=2.1.8, commit=1d83649441, pid=1
│ [2023/10/16 16:05:00] [ info] [storage] ver=1.4.0, type=memory, sync=normal, checksum=off, max_chunks_up=128
│ [2023/10/16 16:05:00] [ info] [cmetrics] version=0.6.3
│ [2023/10/16 16:05:00] [ info] [ctraces ] version=0.3.1
│ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] initializing
│ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] storage_strategy='memory' (memory only)
│ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] multiline core started
│ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc port=443
│ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] token updated
│ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
│ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
│ [2023/10/16 16:05:00] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
│ [2023/10/16 16:05:00] [ info] [output:es:es.0] worker #0 started
│ [2023/10/16 16:05:00] [ info] [output:es:es.0] worker #1 started
│ [2023/10/16 16:05:00] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
│ [2023/10/16 16:05:00] [ info] [sp] stream processor started
│ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=41916664 watch_fd=1 name=/var/log/containers/backend-api-deployment-57d68dd4f5-djn8x_backend-dev_backend-api-f950104ab9f300586803316ad7beef5c9506db9fceeb4af0dc0575b3b9bacfd7.log
│ [2023/10/16 16:05:00] [ info] [input:tail:tail.0] inotify_fs_add(): inode=17488499 watch_fd=2 name=/var/log/containers/backend-api-deployment-6dc5b97645-tqwxh_backend-staging_backend-api-23f14f74af6cc163f233499ba92a2746cb2f263c9379ba53fc47cd3d76e00bbb.log
...
│ [2023/10/17 04:49:01] [error] [output:es:es.0] HTTP status=429 URI=/_bulk, response:
│ 429 Too Many Requests /_bulk
│
│ [2023/10/17 04:49:01] [ warn] [engine] failed to flush chunk '1-1697518140.870529031.flb', retry in 6 seconds: task_id=1, input=tail.0 > output=es.0 (out_id=0)
│ [2023/10/17 04:49:01] [error] [output:es:es.0] HTTP status=429 URI=/_bulk, response:
│ 429 Too Many Requests /_bulk
│
│ [2023/10/17 04:49:01] [ warn] [engine] failed to flush chunk '1-1697518140.276173730.flb', retry in 7 seconds: task_id=0, input=tail.0 > output=es.0 (out_id=0)
│ [2023/10/17 04:49:07] [ info] [engine] flush chunk '1-1697518140.870529031.flb' succeeded at retry 1: task_id=1, input=tail.0 > output=es.0 (out_id=0)
│ [2023/10/17 04:49:08] [ info] [engine] flush chunk '1-1697518140.276173730.flb' succeeded at retry 1: task_id=0, input=tail.0 > output=es.0 (out_id=0)
│ [2023/10/17 05:08:50] [ info] [input:tail:tail.0] inotify_fs_add(): inode=37547666 watch_fd=8 name=/var/log/containers/backend-api-deployment-5d9fc55496-8mppx_backend-dev_backend-api-839c128d6227f797722496a992ca9149b95c366997c2b1ec28ceb65490582e3b.log
│ [2023/10/17 05:08:52] [ info] [filter:kubernetes:kubernetes.0] token updated
│ [2023/10/17 05:09:25] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=41916664 watch_fd=1
│ [2023/10/17 05:09:37] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=17488499 watch_fd=2
Interestingly, if all the Fluent-bit DaemonSet Pods are killed manually and come back through self-healing, they all temporarily resume collecting normally.
After that, as soon as a service Pod is updated and redeployed again, the same problem reappears.
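For reference, "killing" the DaemonSet here just means deleting its Pods and letting them be recreated, roughly like this (the namespace and the label applied by the Helm chart are from my setup and may differ in yours):

kubectl -n logging delete pods -l app.kubernetes.io/name=fluent-bit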
What is causing this problem, and how can I fix it? If there are any recommended Fluent-bit settings, I'd welcome further suggestions.
Thank you!
Answer 1
I couldn't pin down the exact cause, but I think I found a solution, so I'm sharing it.
When checking the logs in Kibana, the application timestamps vary as expected, but as shown in the attached image, the timestamp recorded by Fluent-bit itself is the problem.
However, given that everything goes back to normal once the Fluent-bit Pods are restarted, it's hard to believe my config is simply wrong.
So instead of relying on the default @timestamp that Fluent-bit records, I decided to parse the time from my application logs, which still carry the correct time, and set the record's time from that value directly.
I declared the following parser: the regex captures the first whitespace-delimited token of the log line (the ISO-8601 timestamp my application prints), and Time_Key tells Fluent-bit to use that field as the record's timestamp.
[PARSER]
    Name log_timestamp
    Format regex
    Regex ^(?<log_time>[^\s]+)
    Time_Key log_time
    Time_Format %Y-%m-%dT%H:%M:%S.%LZ
    Time_Keep On
This parser is applied through a parser filter in the filter pipeline. Reserve_Data On keeps the other fields of the record, and Preserve_Key On keeps the original log field after parsing.
[FILTER]
    Name parser
    Match kube.*
    Key_Name log
    Parser log_timestamp
    Preserve_Key On
    Reserve_Data On
In other words, the full config looks like this.
config:
  inputs: |
    [INPUT]
        Name tail
        Path /var/log/containers/*.log
        Tag kube.*
        Mem_Buf_Limit 50MB
        Refresh_Interval 10
        multiline.parser python-custom
  outputs: |
    [OUTPUT]
        Name es
        Match kube.*
        Host ${es_domain_name}
        Port 443
        TLS On
        AWS_Auth On
        AWS_Region ${region}
        Retry_Limit 6
        Replace_Dots On
        Trace_Error On
  filters: |
    [FILTER]
        Name kubernetes
        Match kube.*
        Merge_Log On
        Merge_Log_Key log_processed
        Keep_Log Off
        K8S-Logging.Parser On
        K8S-Logging.Exclude On
    [FILTER]
        Name parser
        Match kube.*
        Key_Name log
        Parser log_timestamp
        Preserve_Key On
        Reserve_Data On
  customParsers: |
    [MULTILINE_PARSER]
        name python-custom
        type regex
        flush_timeout 1000
        rule "start_state" "/stderr F(.*)/" "cont"
        rule "cont" "/stderr F(.*)/" "cont"
    [PARSER]
        Name log_timestamp
        Format regex
        Regex ^(?<log_time>[^\s]+)
        Time_Key log_time
        Time_Format %Y-%m-%dT%H:%M:%S.%LZ
        Time_Keep On
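To roll this out I simply upgrade the Helm release with the updated values file and let the DaemonSet restart; roughly like this (the release name, namespace, and values file name are illustrative):

helm upgrade fluent-bit fluent/fluent-bit -n logging -f fluent-bit-values.yaml
kubectl -n logging logs daemonset/fluent-bit --tail=50 | grep -iE "parser|error"

After the change, @timestamp follows the time parsed from each log line instead of the collection time, and with Time_Keep On the raw log_time field remains on the record as well, so queries in Kibana line up with the application time again.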