CloudWatch agent CPU usage is too high


We use AWS CloudWatch to monitor CPU usage, the p99 latency of API calls, and so on. The problem is that during peak traffic the Amazon CloudWatch agent itself consumes 25%–35% CPU, so it contributes significantly to the high-CPU alerts being triggered. I have also observed a direct correlation between the p99 latency metric and the CPU usage metric.

  1. Is it normal for a monitoring tool to consume this much of the system's resources?
  2. Is there a way to tune the Amazon CloudWatch agent so that it uses fewer system resources?

I have pasted the Amazon CloudWatch agent's configuration file below:

[agent]
  collection_jitter = "0s"
  debug = false
  flush_interval = "1s"
  flush_jitter = "0s"
  hostname = ""
  interval = "60s"
  logfile = "/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log"
  logtarget = "lumberjack"
  metric_batch_size = 1000
  metric_buffer_limit = 10000
  omit_hostname = false
  precision = ""
  quiet = false
  round_interval = false

[inputs]

  [[inputs.cpu]]
    fieldpass = ["usage_active"]
    interval = "10s"
    percpu = true
    report_active = true
    totalcpu = false
    [inputs.cpu.tags]
      "aws:StorageResolution" = "true"
      metricPath = "metrics"

  [[inputs.disk]]
    fieldpass = ["total", "used"]
    interval = "60s"
    mount_points = ["/", "/tmp"]
    tagexclude = ["mode"]
    [inputs.disk.tags]
      metricPath = "metrics"

  [[inputs.logfile]]
    destination = "cloudwatchlogs"
    file_state_folder = "/opt/aws/amazon-cloudwatch-agent/logs/state"

    [[inputs.logfile.file_config]]
      file_path = "/home/ubuntu/access-logs-app2/app.log.*"
      from_beginning = true
      log_group_name = "access-logs-app2"
      log_stream_name = "access-logs-app2"
      pipe = false

    [[inputs.logfile.file_config]]
      file_path = "/home/ubuntu/webhooks-logs-app2/webhook.log.*"
      from_beginning = true
      log_group_name = "webhooks-logs-app2"
      log_stream_name = "webhooks-logs-app2"
      pipe = false

    [[inputs.logfile.file_config]]
      file_path = "/home/ubuntu/access-logs-app/app.log.*"
      from_beginning = true
      log_group_name = "access-logs-app"
      log_stream_name = "access-logs-app"
      pipe = false

    [[inputs.logfile.file_config]]
      file_path = "/home/ubuntu/webhooks-logs-app/webhook.log.*"
      from_beginning = true
      log_group_name = "webhooks-logs-app"
      log_stream_name = "webhooks-logs-app"
      pipe = false

    [[inputs.logfile.file_config]]
      file_path = "/home/ubuntu/query-logs/**"
      from_beginning = true
      log_group_name = "db-query-logs"
      log_stream_name = "db-query-logs"
      pipe = false

    [[inputs.logfile.file_config]]
      file_path = "/var/log/nginx/some_name.*"
      from_beginning = true
      log_group_name = "some_name-nginx"
      log_stream_name = "some_name-nginx"
      pipe = false
    [inputs.logfile.tags]
      metricPath = "logs"

  [[inputs.mem]]
    fieldpass = ["used", "cached", "total"]
    interval = "60s"
    [inputs.mem.tags]
      metricPath = "metrics"

[outputs]

  [[outputs.cloudwatch]]
    force_flush_interval = "60s"
    namespace = "CWAgent"
    profile = "www-data"
    region = "ap-south-1"
    shared_credential_file = "/var/.aws/credentials"
    tagexclude = ["metricPath"]
    [outputs.cloudwatch.tagpass]
      metricPath = ["metrics"]

  [[outputs.cloudwatchlogs]]
    force_flush_interval = "5s"
    log_stream_name = "production"
    profile = "www-data"
    region = "ap-south-1"
    shared_credential_file = "/var/.aws/credentials"
    tagexclude = ["metricPath"]
    [outputs.cloudwatchlogs.tagpass]
      metricPath = ["logs"]
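
Two things in this configuration stand out as likely CPU drivers: the `[[inputs.cpu]]` section collects per-core metrics every 10 seconds with the `"aws:StorageResolution"` tag set (high-resolution metrics), and the `[[inputs.logfile]]` sections tail several wildcard paths (`app.log.*`, `/home/ubuntu/query-logs/**`) with `from_beginning = true`. A common tuning is to lengthen the CPU collection interval and drop per-core reporting. The fragment below is an illustrative sketch only, shown against the generated TOML; in practice you would make the equivalent change in the agent's JSON configuration and regenerate, and the actual savings depend on the workload:

```toml
  # Illustrative tuning (assumption, not measured on this workload):
  # collect an aggregate CPU metric once a minute at standard resolution
  # instead of one series per core every 10 s at high resolution.
  [[inputs.cpu]]
    fieldpass = ["usage_active"]
    interval = "60s"      # was "10s"
    percpu = false        # was true: one series instead of one per core
    report_active = true
    totalcpu = true       # was false
    [inputs.cpu.tags]
      metricPath = "metrics"   # "aws:StorageResolution" tag removed
```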

Answer 1

I had the same problem, and you have answered it for me. I run a mail server, a DNS server, and a web server (the front end for a separate RDS database instance). I used to run all of this effortlessly on a t2.nano instance (not a CPU powerhouse!), with the CPU credit balance pegged at 72 and no drift at all.

Then I added the following four lines to a cron job that ran once a minute (each line with a different metric name):

aws cloudwatch ... --value $(($(df --output=avail /     | tail -1)*1024))
aws cloudwatch ... --value $(($(df --output=avail /home | tail -1)*1024))
aws cloudwatch ... --value $(free -b | sed -r  's:Mem([^0-9]*([0-9]*)){6}.*:\2:p;d')
aws cloudwatch ... --value $(free -b | sed -r 's:Swap([^0-9]*([0-9]*)){2}.*:\2:p;d')
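
To unpack what the four fragments compute: `df --output=avail` prints free space in 1K blocks, so the `*1024` converts it to bytes, and the two `sed` expressions pull the 6th number from the `Mem` line of `free -b` (the "available" column) and the 2nd number from the `Swap` line (swap used). A small self-contained check against captured `free -b` output (the sample numbers are invented for illustration; on a live host you would pipe `free -b` directly):

```shell
#!/bin/sh
# Stand-in for real `free -b` output (numbers invented for illustration).
sample='              total        used        free      shared  buff/cache   available
Mem:     1031016448   401203200    98566144     2715648   531247104   459776000
Swap:     536870912    12345678   524525234'

# 6th number on the Mem line = the "available" column, in bytes
mem_avail=$(printf '%s\n' "$sample" | sed -r 's:Mem([^0-9]*([0-9]*)){6}.*:\2:p;d')

# 2nd number on the Swap line = swap "used", in bytes
swap_used=$(printf '%s\n' "$sample" | sed -r 's:Swap([^0-9]*([0-9]*)){2}.*:\2:p;d')

echo "mem_avail=$mem_avail swap_used=$swap_used"
```

The `p;d` idiom prints a line only when the substitution succeeded and suppresses everything else, so each pipeline emits exactly one number.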

This caused my CPU credit balance to drain steadily, so I changed the cron interval to five minutes, which stabilized the balance with no noticeable further decrease or increase. That is ridiculous!

What was the eventual solution? I decided it was time to upgrade to a t3.nano instance (two vCPUs instead of one), and I did. Now, with the replacement cron job (see below) running every minute, the instance accumulates CPU credits at a rate of 5 per hour. Doing the arithmetic against the first cron job, which also ran every minute, that works out to roughly 0.4 CPU credits per hour per aws cloudwatch statement.

It turns out you can combine several metrics into a single aws cloudwatch statement, which executes in about the same time as one of the statements above:

{ cat <<EOF
[
 {
  "MetricName": "EC2 root",
  "Dimensions": [ { "Name": "Instance", "Value": "i-instance-id" } ],
  "Value":      $(($(df --output=avail /     | tail -1)*1024)),
  "Unit":       "Bytes"
 },
 {
  "MetricName": "EC2 home",
  "Dimensions": [ { "Name": "Instance", "Value": "i-instance-id" } ],
  "Value":      $(($(df --output=avail /home | tail -1)*1024)),
  "Unit":       "Bytes"
 },
 {
  "MetricName": "EC2 free",
  "Dimensions": [ { "Name": "Instance", "Value": "i-instance-id" } ],
  "Value":      $(free -b | sed -r  's:Mem([^0-9]*([0-9]*)){6}.*:\2:p;d'),
  "Unit":       "Bytes"
 },
 {
  "MetricName": "EC2 swap",
  "Dimensions": [ { "Name": "Instance", "Value": "i-instance-id" } ],
  "Value":      $(free -b | sed -r 's:Swap([^0-9]*([0-9]*)){2}.*:\2:p;d'),
  "Unit":       "Bytes"
 }
]
EOF
} | aws cloudwatch put-metric-data --namespace MySpace --metric-data file:///dev/stdin

[Note that the use of "heredoc" syntax allows expressions to be evaluated inside the "text" file.]
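
The bracketed note comes down to one shell rule: because the heredoc delimiter is unquoted (`EOF` rather than `'EOF'`), the shell performs `$(...)` command substitution and `$((...))` arithmetic expansion inside the body before the text ever reaches aws. A two-line demonstration (the values are made up):

```shell
#!/bin/sh
# Unquoted delimiter => expansions are evaluated inside the heredoc body.
json=$(cat <<EOF
{ "Value": $((6 * 7)), "Host": "$(echo demo-host)" }
EOF
)
echo "$json"
```

With a quoted delimiter (`<<'EOF'`) the body would be passed through literally, `$((6 * 7))` and all.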

Who knows what the CloudWatch agent is doing. I came here to see whether running the CloudWatch agent would be more efficient than issuing individual aws cloudwatch statements. Apparently not.
