We are using AWS CloudWatch to monitor CPU usage, p99 latency of API calls, and so on. The problem is that during peak traffic, the Amazon CloudWatch agent itself consumes 25%–35% CPU, so it is a major contributor to the very high-CPU condition it alerts on. I have also observed a direct correlation between the p99 latency metric and the CPU usage metric.
- Is it normal for a monitoring tool to consume this much of the system's resources?
- Is there a way to tune the Amazon CloudWatch agent so that it uses fewer system resources?
I have pasted the Amazon CloudWatch agent's configuration file below:
[agent]
collection_jitter = "0s"
debug = false
flush_interval = "1s"
flush_jitter = "0s"
hostname = ""
interval = "60s"
logfile = "/opt/aws/amazon-cloudwatch-agent/logs/amazon-cloudwatch-agent.log"
logtarget = "lumberjack"
metric_batch_size = 1000
metric_buffer_limit = 10000
omit_hostname = false
precision = ""
quiet = false
round_interval = false
[inputs]
[[inputs.cpu]]
fieldpass = ["usage_active"]
interval = "10s"
percpu = true
report_active = true
totalcpu = false
[inputs.cpu.tags]
"aws:StorageResolution" = "true"
metricPath = "metrics"
[[inputs.disk]]
fieldpass = ["total", "used"]
interval = "60s"
mount_points = ["/", "/tmp"]
tagexclude = ["mode"]
[inputs.disk.tags]
metricPath = "metrics"
[[inputs.logfile]]
destination = "cloudwatchlogs"
file_state_folder = "/opt/aws/amazon-cloudwatch-agent/logs/state"
[[inputs.logfile.file_config]]
file_path = "/home/ubuntu/access-logs-app2/app.log.*"
from_beginning = true
log_group_name = "access-logs-app2"
log_stream_name = "access-logs-app2"
pipe = false
[[inputs.logfile.file_config]]
file_path = "/home/ubuntu/webhooks-logs-app2/webhook.log.*"
from_beginning = true
log_group_name = "webhooks-logs-app2"
log_stream_name = "webhooks-logs-app2"
pipe = false
[[inputs.logfile.file_config]]
file_path = "/home/ubuntu/access-logs-app/app.log.*"
from_beginning = true
log_group_name = "access-logs-app"
log_stream_name = "access-logs-app"
pipe = false
[[inputs.logfile.file_config]]
file_path = "/home/ubuntu/webhooks-logs-app/webhook.log.*"
from_beginning = true
log_group_name = "webhooks-logs-app"
log_stream_name = "webhooks-logs-app"
pipe = false
[[inputs.logfile.file_config]]
file_path = "/home/ubuntu/query-logs/**"
from_beginning = true
log_group_name = "db-query-logs"
log_stream_name = "db-query-logs"
pipe = false
[[inputs.logfile.file_config]]
file_path = "/var/log/nginx/some_name.*"
from_beginning = true
log_group_name = "some_name-nginx"
log_stream_name = "some_name-nginx"
pipe = false
[inputs.logfile.tags]
metricPath = "logs"
[[inputs.mem]]
fieldpass = ["used", "cached", "total"]
interval = "60s"
[inputs.mem.tags]
metricPath = "metrics"
[outputs]
[[outputs.cloudwatch]]
force_flush_interval = "60s"
namespace = "CWAgent"
profile = "www-data"
region = "ap-south-1"
shared_credential_file = "/var/.aws/credentials"
tagexclude = ["metricPath"]
[outputs.cloudwatch.tagpass]
metricPath = ["metrics"]
[[outputs.cloudwatchlogs]]
force_flush_interval = "5s"
log_stream_name = "production"
profile = "www-data"
region = "ap-south-1"
shared_credential_file = "/var/.aws/credentials"
tagexclude = ["metricPath"]
[outputs.cloudwatchlogs.tagpass]
metricPath = ["logs"]
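For context, these are the settings I suspect matter most for the agent's own CPU use: per-core CPU metrics every 10 s tagged with "aws:StorageResolution" (high resolution), six log-file globs tailed with from_beginning = true, and a 1 s flush_interval. A lower-overhead variant I am considering looks like this (just a sketch, untested; the keys are the same ones as in the config above, and the actual savings will depend on the workload):

```toml
# Sketch of lower-overhead settings (untested).
[agent]
  flush_interval = "10s"   # was "1s"; fewer flush cycles per minute
  interval = "60s"

[[inputs.cpu]]
  interval = "60s"         # was "10s"
  percpu = false           # one aggregate series instead of one per core
  totalcpu = true
  # dropping the "aws:StorageResolution" tag avoids high-resolution uploads

[[inputs.logfile.file_config]]
  # from_beginning = false avoids re-reading whole rotated files on restart
  from_beginning = false
```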
Answer 1
I had the same problem, and you have answered it for me. I run a mail server, a DNS server, and a web server (the front end for a separate RDS database instance). I used to run all of this on a t2.nano instance (not a CPU powerhouse!) without breaking a sweat (the CPU credit balance stayed pinned at 72 with no drift).
Then I added the following four lines to a cron job that ran once per minute (each line with a different metric name):
aws cloudwatch ... --value $(($(df --output=avail / | tail -1)*1024))
aws cloudwatch ... --value $(($(df --output=avail /home | tail -1)*1024))
aws cloudwatch ... --value $(free -b | sed -r 's:Mem([^0-9]*([0-9]*)){6}.*:\2:p;d')
aws cloudwatch ... --value $(free -b | sed -r 's:Swap([^0-9]*([0-9]*)){2}.*:\2:p;d')
This caused my CPU credit balance to decline steadily, so I changed the cron interval to five minutes, which stabilized the balance with no further noticeable decrease or increase. That is ridiculous!
So what was the final solution? I decided it was time to upgrade to a t3.nano instance (two vCPUs instead of one), and I did. Now, with the replacement cron job (see below) running once per minute, the instance accrues CPU credits at 5 per hour. Doing the math against the first cron job, which also ran every minute, each aws cloudwatch statement works out to roughly 0.4 CPU credits per hour.
It turns out you can combine multiple metrics into a single aws cloudwatch statement, which executes in about the same time as just one of the statements above:
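In crontab terms, the only thing that changed was the schedule field (the script name here is hypothetical; its body was the four aws cloudwatch lines above):

```shell
# before: every minute -- CPU credit balance declined steadily
# * * * * *   /usr/local/bin/push-metrics.sh
# after: every five minutes -- balance stabilized
*/5 * * * *   /usr/local/bin/push-metrics.sh
```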
{ cat <<EOF
[
{
"MetricName": "EC2 root",
"Dimensions": [ { "Name": "Instance", "Value": "i-instance-id" } ],
"Value": $(($(df --output=avail / | tail -1)*1024)),
"Unit": "Bytes"
},
{
"MetricName": "EC2 home",
"Dimensions": [ { "Name": "Instance", "Value": "i-instance-id" } ],
"Value": $(($(df --output=avail /home | tail -1)*1024)),
"Unit": "Bytes"
},
{
"MetricName": "EC2 free",
"Dimensions": [ { "Name": "Instance", "Value": "i-instance-id" } ],
"Value": $(free -b | sed -r 's:Mem([^0-9]*([0-9]*)){6}.*:\2:p;d'),
"Unit": "Bytes"
},
{
"MetricName": "EC2 swap",
"Dimensions": [ { "Name": "Instance", "Value": "i-instance-id" } ],
"Value": $(free -b | sed -r 's:Swap([^0-9]*([0-9]*)){2}.*:\2:p;d'),
"Unit": "Bytes"
}
]
EOF
} | aws cloudwatch put-metric-data --namespace MySpace --metric-data file:///dev/stdin
[Note that the heredoc syntax allows expressions to be evaluated inside the "text" file.]
Who knows what the CloudWatch agent is doing. I came here to see whether running the CloudWatch agent would be more efficient than issuing separate aws cloudwatch statements. Apparently not.
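To illustrate that note with a standalone sketch (not part of the metric script above): because the heredoc delimiter is unquoted, the shell expands command substitutions inside the body, so the "file" piped to the AWS CLI contains computed values rather than literal $(...) text:

```shell
#!/bin/sh
# Unquoted EOF => $(( ... )) and $( ... ) inside the heredoc are
# evaluated before the text is written to stdout.
cat <<EOF
{ "Value": $((2 * 1024)), "Unit": "Bytes" }
EOF
```

Quoting the delimiter (<<'EOF') would suppress this expansion and emit the text verbatim.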