We received this Netdata alert:
system.softnet_stat: number of times ksoftirq ran out of sysctl net.core.netdev_budget or net.core.netdev_budget_usecs, with work remaining, over the last 10 minutes (this can be a cause of dropped packets)
I have been looking for information on how to fix this. Everyone suggests increasing netdev_budget and/or netdev_budget_usecs, but sources disagree on what the limits should be. Some suggest raising netdev_budget to around 30K events, others to as few as 600. Everything in our /etc/sysctl.conf is commented out, so I assume all values are at their defaults?
We average 10K-20K events per day. In the system.softnet_stat chart we can see squeeze events occurring even when only 2K events are being processed.
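For reference, here is how we inspect the raw counters behind that chart (a sketch assuming GNU awk, for strtonum; each row of /proc/net/softnet_stat is one CPU, the columns are hexadecimal counters since boot, and the third column is the time_squeeze count this alert is based on):

# Per-CPU softnet counters: packets processed, drops, and budget
# "squeezes" (third column); all values are hex counters since boot
awk '{ printf "CPU%-3d processed=%d dropped=%d squeezed=%d\n",
       NR-1, strtonum("0x" $1), strtonum("0x" $2), strtonum("0x" $3) }' \
    /proc/net/softnet_stat

# Current limits (defaults on recent kernels: 300 packets / 2000 usecs)
sysctl net.core.netdev_budget net.core.netdev_budget_usecs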
In short: how do we calculate what values to give netdev_budget and/or netdev_budget_usecs?
Answer 1
There is no one-size-fits-all answer to this. In general, you should keep setting higher values in sysctl.conf until you find ones that work; it is also possible that the machine is receiving more packets than it can process, in which case no value will work. Based on https://github.com/netdata/netdata/issues/1076 and https://nateware.com/2013/04/06/linux-network-tuning-for-2013/, here is an example configuration that users have reported to work:
# /etc/sysctl.d/99-network-tuning.conf
# http://www.nateware.com/linux-network-tuning-for-2013.html
# Increase Linux autotuning TCP buffer limits
# Set max to 16MB for 1GE and 32M (33554432) or 54M (56623104) for 10GE
# Don't set tcp_mem itself! Let the kernel scale it based on RAM.
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 40960
# cloudflare uses this for balancing latency and throughput
# https://blog.cloudflare.com/the-story-of-one-latency-spike/
## net.ipv4.tcp_rmem = 4096 1048576 2097152
net.ipv4.tcp_rmem = 4096 5242880 33554432
net.ipv4.tcp_wmem = 4096 65536 16777216
# Also increase the max packet backlog
net.core.netdev_max_backlog = 100000
## net.core.netdev_budget = 50000
net.core.netdev_budget = 60000
net.core.netdev_budget_usecs = 6000
# Make room for more TIME_WAIT sockets due to more clients,
# and allow them to be reused if we run out of sockets
net.ipv4.tcp_max_syn_backlog = 30000
net.ipv4.tcp_max_tw_buckets = 2000000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 10
# Disable TCP slow start on idle connections
net.ipv4.tcp_slow_start_after_idle = 0
# If your servers talk UDP, also up these limits
net.ipv4.udp_rmem_min = 8192
net.ipv4.udp_wmem_min = 8192
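After dropping a file like the above into /etc/sysctl.d/, a sketch of how to apply and verify it (assuming a procps sysctl that supports --system):

sudo sysctl --system            # reload all /etc/sysctl.d fragments
sysctl net.core.netdev_budget net.core.netdev_budget_usecs

# Watch the third column (time_squeeze); if it keeps climbing even
# with the larger budget, you are likely in the "more packets than
# the machine can process" case rather than a tuning problem
watch -n1 cat /proc/net/softnet_stat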