rsyslog sends logs to the destination in batches after a restart

2024-6-2

I have about 70 servers sending logs to Papertrail using rsyslog.

On September 20, Papertrail had an issue, and most of our servers logged the following messages:

Sep 20 11:42:30 server-name rsyslogd[7400]: unexpected GnuTLS error -53 - this could be caused by a broken connection. GnuTLS reports: Error in the push function.   [v8.32.0 try http://www.rsyslog.com/e/2078 ]
Sep 20 11:42:30 server-name rsyslogd[7400]: omfwd: TCPSendBuf error -2078, destruct TCP Connection to logs.papertrailapp.com:xxxxx [v8.32.0 try http://www.rsyslog.com/e/2078 ]
Sep 20 11:42:30 server-name rsyslogd[7400]: action 'action 7' suspended (module 'builtin:omfwd'), retry 0. There should be messages before this one giving the reason for suspension. [v8.32.0 try http://www.rsyslog.com/e/2007 ]
Sep 20 11:42:43 server-name rsyslogd[7400]: action 'action 7' resumed (module 'builtin:omfwd') [v8.32.0 try http://www.rsyslog.com/e/2359 ]

However, 3 of those servers never logged the final line: action 'action 7' resumed (module 'builtin:omfwd').
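To spot the affected servers, the log can be scanned for actions whose last status message is "suspended". Below is a minimal Python sketch, assuming the messages keep the format shown above. The function name unresumed_actions and the sample lines are illustrative only and not part of our setup.

```python
import re

# Matches rsyslog status messages such as:
#   action 'action 7' suspended (module 'builtin:omfwd'), retry 0. ...
#   action 'action 7' resumed (module 'builtin:omfwd') ...
STATUS_RE = re.compile(r"action '(?P<name>[^']+)' (?P<state>suspended|resumed)")


def unresumed_actions(lines):
    """Return the sorted names of actions whose last reported state is 'suspended'."""
    last_state = {}
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            # Later lines overwrite earlier ones, so only the final state is kept.
            last_state[match.group("name")] = match.group("state")
    return sorted(name for name, state in last_state.items() if state == "suspended")


# Hypothetical sample: a server that logged the suspension but never the resume.
sample = [
    "Sep 20 11:42:30 server-name rsyslogd[7400]: action 'action 7' suspended "
    "(module 'builtin:omfwd'), retry 0.",
]
print(unresumed_actions(sample))
```

Running this against the syslog file of each server would list the actions still marked as suspended. It only reads the log text, so it says nothing about why the action did not resume.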

Since then, those servers have been sending delayed logs to Papertrail in batches, as we can see on the velocity graph.

Two of them send batches of about 750 lines, and the third sends batches of about 1500 lines.

All of our servers are deployed with Ansible and share the same configuration. Most of the rsyslog configuration is default, except for this part:

$ActionResumeInterval 10
$ActionQueueSize 100000
$ActionQueueDiscardMark 97500
$ActionQueueHighWaterMark 80000
$ActionQueueType LinkedList
$ActionQueueFileName papertrailqueue
$ActionQueueCheckpointInterval 100
$ActionQueueMaxDiskSpace 2g
$ActionResumeRetryCount -1
$ActionQueueSaveOnShutdown on
$ActionQueueTimeoutEnqueue 2
$ActionQueueDiscardSeverity 0

Restarting the rsyslog service fixes the problem, but I would like to prevent it from happening in the first place. Has anyone run into this?

Thanks!
