每天日志更新时,系统都无法找到空间,尽管设备上有足够的空间;init.d 脚本可能存在错误

每天日志更新时,系统都无法找到空间,尽管设备上有足够的空间;init.d 脚本可能存在错误

我仍然对这个特定的错误感到困惑,如果能得到任何帮助或正确的推动我将非常感激。

TL;DR:许多init.d生成的脚本实例似乎正在运行,占用了服务器上的所有空间并干扰了基本功能。不确定如何纠正或进一步诊断。

完整背景:

因此,基本上,当我在一段时间后无法ssh进入远程服务器时,错误就变得明显了。基本上,密码将不再起作用。我四处查看,发现很可能是因为系统空间不足,没有空间检查密码。(我会注意到,使用 ssh-agent 并生成密钥可以以某种方式用于登录;只是不是普通的 ssh。)

syslog发现每天之后00:00:00都会打印以下内容:

Jun 21 00:00:18 SERVER-NAME-PLACEHOLDER systemd[1]: logrotate.service: Succeeded.
Jun 21 00:00:18 SERVER-NAME-PLACEHOLDER systemd[1]: Finished Rotate log files.
Jun 21 00:05:01 SERVER-NAME-PLACEHOLDER snapd[545]: autorefresh.go:528: Cannot prepare auto-refresh change due to a permanent network error: persistent network error: Post https://api.snapcraft.io/v2/snaps/refresh: dial tcp: lookup api.snapcraft.io: Temporary failure in name resolution
Jun 21 00:05:01 SERVER-NAME-PLACEHOLDER snapd[545]: stateengine.go:149: state ensure error: persistent network error: Post https://api.snapcraft.io/v2/snaps/refresh: dial tcp: lookup api.snapcraft.io: Temporary failure in name resolution
Jun 21 00:09:01 SERVER-NAME-PLACEHOLDER CRON[169859]: (root) CMD (  [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system ]; then /usr/lib/php/sessionclean; fi)
Jun 21 00:09:02 SERVER-NAME-PLACEHOLDER systemd[1]: Starting Clean php session files...
Jun 21 00:09:02 SERVER-NAME-PLACEHOLDER kernel: [467998.141843] systemd-journald[170]: Failed to save stream data /run/systemd/journal/streams/8:4052447: No space left on device
Jun 21 00:09:02 SERVER-NAME-PLACEHOLDER systemd[1]: phpsessionclean.service: Succeeded.
Jun 21 00:09:02 SERVER-NAME-PLACEHOLDER systemd[1]: Finished Clean php session files.
Jun 21 00:09:03 SERVER-NAME-PLACEHOLDER systemd-networkd[489]: Failed to save network state to /run/systemd/netif/state: No space left on device
Jun 21 00:09:03 SERVER-NAME-PLACEHOLDER systemd-timesyncd[427]: Network configuration changed, trying to establish connection.
Jun 21 00:09:03 SERVER-NAME-PLACEHOLDER systemd-resolved[492]: Failed to write private resolv.conf contents: No space left on device
Jun 21 00:09:14 SERVER-NAME-PLACEHOLDER systemd-timesyncd[427]: Timed out waiting for reply from 10.5.0.5:123 (10.5.0.5).
Jun 21 00:17:01 SERVER-NAME-PLACEHOLDER CRON[169924]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jun 21 00:30:01 SERVER-NAME-PLACEHOLDER snapd[545]: autorefresh.go:528: Cannot prepare auto-refresh change due to a permanent network error: persistent network error: Post https://api.snapcraft.io/v2/snaps/refresh: dial tcp: lookup api.snapcraft.io: Temporary failure in name resolution
Jun 21 00:30:01 SERVER-NAME-PLACEHOLDER snapd[545]: stateengine.go:149: state ensure error: persistent network error: Post https://api.snapcraft.io/v2/snaps/refresh: dial tcp: lookup api.snapcraft.io: Temporary failure in name resolution
Jun 21 00:39:01 SERVER-NAME-PLACEHOLDER CRON[169961]: (root) CMD (  [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system ]; then /usr/lib/php/sessionclean; fi)
Jun 21 00:39:04 SERVER-NAME-PLACEHOLDER systemd-networkd[489]: Failed to save network state to /run/systemd/netif/state: No space left on device
Jun 21 00:39:04 SERVER-NAME-PLACEHOLDER systemd-resolved[492]: Failed to write private resolv.conf contents: No space left on device
Jun 21 00:39:04 SERVER-NAME-PLACEHOLDER systemd-timesyncd[427]: Network configuration changed, trying to establish connection.
Jun 21 00:39:04 SERVER-NAME-PLACEHOLDER systemd[1]: Starting Clean php session files...
Jun 21 00:39:04 SERVER-NAME-PLACEHOLDER kernel: [469799.368812] systemd-journald[170]: Failed to save stream data /run/systemd/journal/streams/8:4064685: No space left on device
Jun 21 00:39:04 SERVER-NAME-PLACEHOLDER systemd[1]: phpsessionclean.service: Succeeded.
Jun 21 00:39:04 SERVER-NAME-PLACEHOLDER systemd[1]: Finished Clean php session files.

这似乎表明这是一个空间问题。有趣的是,在服务器重置后,一切都开始正常工作,并且完全正常。

因此我开始调查典型的罪魁祸首:

  1. df -i /
Filesystem       Inodes  IUsed    IFree IUse% Mounted on
/dev/root      19200000 779105 18420895    5% /
  1. df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/root       146G   38G  108G  27% /
devtmpfs        3.8G     0  3.8G   0% /dev
tmpfs           3.8G     0  3.8G   0% /dev/shm
tmpfs           766M  766M     0 100% /run
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           3.8G     0  3.8G   0% /sys/fs/cgroup
/dev/loop0       27M   27M     0 100% /snap/amazon-ssm-agent/5163
/dev/loop1       56M   56M     0 100% /snap/core18/2344
/dev/loop2       62M   62M     0 100% /snap/core20/1494
/dev/loop3       26M   26M     0 100% /snap/amazon-ssm-agent/5656
/dev/loop4       71M   71M     0 100% /snap/lxd/19647
/dev/loop5       45M   45M     0 100% /snap/snapd/15904
/dev/loop6       47M   47M     0 100% /snap/snapd/16010
/dev/loop7       56M   56M     0 100% /snap/core18/2409
/dev/loop8       62M   62M     0 100% /snap/core20/1518
/dev/loop9       68M   68M     0 100% /snap/lxd/22753
tmpfs           766M     0  766M   0% /run/user/0
tmpfs           766M   20K  766M   1% /run/user/126
  1. du -sh /
...
35G     /
  1. lsof / | grep deleted

[什么也没发生]

我还检查过htop

在此处输入图片描述

这似乎表明 Blazegraph 的多个用途正在运行并占用空间:

在此处输入图片描述 在此处输入图片描述

我对此有点困惑...所以我已将 Blazegraph(及其更新程序)设置init.d为文件夹中的服务,但它们应该只启动一次,对吗?我想不出它为什么会如此频繁地出现。

我的init.d样子是这样的:

#!/bin/sh
### BEGIN INIT INFO
# Provides:          blazegraph
# Required-Start:    $local_fs $network $named $time $syslog
# Required-Stop:     $local_fs $network $named $time $syslog
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: blazegraph jetty server
# Description:       Start blazegraph on a jetty server
### END INIT INFO

SCRIPT='sudo BLAZEGRAPH_OPTS="-DwikibaseConceptUri=https://placeholder.com" bash /var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT/runBlazegraph.sh'

PIDFILE=/var/run/blazegraph.pid

start() {
  if [ -f /var/run/$PIDNAME ] && kill -0 $(cat /var/run/$PIDNAME); then
    echo 'Service already running' >&2
    return 1
  fi
  echo 'Starting service…' >&2
  local CMD="$SCRIPT"
  su -c "$CMD" > "$PIDFILE"
  echo 'Service started' >&2
}

stop() {
  if [ ! -f "$PIDFILE" ] || ! kill -0 $(cat "$PIDFILE"); then
    echo 'Service not running' >&2
    return 1
  fi
  echo 'Stopping service…' >&2
  kill -15 $(cat "$PIDFILE") && rm -f "$PIDFILE"
  echo 'Service stopped' >&2
}

uninstall() {
  echo -n "Are you really sure you want to uninstall this service? That cannot be undone. [yes|No] "
  local SURE
  read SURE
  if [ "$SURE" = "yes" ]; then
    stop
    rm -f "$PIDFILE"
    #echo "Notice: log file is not be removed: '$LOGFILE'" >&2
    update-rc.d -f <NAME> remove
    rm -fv "$0"
  fi
}

case "$1" in
  start)
    start
    ;;
  stop)
    stop
    ;;
  uninstall)
    uninstall
    ;;
  retart)
    stop
    start
    ;;
  *)
    echo "Usage: $0 {start|stop|restart|uninstall}"
esac

此外,应用程序的一切似乎都正常工作?Blazegraph 可以更新并可查询;前端的更改显示正常。即使没有空间,这些进程似乎也能顺利运行。

有什么想法吗?我想知道这是否是我设置服务时的错误init.d?我遇到了 Wikibase 的一位项目经理,她从未见过此错误,而且 Blazegraph 似乎没有记录此错误,这让我认为这是一个服务错误?

相关内容