我仍然对这个特定的错误感到困惑,如果能得到任何帮助或正确的推动我将非常感激。
TL;DR:许多init.d
生成的脚本实例似乎正在运行,占用了服务器上的所有空间并干扰了基本功能。不确定如何纠正或进一步诊断。
完整背景:
因此,基本上,当我在一段时间后无法ssh
进入远程服务器时,错误就变得明显了。基本上,密码将不再起作用。我四处查看,发现很可能是因为系统空间不足,没有空间检查密码。(我会注意到,使用 ssh-agent 并生成密钥可以以某种方式用于登录;只是不是普通的 ssh。)
我syslog
发现每天之后00:00:00
都会打印以下内容:
Jun 21 00:00:18 SERVER-NAME-PLACEHOLDER systemd[1]: logrotate.service: Succeeded.
Jun 21 00:00:18 SERVER-NAME-PLACEHOLDER systemd[1]: Finished Rotate log files.
Jun 21 00:05:01 SERVER-NAME-PLACEHOLDER snapd[545]: autorefresh.go:528: Cannot prepare auto-refresh change due to a permanent network error: persistent network error: Post https://api.snapcraft.io/v2/snaps/refresh: dial tcp: lookup api.snapcraft.io: Temporary failure in name resolution
Jun 21 00:05:01 SERVER-NAME-PLACEHOLDER snapd[545]: stateengine.go:149: state ensure error: persistent network error: Post https://api.snapcraft.io/v2/snaps/refresh: dial tcp: lookup api.snapcraft.io: Temporary failure in name resolution
Jun 21 00:09:01 SERVER-NAME-PLACEHOLDER CRON[169859]: (root) CMD ( [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system ]; then /usr/lib/php/sessionclean; fi)
Jun 21 00:09:02 SERVER-NAME-PLACEHOLDER systemd[1]: Starting Clean php session files...
Jun 21 00:09:02 SERVER-NAME-PLACEHOLDER kernel: [467998.141843] systemd-journald[170]: Failed to save stream data /run/systemd/journal/streams/8:4052447: No space left on device
Jun 21 00:09:02 SERVER-NAME-PLACEHOLDER systemd[1]: phpsessionclean.service: Succeeded.
Jun 21 00:09:02 SERVER-NAME-PLACEHOLDER systemd[1]: Finished Clean php session files.
Jun 21 00:09:03 SERVER-NAME-PLACEHOLDER systemd-networkd[489]: Failed to save network state to /run/systemd/netif/state: No space left on device
Jun 21 00:09:03 SERVER-NAME-PLACEHOLDER systemd-timesyncd[427]: Network configuration changed, trying to establish connection.
Jun 21 00:09:03 SERVER-NAME-PLACEHOLDER systemd-resolved[492]: Failed to write private resolv.conf contents: No space left on device
Jun 21 00:09:14 SERVER-NAME-PLACEHOLDER systemd-timesyncd[427]: Timed out waiting for reply from 10.5.0.5:123 (10.5.0.5).
Jun 21 00:17:01 SERVER-NAME-PLACEHOLDER CRON[169924]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Jun 21 00:30:01 SERVER-NAME-PLACEHOLDER snapd[545]: autorefresh.go:528: Cannot prepare auto-refresh change due to a permanent network error: persistent network error: Post https://api.snapcraft.io/v2/snaps/refresh: dial tcp: lookup api.snapcraft.io: Temporary failure in name resolution
Jun 21 00:30:01 SERVER-NAME-PLACEHOLDER snapd[545]: stateengine.go:149: state ensure error: persistent network error: Post https://api.snapcraft.io/v2/snaps/refresh: dial tcp: lookup api.snapcraft.io: Temporary failure in name resolution
Jun 21 00:39:01 SERVER-NAME-PLACEHOLDER CRON[169961]: (root) CMD ( [ -x /usr/lib/php/sessionclean ] && if [ ! -d /run/systemd/system ]; then /usr/lib/php/sessionclean; fi)
Jun 21 00:39:04 SERVER-NAME-PLACEHOLDER systemd-networkd[489]: Failed to save network state to /run/systemd/netif/state: No space left on device
Jun 21 00:39:04 SERVER-NAME-PLACEHOLDER systemd-resolved[492]: Failed to write private resolv.conf contents: No space left on device
Jun 21 00:39:04 SERVER-NAME-PLACEHOLDER systemd-timesyncd[427]: Network configuration changed, trying to establish connection.
Jun 21 00:39:04 SERVER-NAME-PLACEHOLDER systemd[1]: Starting Clean php session files...
Jun 21 00:39:04 SERVER-NAME-PLACEHOLDER kernel: [469799.368812] systemd-journald[170]: Failed to save stream data /run/systemd/journal/streams/8:4064685: No space left on device
Jun 21 00:39:04 SERVER-NAME-PLACEHOLDER systemd[1]: phpsessionclean.service: Succeeded.
Jun 21 00:39:04 SERVER-NAME-PLACEHOLDER systemd[1]: Finished Clean php session files.
这似乎表明这是一个空间问题。有趣的是,在服务器重置后,一切都开始正常工作,并且完全正常。
因此我开始调查典型的罪魁祸首:
df -i /
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/root 19200000 779105 18420895 5% /
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/root 146G 38G 108G 27% /
devtmpfs 3.8G 0 3.8G 0% /dev
tmpfs 3.8G 0 3.8G 0% /dev/shm
tmpfs 766M 766M 0 100% /run
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 3.8G 0 3.8G 0% /sys/fs/cgroup
/dev/loop0 27M 27M 0 100% /snap/amazon-ssm-agent/5163
/dev/loop1 56M 56M 0 100% /snap/core18/2344
/dev/loop2 62M 62M 0 100% /snap/core20/1494
/dev/loop3 26M 26M 0 100% /snap/amazon-ssm-agent/5656
/dev/loop4 71M 71M 0 100% /snap/lxd/19647
/dev/loop5 45M 45M 0 100% /snap/snapd/15904
/dev/loop6 47M 47M 0 100% /snap/snapd/16010
/dev/loop7 56M 56M 0 100% /snap/core18/2409
/dev/loop8 62M 62M 0 100% /snap/core20/1518
/dev/loop9 68M 68M 0 100% /snap/lxd/22753
tmpfs 766M 0 766M 0% /run/user/0
tmpfs 766M 20K 766M 1% /run/user/126
du -sh /
...
35G /
lsof / | grep deleted
[什么也没发生]
我还检查过htop
:
这似乎表明 Blazegraph 的多个用途正在运行并占用空间:
我对此有点困惑...所以我已将 Blazegraph(及其更新程序)设置init.d
为文件夹中的服务,但它们应该只启动一次,对吗?我想不出它为什么会如此频繁地出现。
我的init.d
样子是这样的:
#!/bin/sh
### BEGIN INIT INFO
# Provides: blazegraph
# Required-Start: $local_fs $network $named $time $syslog
# Required-Stop: $local_fs $network $named $time $syslog
# Default-Start: 2 3 4 5
# Default-Stop: 0 1 6
# Short-Description: blazegraph jetty server
# Description: Start blazegraph on a jetty server
### END INIT INFO
SCRIPT='sudo BLAZEGRAPH_OPTS="-DwikibaseConceptUri=https://placeholder.com" bash /var/lib/mediawiki/extensions/wikidata-query-rdf/dist/target/service-0.3.111-SNAPSHOT/runBlazegraph.sh'
PIDFILE=/var/run/blazegraph.pid
start() {
if [ -f /var/run/$PIDNAME ] && kill -0 $(cat /var/run/$PIDNAME); then
echo 'Service already running' >&2
return 1
fi
echo 'Starting service…' >&2
local CMD="$SCRIPT"
su -c "$CMD" > "$PIDFILE"
echo 'Service started' >&2
}
stop() {
if [ ! -f "$PIDFILE" ] || ! kill -0 $(cat "$PIDFILE"); then
echo 'Service not running' >&2
return 1
fi
echo 'Stopping service…' >&2
kill -15 $(cat "$PIDFILE") && rm -f "$PIDFILE"
echo 'Service stopped' >&2
}
uninstall() {
echo -n "Are you really sure you want to uninstall this service? That cannot be undone. [yes|No] "
local SURE
read SURE
if [ "$SURE" = "yes" ]; then
stop
rm -f "$PIDFILE"
#echo "Notice: log file is not be removed: '$LOGFILE'" >&2
update-rc.d -f <NAME> remove
rm -fv "$0"
fi
}
case "$1" in
start)
start
;;
stop)
stop
;;
uninstall)
uninstall
;;
retart)
stop
start
;;
*)
echo "Usage: $0 {start|stop|restart|uninstall}"
esac
此外,应用程序的一切似乎都正常工作?Blazegraph 可以更新并可查询;前端的更改显示正常。即使没有空间,这些进程似乎也能顺利运行。
有什么想法吗?我想知道这是否是我设置服务时的错误init.d
?我遇到了 Wikibase 的一位项目经理,她从未见过此错误,而且 Blazegraph 似乎没有记录此错误,这让我认为这是一个服务错误?