/tmp is messed up, even ls /tmp fails

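We're seeing a small cluster become unusable. At first the behavior showed up on the compute nodes; now it is happening on the head node. I don't know whether this is the underlying cause, but something in the /tmp directory is definitely messed up: even ls /tmp hangs and cannot be killed. (/tmp lives under /, not on an NFS mount, and I can see everything else, e.g. /var/log, /proc, and so on.) Since many daemons and running jobs expect to access /tmp, it makes sense to me that this is a significant part of the problem.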

A hard reboot fixes the problem temporarily, but that is not a long-term solution.

Suggestions are welcome, beyond just running "ls -ld /tmp &", which gets no further than plain ls does...

Note: /tmp is only messed up while the problem is occurring; otherwise (like right now) it is fine:

[ldm@head ~]$ df -h /tmp
Filesystem      Size  Used Avail Use% Mounted on
/dev/md126      221G  143G   78G  65% /
[ldm@head ~]$ ls -ld /tmp
drwxrwxrwt. 12 root root 20480 Jan 26 08:45 /tmp

For reference:

uname -a
Linux head.cluster 3.10.0-1062.1.1.el7.x86_64 #1 SMP Fri Sep 13 22:55:44 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

cat /etc/os-release
NAME="CentOS Linux"
VERSION="7 (Core)"

The problem is intermittent. It just reappeared on one of the compute nodes, and the tail of dmesg -H shows:
[Feb 7 00:51] INFO: task kworker/4:2:20770 blocked for more than 120 seconds.
[ +0.007162] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ +0.008112] kworker/4:2 D ffff985b47709040 0 20770 2
[ +0.007307] Workqueue: events xprt_rdma_connect_worker [rpcrdma]
[ +0.006210] Call Trace:
[ +0.002638] [] schedule+0x29/0x70
[ +0.005159] [] schedule_timeout+0x221/0x2d0
[ +0.006035] [] ? mthca_modify_qp+0x8f/0x310 [ib_mthca]
[ +0.006988] [] wait_for_completion+0xfd/0x140
[ +0.006204] [] ? wake_up_state+0x20/0x20
[ +0.005776] [] __ib_drain_sq+0x181/0x1c0 [ib_core]
[ +0.006638] [] ? ib_sg_to_pages+0x1a0/0x1a0 [ib_core]
[ +0.006902] [] ib_drain_sq+0x25/0x30 [ib_core]
[ +0.006292] [] ib_drain_qp+0x12/0x30 [ib_core]
[ +0.006291] [] rpcrdma_ep_disconnect+0x58/0x150 [rpcrdma]
[ +0.007244] [] rpcrdma_ep_connect+0x139/0x400 [rpcrdma]
[ +0.007073] [] ? wake_up_atomic_t+0x30/0x30
[ +0.006022] [] xprt_rdma_connect_worker+0x33/0x60 [rpcrdma]
[ +0.007505] [] process_one_work+0x17f/0x440
[ +0.006022] [] worker_thread+0x126/0x3c0
[ +0.005765] [] ? manage_workers.isra.25+0x2a0/0x2a0
[ +0.006725] [] kthread+0xd1/0xe0
[ +0.005071] [] ? insert_kthread_work+0x40/0x40
[ +0.006285] [] ret_from_fork_nospec_begin+0x21/0x21
[ +0.006714] [] ? insert_kthread_work+0x40/0x40
ls -ld /tmp
drwxrwxrwt 8 root root 169 Feb 7 11:28 /tmp
ls -ld /boot
dr-xr-xr-x 5 root root 4096 Jan 16 12:09 /boot
ls -ld / hangs -- the NFS mount appears to have failed.
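
Because a hung ls takes the terminal with it, the manual checks above could be repeated with a small helper that never waits on the probe itself. This is only a rough sketch: probe_path is a made-up helper name (not an existing tool), the 5-second default is an arbitrary assumption, and "responsive" only means that stat returned in time, not that the path is healthy.

```shell
#!/bin/sh
# probe_path PATH [SECONDS]
# Hypothetical helper: runs stat on PATH in the background and polls for it,
# so the calling shell does not block even if stat ends up stuck in
# uninterruptible sleep (state D) on a dead mount.
probe_path() {
    path=$1
    limit=${2:-5}          # assumed default timeout, in seconds

    # Output goes to /dev/null so a stuck probe holds no pipe open.
    stat "$path" >/dev/null 2>&1 &
    pid=$!

    waited=0
    # kill -0 only tests whether the process still exists.
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$limit" ]; then
            echo "hung: $path (probe pid $pid)"
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done

    echo "responsive: $path"
    return 0
}
```

For example, `for p in / /tmp /boot; do probe_path "$p" 5; done` would print one line per path. A probe reported as hung is left behind rather than waited on, since a task blocked in the kernel may not respond to signals.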
