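I've run into a strange problem whose root cause, as far as I can tell, is related to CGroups, directly or indirectly. Unfortunately, I don't know enough about SystemD or CGroups to troubleshoot it properly and diagnose the root cause. What I do know is that almost any user SystemD service (i.e. systemctl --user start xyz) fails to start, and the error messages are fairly consistent: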
user@1000.service: Failed to enable/disable controllers on cgroup /user.slice/user-1000.slice/user@1000.service, ignoring: Permission denied
user@1000.service: Main process exited, code=exited, status=219/CGROUP
This is on Ubuntu 22.04, and so far I have tried the following kernel versions:
- v6.5.0-17-generic
- v6.5.0-15-generic
Output of systemctl --failed:
UNIT LOAD ACTIVE SUB DESCRIPTION
● apcupsd.service loaded failed failed UPS power management daemon
● apply-cgroups.service loaded failed failed Apply CGroup settings
● dnsmasq.service loaded failed failed dnsmasq - A lightweight DHCP and caching DNS server
● grub-common.service loaded failed failed Record successful boot for GRUB
● kerneloops.service loaded failed failed Tool to automatically collect and submit kernel crash signatures
● nvidia-persistenced.service loaded failed failed NVIDIA Persistence Daemon
● rpc-statd.service loaded failed failed NFS status monitor for NFSv2/3 locking.
● user@1000.service loaded failed failed User Manager for UID 1000
● user@1004.service loaded failed failed User Manager for UID 1004
● user@130.service loaded failed failed User Manager for UID 130
● vmware-tools.service loaded failed failed VMWare Tools Service
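The apply-cgroups.service unit is something I wrote myself, with the following contents (I have since disabled this service and rebooted, but nothing seems to have changed):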
[Unit]
Description=Apply CGroup settings
[Service]
Type=oneshot
ExecStart=cgcreate -g cpu,cpuacct:/foo-app
ExecStart=cgset -r cpu.cfs_quota_us=750000 /foo-app
ExecStart=cgset -r memory.limit_in_bytes=8G /foo-app
[Install]
WantedBy=multi-user.target
Contents of /sys/fs/cgroup, which, according to my research, indicate that CGroups v2 is not available:
-rw-r--r-- 1 root root 0 Feb 14 01:17 cgroup.clone_children
-rw-r--r-- 1 root root 0 Feb 14 01:17 cgroup.procs
-r--r--r-- 1 root root 0 Feb 14 01:17 cgroup.sane_behavior
drwxr-xr-x 2 root root 0 Feb 14 01:20 dev-hugepages.mount
--w------- 1 root root 0 Feb 14 01:17 devices.allow
--w------- 1 root root 0 Feb 14 01:17 devices.deny
-r--r--r-- 1 root root 0 Feb 14 01:17 devices.list
drwxr-xr-x 2 root root 0 Feb 14 01:20 dev-mqueue.mount
-rw-r--r-- 1 root root 0 Feb 14 01:17 notify_on_release
drwxr-xr-x 2 root root 0 Feb 14 01:20 proc-fs-nfsd.mount
drwxr-xr-x 2 root root 0 Feb 14 01:20 proc-sys-fs-binfmt_misc.mount
-rw-r--r-- 1 root root 0 Feb 14 01:17 release_agent
drwxr-xr-x 2 root root 0 Feb 14 01:20 sys-fs-fuse-connections.mount
drwxr-xr-x 2 root root 0 Feb 14 01:20 sys-kernel-config.mount
drwxr-xr-x 2 root root 0 Feb 14 01:20 sys-kernel-debug.mount
drwxr-xr-x 2 root root 0 Feb 14 01:20 sys-kernel-tracing.mount
drwxr-xr-x 99 root root 0 Feb 14 01:45 system.slice
-rw-r--r-- 1 root root 0 Feb 14 01:17 tasks
drwxr-xr-x 5 root root 0 Feb 14 01:29 user.slice
cat /proc/filesystems shows that both cgroup and cgroup2 are available:
...
nodev cgroup
nodev cgroup2
...
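The check of which cgroup hierarchy is actually mounted at /sys/fs/cgroup can be sketched roughly as follows. This is a minimal sketch, assuming GNU coreutils stat is available; the helper name classify_cgroup_fs is hypothetical and only maps the filesystem type names that stat reports (cgroup2fs, tmpfs, cgroupfs) to a human-readable description.

```shell
#!/bin/sh
# Hypothetical helper: map the filesystem type name reported for
# /sys/fs/cgroup to a description of the cgroup layout.
classify_cgroup_fs() {
    case "$1" in
        cgroup2fs) echo "unified (cgroup v2)" ;;
        tmpfs)     echo "legacy or hybrid (cgroup v1 controllers mounted below)" ;;
        cgroupfs)  echo "single cgroup v1 hierarchy mounted directly" ;;
        *)         echo "unknown: $1" ;;
    esac
}

# stat -f -c %T prints the filesystem type name (GNU coreutils).
fstype=$(stat -f -c %T /sys/fs/cgroup 2>/dev/null || echo "unavailable")
classify_cgroup_fs "$fstype"

# List every cgroup mount with its type and mount options.
grep cgroup /proc/mounts || true
```

Note that /proc/filesystems only lists what the kernel supports, while the output above reflects what is mounted right now.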
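While reinstalling packages as a troubleshooting step, I kept seeing the following error message. I'm afraid I don't even understand what this service is or what it is trying to do: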
Failed to reload daemon: Failed to activate service 'org.freedesktop.systemd1': timed out (service_start_timeout=25000ms)
Finally, while following some troubleshooting advice, I found something that may provide a clue:
02:32:20 > systemd --user
Cannot determine cgroup we are running in: No medium found
Failed to allocate manager object: No medium found
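What does this mean? What kind of medium is it looking for? My main problem is that I don't know the tools well enough to diagnose the root cause of this issue, so I've been fumbling around without getting anywhere, which is why I'm turning to SO for answers.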