Problem with SLURM controller daemon on start-up

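I am trying to configure SLURM on an Ubuntu 23.10 system so that it uses MySQL through slurmdbd. This is a continuation of an earlier question of mine, which was solved through some random guessing...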

Interestingly, the SLURM controller (slurmctld) fails to start at boot. However, when I restart the service manually, it seems to work fine.

For example, if I run sudo service slurmctld status after booting, I see these messages:

Feb 03 17:10:26 mycomputer slurmctld[1682]: slurmctld: error: Sending PersistInit msg: Connection refused
Feb 03 17:10:26 mycomputer slurmctld[1682]: slurmctld: accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
Feb 03 17:10:26 mycomputer slurmctld[1682]: slurmctld: No memory enforcing mechanism configured.
Feb 03 17:10:27 mycomputer slurmctld[1682]: WARNING: MYSQL_OPT_RECONNECT is deprecated and will be removed in a future version.
Feb 03 17:10:27 mycomputer slurmctld[1682]: slurmctld: error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)
Feb 03 17:10:27 mycomputer slurmctld[1682]: slurmctld: fatal: You haven't inited this storage yet.
Feb 03 17:10:27 mycomputer systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Feb 03 17:10:27 mycomputer systemd[1]: slurmctld.service: Failed with result 'exit-code'.

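This is similar to the information in the log file under /var/log/. However, if I restart it with sudo service slurmctld restart, without changing any configuration files, it starts up and shows the following in the log: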

Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: Recovered information about 0 jobs
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: Recovered state of 0 reservations
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: read_slurm_conf: backup_controller not specified
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: Running as primary controller
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: No parameter for mcs plugin, default values set
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: mcs: MCSParameters = (null). ondemand set.
Feb 03 23:23:02 mycomputer slurmctld[30777]: slurmctld: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,...

It looks fine now.

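My only guess is that this has something to do with the startup order of the slurmdbd, slurmd, and slurmctld services. But I had always assumed the default order was correct. Perhaps that assumption is wrong?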

Answer 1

The default slurmctld.service and slurmd.service units are missing an ordering dependency on mysql.service. Let's add two (thanks to @Ray for the clarification).

Create a file named /etc/systemd/system/slurmctld.service.d/99-mysql-ordering-askubuntu-1502374.conf

[Unit]
# This will append the missing dependency to the defaults
After=slurmdbd.service

Create a file named /etc/systemd/system/slurmd.service.d/99-mysql-ordering-askubuntu-1502374.conf

[Unit]
# This will append the missing dependency to the defaults
After=slurmctld.service
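If you would rather not create the two files by hand, a small script along these lines could generate them. This is only a sketch: the STAGE scratch directory and the write_dropin helper are names made up for illustration, and the script deliberately writes into a temporary directory so the result can be reviewed before anything under /etc is touched.

```shell
#!/bin/sh
# Sketch only: stage the two drop-ins in a scratch directory for review.
# Nothing under /etc is modified by this script.
set -eu

DROPIN="99-mysql-ordering-askubuntu-1502374.conf"
STAGE="$(mktemp -d)"   # hypothetical staging directory, not a systemd convention

# write_dropin UNIT DEPENDENCY (hypothetical helper)
# Creates UNIT.d/$DROPIN under the staging tree with an After= line for DEPENDENCY.
write_dropin() {
    dir="$STAGE/etc/systemd/system/$1.d"
    mkdir -p "$dir"
    printf '[Unit]\nAfter=%s\n' "$2" > "$dir/$DROPIN"
}

write_dropin slurmctld.service slurmdbd.service
write_dropin slurmd.service slurmctld.service

echo "Staged drop-ins under $STAGE"
echo "To install: sudo cp -r $STAGE/etc/systemd/system/. /etc/systemd/system/"
```

The staged files contain the same [Unit] and After= lines as the snippets above, minus the comment. Copying them into place is left as a separate, explicit step.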

Then reboot.
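Once the machine is back up, one way to confirm that systemd picked up the drop-ins is to look at each unit's effective After= list. The sketch below assumes a systemd recent enough to support systemctl show --value (the version shipped with Ubuntu 23.10 should be); check_after and report are hypothetical helper names, not part of systemd or SLURM.

```shell
#!/bin/sh
# Sketch only: report whether the expected ordering dependencies are in effect.

# check_after UNIT DEPENDENCY (hypothetical helper)
# Succeeds if DEPENDENCY appears in UNIT's effective After= list.
check_after() {
    systemctl show -p After --value "$1" | tr ' ' '\n' | grep -qxF "$2"
}

# report UNIT DEPENDENCY (hypothetical helper)
# Prints a one-line summary instead of failing the script.
report() {
    if check_after "$1" "$2"; then
        echo "ok: $1 is ordered after $2"
    else
        echo "missing: $1 is not ordered after $2"
    fi
}

report slurmctld.service slurmdbd.service
report slurmd.service slurmctld.service
```

If both lines report ok, journalctl -b -u slurmctld should show whether the controller came up cleanly during this boot. Keep in mind that After= only affects ordering. It does not by itself cause the other unit to be started.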
