I am trying to configure SLURM on Ubuntu 23.10 so that it uses MySQL through slurmdbd. This is a continuation of an earlier question of mine, which I solved with some random guessing...
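Interestingly, the SLURM controller (slurmctld) fails to start at boot. However, when I restart the service manually, it seems to work fine.
For example, if I run sudo service slurmctld status after booting, I see these messages: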
Feb 03 17:10:26 mycomputer slurmctld[1682]: slurmctld: error: Sending PersistInit msg: Connection refused
Feb 03 17:10:26 mycomputer slurmctld[1682]: slurmctld: accounting_storage/slurmdbd: clusteracct_storage_p_register_ctld: Registering slurmctld at port 6817 with slurmdbd
Feb 03 17:10:26 mycomputer slurmctld[1682]: slurmctld: No memory enforcing mechanism configured.
Feb 03 17:10:27 mycomputer slurmctld[1682]: WARNING: MYSQL_OPT_RECONNECT is deprecated and will be removed in a future version.
Feb 03 17:10:27 mycomputer slurmctld[1682]: slurmctld: error: mysql_real_connect failed: 2002 Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)
Feb 03 17:10:27 mycomputer slurmctld[1682]: slurmctld: fatal: You haven't inited this storage yet.
Feb 03 17:10:27 mycomputer systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Feb 03 17:10:27 mycomputer systemd[1]: slurmctld.service: Failed with result 'exit-code'.
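This is similar to what appears in the log files under /var/log/. However, if I restart it with sudo service slurmctld restart, without changing any configuration files, it starts and shows the following in the log: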
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: Recovered information about 0 jobs
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: Recovered state of 0 reservations
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: read_slurm_conf: backup_controller not specified
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: select/cons_tres: select_p_reconfigure: select/cons_tres: reconfigure
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: select/cons_tres: part_data_create_array: select/cons_tres: preparing for 1 partitions
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: Running as primary controller
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: No parameter for mcs plugin, default values set
Feb 03 23:22:57 mycomputer slurmctld[30777]: slurmctld: mcs: MCSParameters = (null). ondemand set.
Feb 03 23:23:02 mycomputer slurmctld[30777]: slurmctld: SchedulerParameters=default_queue_depth=100,max_rpc_cnt=0,max_sched_time=2,partition_job_depth=0,sched_max_job_start=0,...
Now it looks fine.
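My only guess is that this might be related to the startup order of the slurmdbd, slurmd, and slurmctld services. But I had always assumed the default order was correct. Perhaps that assumption is wrong?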
Answer 1
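The default slurmctld.service and slurmd.service units lack an ordering dependency on mysql.service. Let's add two (thanks to @Ray for the clarification): the drop-ins below start slurmctld after slurmdbd, and slurmd after slurmctld.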
Create a file named /etc/systemd/system/slurmctld.service.d/99-mysql-ordering-askubuntu-1502374.conf:
[Unit]
# This will append the missing dependency to the defaults
After=slurmdbd.service
Create a file named /etc/systemd/system/slurmd.service.d/99-mysql-ordering-askubuntu-1502374.conf:
[Unit]
# This will append the missing dependency to the defaults
After=slurmctld.service
Then reboot.
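If you would rather script the two drop-ins than create them by hand, the following is a minimal sketch. The function name write_dropin and its optional third argument (an alternative root directory, handy for a dry run) are assumptions made for illustration and are not part of SLURM or systemd. The file name and the After= values are the ones used in this answer.

```shell
#!/bin/sh
# Hypothetical helper (not part of SLURM or systemd): writes an ordering
# drop-in for UNIT so that it starts after AFTER_UNIT.
# Usage: write_dropin UNIT AFTER_UNIT [ROOT]
# ROOT defaults to /etc/systemd/system; pass a scratch directory for a dry run.
write_dropin() {
    wd_unit="$1"
    wd_after="$2"
    wd_root="${3:-/etc/systemd/system}"
    wd_dir="$wd_root/$wd_unit.d"

    # Create the drop-in directory, e.g. slurmctld.service.d
    mkdir -p "$wd_dir" || return 1

    # After= in a drop-in is appended to the unit's existing ordering list
    printf '%s\n' '[Unit]' "After=$wd_after" \
        > "$wd_dir/99-mysql-ordering-askubuntu-1502374.conf"
}

# Example, run as root:
#   write_dropin slurmctld.service slurmdbd.service
#   write_dropin slurmd.service slurmctld.service
#   systemctl daemon-reload
#   systemctl show -p After slurmctld.service
```

After systemctl daemon-reload, the output of systemctl show -p After slurmctld.service should list slurmdbd.service if the drop-in was picked up. A reboot is still the real test, since the failure only shows up at boot.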