slurm-llnl
Following discussion with slurm-llnl’s maintainer, here’s a testing setup:
Create 3 VMs:
one with slurmd (compute node, 2 CPUs)
one with slurmctld
and one with slurmdbd.
The hostnames are the services they run (populate /etc/hostname and /etc/hosts accordingly); slurm.conf and slurmdbd.conf are given below. All three VMs share the same /etc/munge/munge.key file. Make sure munged is running everywhere (update-rc.d munge enable).
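For example, /etc/hosts on each VM could contain entries like these (the 192.168.122.x addresses are placeholders, substitute whatever your VMs actually use):
# example /etc/hosts entries; same file on all three VMs
192.168.122.10 slurmctld
192.168.122.11 slurmdbd
192.168.122.12 slurmd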
/etc/slurm-llnl/slurm.conf:
ControlMachine=slurmctld
AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
AccountingStorageEnforce=association
AccountingStorageHost=slurmdbd
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
NodeName=slurmd CPUs=2 State=UNKNOWN
PartitionName=debug Nodes=slurmd Default=YES MaxTime=INFINITE State=UP
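On the compute node, slurmd -C prints the detected hardware in slurm.conf syntax, a quick way to cross-check the CPUs=2 line above:
# show the node configuration as slurmd sees it
slurmd# slurmd -C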
/etc/slurm-llnl/slurmdbd.conf:
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdHost=localhost
DebugLevel=3
StorageHost=localhost
StorageLoc=slurm
StoragePass=shazaam
StorageType=accounting_storage/mysql
StorageUser=slurm
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/var/run/slurm-llnl/slurmdbd.pid
SlurmUser=slurm
ArchiveDir=/var/log/slurm-llnl/
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=yes
ArchiveSuspend=yes
PurgeEventAfter=1hour
PurgeJobAfter=1hour
PurgeResvAfter=1hour
PurgeStepAfter=1hour
PurgeSuspendAfter=1hour
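Since slurmdbd.conf contains the database password, it is worth restricting its permissions (newer Slurm releases refuse to start slurmdbd if the file is readable by others; with the versions in Debian at the time this is simply good practice):
# slurmdbd.conf holds StoragePass, keep it private to the slurm user
slurmdbd# chown slurm:slurm /etc/slurm-llnl/slurmdbd.conf
slurmdbd# chmod 600 /etc/slurm-llnl/slurmdbd.conf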
On slurmdbd, create a MySQL database called slurm, writable by user slurm with password shazaam:
CREATE DATABASE slurm;
GRANT ALL PRIVILEGES ON slurm.* TO 'slurm' IDENTIFIED BY 'shazaam';
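A quick sanity check that the grant works, then start slurmdbd so that the sacctmgr commands below have a daemon to talk to (sysvinit service names, consistent with the update-rc.d call above):
# confirm that user slurm can log in and sees its database
slurmdbd# mysql -u slurm -pshazaam -e 'SHOW DATABASES;'
# start the accounting daemon; slurmctld and slurmd are started the same way
slurmdbd# service slurmdbd start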
With sacctmgr (package slurm-client) add a cluster, an account and a user:
sacctmgr -i add cluster cluster
sacctmgr -i add account oliva Cluster=cluster
sacctmgr -i add user oliva Account=oliva
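To check the result:
# list the cluster/account/user associations just created
sacctmgr show associations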
Then run a couple of jobs as user oliva with srun or sbatch: you can see them in the cluster history with sacct.
# nodes status
slurmctld# sinfo
# send job
slurmctld# srun -l /bin/hostname
# list jobs
slurmctld# sacct
# reset node (e.g. stuck in 'alloc' state)
slurmctld# scontrol update NodeName=slurmd State=down reason=x
slurmctld# scontrol update NodeName=slurmd State=resume
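sbatch works the same way; a minimal batch script could look like this (file name, job name and output path are arbitrary examples):
test.sh:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=/home/oliva/test-%j.out
srun /bin/hostname
Submit it as user oliva with sbatch test.sh; the job then shows up in sacct like the srun ones.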
Given the slurmdbd.conf settings above, job information is purged at the beginning of the hour after the job has run and stored in two files called:
cluster_job_archive_2019-12-09T01:00:00_2019-12-09T01:59:59
cluster_step_archive_2019-12-09T01:00:00_2019-12-09T01:59:59
with the current date under /var/log/slurm-llnl/.
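You can watch them appear with a plain directory listing:
# list the archive files produced by the purge
slurmdbd# ls /var/log/slurm-llnl/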
CVE-2019-12838 note: to reproduce it, try to reload the archived files with:
sacctmgr archive load file=/var/log/slurm-llnl/...
See also https://slurm.schedmd.com/quickstart.html and https://slurm.schedmd.com/troubleshoot.html