==========
slurm-llnl
==========

Following discussion with slurm-llnl's maintainer, here's a testing setup:

Create 3 VMs:

* one with *slurmd* (work node, 2 CPUs)
* one with *slurmctld*
* and one with *slurmdbd*.

The hostnames are the services they run (populate */etc/hostname* and
*/etc/hosts* accordingly). *slurm.conf* and *slurmdbd.conf* are below.
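For instance, a minimal */etc/hosts* shared by the 3 VMs could look like this
(the IP addresses are placeholders for illustration, adjust them to your VM
network):

::

  # placeholder addresses - adjust to your VM network
  192.168.122.11  slurmd
  192.168.122.12  slurmctld
  192.168.122.13  slurmdbd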
They all share the same */etc/munge/munge.key* file. Make sure munged is
running everywhere (*update-rc.d munge enable*).

*/etc/slurm-llnl/slurm.conf*:

::

  ControlMachine=slurmctld
  AuthType=auth/munge
  CryptoType=crypto/munge
  MpiDefault=none
  ProctrackType=proctrack/pgid
  ReturnToService=1
  SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
  SlurmctldPort=6817
  SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
  SlurmdPort=6818
  SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
  SlurmUser=slurm
  StateSaveLocation=/var/lib/slurm-llnl/slurmctld
  SwitchType=switch/none
  TaskPlugin=task/none
  InactiveLimit=0
  KillWait=30
  MinJobAge=300
  SlurmctldTimeout=120
  SlurmdTimeout=300
  Waittime=0
  FastSchedule=1
  SchedulerType=sched/backfill
  SchedulerPort=7321
  SelectType=select/linear
  AccountingStorageEnforce=association
  AccountingStorageHost=slurmdbd
  AccountingStorageType=accounting_storage/slurmdbd
  AccountingStoreJobComment=YES
  ClusterName=cluster
  JobCompType=jobcomp/linux
  JobAcctGatherFrequency=30
  JobAcctGatherType=jobacct_gather/none
  SlurmctldDebug=3
  SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
  SlurmdDebug=3
  SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
  NodeName=slurmd CPUs=2 State=UNKNOWN
  PartitionName=debug Nodes=slurmd Default=YES MaxTime=INFINITE State=UP

*/etc/slurm-llnl/slurmdbd.conf*:

::

  AuthType=auth/munge
  AuthInfo=/var/run/munge/munge.socket.2
  DbdHost=localhost
  DebugLevel=3
  StorageHost=localhost
  StorageLoc=slurm
  StoragePass=shazaam
  StorageType=accounting_storage/mysql
  StorageUser=slurm
  LogFile=/var/log/slurm-llnl/slurmdbd.log
  PidFile=/var/run/slurm-llnl/slurmdbd.pid
  SlurmUser=slurm
  ArchiveDir=/var/log/slurm-llnl/
  ArchiveEvents=yes
  ArchiveJobs=yes
  ArchiveResvs=yes
  ArchiveSteps=yes
  ArchiveSuspend=yes
  PurgeEventAfter=1hour
  PurgeJobAfter=1hour
  PurgeResvAfter=1hour
  PurgeStepAfter=1hour
  PurgeSuspendAfter=1hour

On *slurmdbd* create a MySQL database called *slurm* with write permission for
user *slurm* with password *shazaam*:

::

  CREATE DATABASE slurm;
  GRANT ALL PRIVILEGES ON slurm.* TO 'slurm' IDENTIFIED BY 'shazaam';

With *sacctmgr* (package *slurm-client*) add a cluster, an account and a user:

::

  sacctmgr -i add cluster cluster
  sacctmgr -i add account oliva Cluster=cluster
  sacctmgr -i add user oliva Account=oliva

Then run a couple of jobs as user *oliva* with *srun* or *sbatch*: you can see
them in the cluster history with *sacct*.

::

  # nodes status
  slurmctld# sinfo
  # send job
  slurmctld# srun -l /bin/hostname
  # list jobs
  slurmctld# sacct
  # reset node (e.g. stuck in 'alloc' state)
  slurmctld# scontrol update NodeName=slurmd State=down reason=x
  slurmctld# scontrol update NodeName=slurmd State=resume

Given the settings of the *slurmdbd.conf* above, this job information is purged
at the beginning of the hour after the job has run and is stored in two files
called:

::

  cluster_job_archive_2019-12-09T01:00:00_2019-12-09T01:59:59
  cluster_step_archive_2019-12-09T01:00:00_2019-12-09T01:59:59

with the current date under */var/log/slurm-llnl/*.

CVE-2019-12838 note: to reproduce, try to reload the files with the command:

::

  sacctmgr archive load file=/var/log/slurm-llnl/...

See also https://slurm.schedmd.com/quickstart.html and
https://slurm.schedmd.com/troubleshoot.html

| Copyright (C) 2020, 2021, 2022 Sylvain Beucler