==========
slurm-llnl
==========

Following discussion with slurm-llnl's maintainer, here's a testing setup:

Create 3 VMs:

* one with *slurmd* (work node, 2 CPUs)
* one with *slurmctld*
* and one with *slurmdbd*.

The hostnames are the services they run (populate */etc/hostname* and
*/etc/hosts* accordingly). *slurm.conf* and *slurmdbd.conf* are below.
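For instance, a minimal */etc/hosts* shared by the 3 VMs could look like this
(the IP addresses are placeholders for illustration, adjust them to your VM
network):

::

  # placeholder addresses - adjust to your VM network
  192.168.122.11  slurmd
  192.168.122.12  slurmctld
  192.168.122.13  slurmdbd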
They all share the same */etc/munge/munge.key* file. Make sure munged is
running everywhere (*update-rc.d munge enable*).

*/etc/slurm-llnl/slurm.conf*:

::

  ControlMachine=slurmctld
  AuthType=auth/munge
  CryptoType=crypto/munge
  MpiDefault=none
  ProctrackType=proctrack/pgid
  ReturnToService=1
  SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
  SlurmctldPort=6817
  SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
  SlurmdPort=6818
  SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
  SlurmUser=slurm
  StateSaveLocation=/var/lib/slurm-llnl/slurmctld
  SwitchType=switch/none
  TaskPlugin=task/none
  InactiveLimit=0
  KillWait=30
  MinJobAge=300
  SlurmctldTimeout=120
  SlurmdTimeout=300
  Waittime=0
  FastSchedule=1
  SchedulerType=sched/backfill
  SchedulerPort=7321
  SelectType=select/linear
  AccountingStorageEnforce=association
  AccountingStorageHost=slurmdbd
  AccountingStorageType=accounting_storage/slurmdbd
  AccountingStoreJobComment=YES
  ClusterName=cluster
  JobCompType=jobcomp/linux
  JobAcctGatherFrequency=30
  JobAcctGatherType=jobacct_gather/none
  SlurmctldDebug=3
  SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
  SlurmdDebug=3
  SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
  NodeName=slurmd CPUs=2 State=UNKNOWN
  PartitionName=debug Nodes=slurmd Default=YES MaxTime=INFINITE State=UP

*/etc/slurm-llnl/slurmdbd.conf*:

::

  AuthType=auth/munge
  AuthInfo=/var/run/munge/munge.socket.2
  DbdHost=localhost
  DebugLevel=3
  StorageHost=localhost
  StorageLoc=slurm
  StoragePass=shazaam
  StorageType=accounting_storage/mysql
  StorageUser=slurm
  LogFile=/var/log/slurm-llnl/slurmdbd.log
  PidFile=/var/run/slurm-llnl/slurmdbd.pid
  SlurmUser=slurm
  ArchiveDir=/var/log/slurm-llnl/
  ArchiveEvents=yes
  ArchiveJobs=yes
  ArchiveResvs=yes
  ArchiveSteps=yes
  ArchiveSuspend=yes
  PurgeEventAfter=1hour
  PurgeJobAfter=1hour
  PurgeResvAfter=1hour
  PurgeStepAfter=1hour
  PurgeSuspendAfter=1hour

On *slurmdbd* create a MySQL database called *slurm* with write permission for
user *slurm* with password *shazaam*:

::

  CREATE DATABASE slurm;
  GRANT ALL PRIVILEGES ON slurm.* TO 'slurm' IDENTIFIED BY 'shazaam';

With *sacctmgr* (package *slurm-client*) add a cluster, an account and a user:

::

  sacctmgr -i add cluster cluster
  sacctmgr -i add account oliva Cluster=cluster
  sacctmgr -i add user oliva Account=oliva

Then run a couple of jobs as user *oliva* with *srun* or *sbatch*: you can see
them in the cluster history with *sacct*.

::

  # nodes status
  slurmctld# sinfo
  # send job
  slurmctld# srun -l /bin/hostname
  # list jobs
  slurmctld# sacct
  # reset node (e.g. stuck in 'alloc' state)
  slurmctld# scontrol update NodeName=slurmd State=down reason=x
  slurmctld# scontrol update NodeName=slurmd State=resume

Given the settings of the *slurmdbd.conf* above, this job information is purged
at the beginning of the hour after the job has run and is stored in two files
called:

::

  cluster_job_archive_2019-12-09T01:00:00_2019-12-09T01:59:59
  cluster_step_archive_2019-12-09T01:00:00_2019-12-09T01:59:59

with the current date under */var/log/slurm-llnl/*.

CVE-2019-12838 note: to reproduce, try to reload the files with the command:

::

  sacctmgr archive load file=/var/log/slurm-llnl/...

See also https://slurm.schedmd.com/quickstart.html and
https://slurm.schedmd.com/troubleshoot.html

| Copyright (C) 2020, 2021, 2022 Sylvain Beucler