Slurm Setup Guide for Ori Infrastructure
Complete guide for installing and configuring Slurm on an Ori VM for AI/ML workloads with GPU support and job accounting.
Table of Contents
- Overview
- Prerequisites
- Installation Steps
- Starting Services
- Job Submission
- Monitoring and Management
- Troubleshooting
Overview
This guide covers setting up a single-node Slurm cluster on an Ori VM that acts as both:
- Controller node (slurmctld) - Manages the cluster and schedules jobs
- Compute node (slurmd) - Executes the jobs
- Job tracker (slurmdbd) - Tracks all historical jobs
Key Features:
- GPU scheduling (2x NVIDIA H100 80GB GPUs)
- Job accounting and history tracking
- Resource isolation with cgroups
- Job dependencies and chaining
- MariaDB backend for persistent job records
Prerequisites
VM Specifications:
- OS: Ubuntu 24.04 LTS
- CPUs: 48 cores
- RAM: 483 GB
- GPUs: 2x NVIDIA H100 SXM 80GB
- Storage: Sufficient space for Slurm state and logs
Network Requirements:
- SSH access to the VM
- Hostname resolution configured
Installation Steps
1. System Update and Dependencies
Update the system and install required packages:
sudo apt update && sudo apt upgrade -y
sudo apt install -y \
build-essential \
git \
wget \
curl \
munge \
libmunge-dev \
mariadb-client \
libmariadb-dev \
libhwloc-dev \
libjson-c-dev \
libhttp-parser-dev \
libyaml-dev \
libjwt-dev \
libdbus-1-dev \
python3 \
python3-pip
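Optionally, spot-check that the core build dependencies installed cleanly before compiling Slurm (dpkg reports "install ok installed" for each package that is present):
# Verify a few of the packages the Slurm build relies on
dpkg -s munge libmunge-dev libhwloc-dev libjson-c-dev | grep -E '^(Package|Status)'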
2. MUNGE Authentication
MUNGE provides authentication between Slurm components. All nodes in the cluster must have the same MUNGE key.
Generate MUNGE Key
Check if MUNGE key already exists:
ls -la /etc/munge/munge.key
If the key doesn't exist, generate it manually:
# Create MUNGE key directory if it doesn't exist
sudo mkdir -p /etc/munge
# Generate a new MUNGE key
sudo dd if=/dev/urandom of=/etc/munge/munge.key bs=1 count=1024
# Alternative: Use create-munge-key if available
sudo create-munge-key -f
Set Correct Permissions
MUNGE is very strict about file permissions. Set the correct ownership and permissions:
# Set ownership to munge user
sudo chown munge:munge /etc/munge/munge.key
# Set restrictive permissions (only munge user can read)
sudo chmod 400 /etc/munge/munge.key
# Verify permissions
ls -la /etc/munge/munge.key
# Expected output: -r-------- 1 munge munge 1024 <date> /etc/munge/munge.key
Verify MUNGE Key
Check that the key exists and is readable by the munge user:
# Verify key is readable by munge
sudo -u munge cat /etc/munge/munge.key > /dev/null && echo "Key is readable" || echo "Key is NOT readable"
Start and Enable MUNGE
# Enable MUNGE to start on boot
sudo systemctl enable munge
# Start MUNGE service
sudo systemctl start munge
# Verify MUNGE is running
sudo systemctl status munge
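A quick end-to-end check is to encode a credential and immediately decode it on the same host; unmunge should report STATUS: Success (0):
# Round-trip test of MUNGE authentication
munge -n | unmunge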
Create Slurm user:
sudo groupadd -g 64030 slurm
sudo useradd -u 64030 -g slurm -s /bin/bash -d /var/lib/slurm slurm
3. Slurm Installation
Download and compile Slurm from source:
cd /tmp
wget https://download.schedmd.com/slurm/slurm-24.05.3.tar.bz2
tar -xjf slurm-24.05.3.tar.bz2
cd slurm-24.05.3
./configure \
--prefix=/usr \
--sysconfdir=/etc/slurm \
--with-munge \
--with-hwloc \
--with-json \
--with-http-parser \
--with-yaml \
--with-jwt \
--enable-pam
make -j$(nproc)
sudo make install
Verify installation:
slurmctld --version
# Output: slurm 24.05.3
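You can also confirm that the other daemons and client tools report the same release:
slurmd --version
sinfo --version
sbatch --version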
4. Configuration
Create Slurm Directories
sudo mkdir -p /etc/slurm /var/spool/slurm/ctld /var/spool/slurm/d /var/log/slurm
sudo chown -R slurm:slurm /etc/slurm /var/spool/slurm /var/log/slurm
Configuration Files
/etc/slurm/slurm.conf - Main Slurm configuration:
# Slurm Configuration File for Single-Node Setup
ClusterName=ori-slurm-poc
SlurmctldHost=virtual-machine
# Authentication
AuthType=auth/munge
CryptoType=crypto/munge
# Scheduling
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory,CR_ONE_TASK_PER_CORE
# Logging
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
# Process Tracking
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
# GRES (GPU) support
GresTypes=gpu
# State preservation
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
# Timeouts
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
# Job Defaults
DefMemPerCPU=2048
# Accounting
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=localhost
AccountingStoragePort=6819
AccountingStorageEnforce=associations
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
# Node Definitions (adjust CPUs and RealMemory based on your VM)
NodeName=virtual-machine CPUs=48 RealMemory=483000 Gres=gpu:h100:2 State=UNKNOWN
# Partition Definitions
PartitionName=gpu Nodes=virtual-machine Default=YES MaxTime=INFINITE State=UP OverSubscribe=NO
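Rather than guessing the values for the NodeName line, slurmd can report the hardware it detects on this host; copy the CPUs and RealMemory figures it prints (keeping RealMemory at or slightly below the reported value):
# Print the node configuration slurmd detects (CPUs, sockets, cores, threads, RealMemory)
sudo slurmd -C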
/etc/slurm/gres.conf - GPU resource configuration:
# GPU Resource Configuration
AutoDetect=nvml
NodeName=virtual-machine Name=gpu Type=h100 File=/dev/nvidia0
NodeName=virtual-machine Name=gpu Type=h100 File=/dev/nvidia1
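The File= entries above assume the GPUs are exposed as /dev/nvidia0 and /dev/nvidia1; confirm the device files and the driver's GPU inventory before relying on them:
# Confirm the device files referenced in gres.conf exist
ls -l /dev/nvidia0 /dev/nvidia1
# List the GPUs the driver reports
nvidia-smi -L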
/etc/slurm/cgroup.conf - Resource isolation to prevent jobs from using resources allocated to other jobs:
# Cgroup Configuration for Resource Isolation
ConstrainCores=yes
ConstrainDevices=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
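Ubuntu 24.04 boots with cgroup v2 by default, which Slurm 24.05 supports; you can confirm the VM is actually on cgroup v2 (expected output: cgroup2fs):
# Report the filesystem type mounted at the cgroup root
stat -fc %T /sys/fs/cgroup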
Systemd Service Files
/etc/systemd/system/slurmctld.service - Controller daemon:
[Unit]
Description=Slurm controller daemon
After=network.target munge.service
Requires=munge.service
[Service]
Type=simple
User=root
Group=root
ExecStart=/usr/sbin/slurmctld -D
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Restart=on-failure
[Install]
WantedBy=multi-user.target
/etc/systemd/system/slurmd.service - Compute node daemon:
[Unit]
Description=Slurm node daemon
After=network.target munge.service
Requires=munge.service
[Service]
Type=simple
User=root
Group=root
ExecStart=/usr/sbin/slurmd -D
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
LimitMEMLOCK=infinity
LimitSTACK=infinity
Restart=on-failure
[Install]
WantedBy=multi-user.target
5. Database Setup for Job Accounting
Install MariaDB and configure the Slurm Database Daemon (slurmdbd) for persistent job tracking.
Install MariaDB
sudo apt install -y mariadb-server mariadb-client
sudo systemctl enable mariadb
sudo systemctl start mariadb
Create Slurm Accounting Database
sudo mysql -e "CREATE DATABASE slurm_acct_db;"
sudo mysql -e "CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'slurmdbpass';"
sudo mysql -e "GRANT ALL ON slurm_acct_db.* TO 'slurm'@'localhost';"
sudo mysql -e "FLUSH PRIVILEGES;"
Configure slurmdbd
/etc/slurm/slurmdbd.conf:
# Slurm Database Daemon Configuration
AuthType=auth/munge
DbdHost=localhost
DebugLevel=info
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
SlurmUser=slurm
# Database connection
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePort=3306
StorageUser=slurm
StoragePass=slurmdbpass
StorageLoc=slurm_acct_db
Set permissions:
sudo chown slurm:slurm /etc/slurm/slurmdbd.conf
sudo chmod 600 /etc/slurm/slurmdbd.conf
/etc/systemd/system/slurmdbd.service:
[Unit]
Description=Slurm Database Daemon
After=network.target munge.service mariadb.service
Requires=munge.service mariadb.service
[Service]
Type=simple
User=slurm
Group=slurm
ExecStart=/usr/sbin/slurmdbd -D
ExecReload=/bin/kill -HUP $MAINPID
KillMode=process
LimitNOFILE=131072
Restart=on-failure
[Install]
WantedBy=multi-user.target
Starting Services
Start all Slurm services in the correct order:
# Reload systemd
sudo systemctl daemon-reload
# Start slurmdbd (database daemon)
sudo systemctl enable slurmdbd
sudo systemctl start slurmdbd
# Start slurmctld (controller)
sudo systemctl enable slurmctld
sudo systemctl start slurmctld
# Start slurmd (compute node)
sudo systemctl enable slurmd
sudo systemctl start slurmd
# Verify services
sudo systemctl status slurmdbd
sudo systemctl status slurmctld
sudo systemctl status slurmd
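Beyond systemctl, scontrol ping confirms the controller is actually answering RPCs:
# Should report that slurmctld at virtual-machine is UP
scontrol ping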
Setup Accounting
# Add cluster to accounting
sudo sacctmgr -i add cluster ori-slurm-poc
# Create default account
sudo sacctmgr -i add account default Description='Default Account' Organization='Ori'
# Add user to account
sudo sacctmgr -i add user ubuntu Account=default
# Verify
sacctmgr list associations
Activate the Node
# Update node state
sudo scontrol update nodename=virtual-machine state=resume
# Clear any error messages
sudo scontrol update nodename=virtual-machine reason="Node operational"
# Verify cluster status
sinfo
scontrol show node virtual-machine
Expected output:
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
gpu* up infinite 1 idle virtual-machine
Job Submission
Single Job Submission
Script: ~/scripts/submit-3min-job.sh
This script creates and submits a 3-minute test job that uses both GPUs.
#!/bin/bash
# Script to create and submit a 3-minute test job
TIMESTAMP=$(date +%s)
JOB_NAME="test-job-${TIMESTAMP}"
JOB_FILE="/tmp/${JOB_NAME}.sbatch"
echo "Creating job script: ${JOB_NAME}"
cat > ${JOB_FILE} << "JOBEND"
#!/bin/bash
#SBATCH --job-name=test-3min
#SBATCH --output=/tmp/test-3min-%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:h100:2
#SBATCH --time=00:10:00
#SBATCH --account=default
echo "=== Job Information ==="
echo "Job ID: $SLURM_JOB_ID"
echo "Job Name: $SLURM_JOB_NAME"
echo "Node: $SLURMD_NODENAME"
echo "GPUs allocated: $CUDA_VISIBLE_DEVICES"
echo "CPUs allocated: $SLURM_CPUS_PER_TASK"
echo "Start time: $(date)"
echo ""
echo "=== GPU Information ==="
nvidia-smi --query-gpu=index,name,memory.total --format=csv
echo ""
echo "=== Running for 3 minutes ==="
for i in {1..180}; do
if [ $((i % 30)) -eq 0 ]; then
echo "Progress: $i/180 seconds"
fi
sleep 1
done
echo ""
echo "=== Job Complete ==="
echo "End time: $(date)"
JOBEND
echo "Submitting job to Slurm..."
sbatch ${JOB_FILE}
Usage:
chmod +x ~/scripts/submit-3min-job.sh
~/scripts/submit-3min-job.sh
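For a quicker smoke test than a batch script, an interactive srun step can allocate a GPU and run nvidia-smi directly; the account and GRES names below match the configuration used in this guide:
# Run nvidia-smi on one allocated H100
srun --account=default --gres=gpu:h100:1 --cpus-per-task=4 nvidia-smi -L
# Or open an interactive shell on the node with a GPU allocated
srun --account=default --gres=gpu:h100:1 --pty bash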
Dependent Job Submission
Script: ~/scripts/submit-dependent-jobs.sh
This script demonstrates job chaining where Job B waits for Job A to complete successfully.
#!/bin/bash
# Script to submit two jobs with dependency: Job B starts after Job A completes
echo "=== Creating Job A (runs for 1 minute) ==="
cat > /tmp/job-a.sbatch << "JOBA"
#!/bin/bash
#SBATCH --job-name=job-A
#SBATCH --output=/tmp/job-a-%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:h100:1
#SBATCH --time=00:10:00
#SBATCH --account=default
echo "=== Job A Started ==="
echo "Job ID: $SLURM_JOB_ID"
echo "Start time: $(date)"
echo "Running for 1 minute..."
for i in {1..60}; do
if [ $((i % 15)) -eq 0 ]; then
echo "Job A progress: $i/60 seconds"
fi
sleep 1
done
echo "Job A completed at: $(date)"
JOBA
echo "=== Creating Job B (runs for 2 minutes, depends on Job A) ==="
cat > /tmp/job-b.sbatch << "JOBB"
#!/bin/bash
#SBATCH --job-name=job-B
#SBATCH --output=/tmp/job-b-%j.out
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:h100:1
#SBATCH --time=00:10:00
#SBATCH --account=default
echo "=== Job B Started ==="
echo "Job ID: $SLURM_JOB_ID"
echo "Start time: $(date)"
echo "Running for 2 minutes..."
for i in {1..120}; do
if [ $((i % 30)) -eq 0 ]; then
echo "Job B progress: $i/120 seconds"
fi
sleep 1
done
echo "Job B completed at: $(date)"
JOBB
echo ""
echo "=== Submitting Job A ==="
JOB_A_OUTPUT=$(sbatch /tmp/job-a.sbatch)
JOB_A_ID=$(echo $JOB_A_OUTPUT | awk '{print $4}')
echo "Job A submitted: $JOB_A_OUTPUT"
echo "Job A ID: $JOB_A_ID"
echo ""
echo "=== Submitting Job B (depends on Job A completing) ==="
JOB_B_OUTPUT=$(sbatch --dependency=afterok:$JOB_A_ID /tmp/job-b.sbatch)
JOB_B_ID=$(echo $JOB_B_OUTPUT | awk '{print $4}')
echo "Job B submitted: $JOB_B_OUTPUT"
echo "Job B ID: $JOB_B_ID"
echo ""
echo "=== Job Dependency Summary ==="
echo "Job A (ID: $JOB_A_ID) - Will run immediately"
echo "Job B (ID: $JOB_B_ID) - Will wait for Job A to complete successfully"
echo ""
echo "To monitor: squeue"
echo "Job A output: /tmp/job-a-$JOB_A_ID.out"
echo "Job B output: /tmp/job-b-$JOB_B_ID.out"
Usage:
chmod +x ~/scripts/submit-dependent-jobs.sh
~/scripts/submit-dependent-jobs.sh
Dependency Types:
- --dependency=afterok:JOBID - Start after the job completes successfully (exit code 0)
- --dependency=after:JOBID - Start once the job has begun execution
- --dependency=afternotok:JOBID - Start only if the job fails
- --dependency=afterany:JOBID - Start after the job terminates, regardless of state
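Parsing the job ID with awk works, but sbatch also supports a --parsable flag that prints only the job ID, which makes chaining less fragile; a minimal sketch using the job scripts above:
# --parsable prints just the job ID (plus the cluster name in multi-cluster setups)
JOB_A_ID=$(sbatch --parsable /tmp/job-a.sbatch)
sbatch --dependency=afterok:$JOB_A_ID /tmp/job-b.sbatch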
Monitoring and Management
Check Job Queue
# View running and pending jobs
squeue
# Detailed queue information
squeue -l
# Custom format
squeue -o '%.8i %.12j %.10u %.8T %.10M %.6D %.20b %R'
Check Job History
# View completed jobs (today)
sacct
# View all jobs from specific date
sacct --starttime=2026-01-05
# Show only main jobs (no steps)
sacct -X --format=JobID,JobName,State,AllocCPUS,Elapsed,Start,End
# Specific job details
sacct -j <job_id> --format=JobID,JobName,State,AllocCPUS,AllocTRES,Elapsed
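To confirm a past job actually received the GPUs it requested, include the TRES fields in the format string (the %40 suffix just widens the columns):
sacct -j <job_id> -X --format=JobID,JobName,ReqTRES%40,AllocTRES%40,Elapsed,State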
Cluster Status
# View partition status
sinfo
# Detailed node information
scontrol show node virtual-machine
# Show partition details
scontrol show partition gpu
Job Control
# Cancel a job
scancel <job_id>
# Hold a pending job (prevent from starting)
scontrol hold <job_id>
# Release a held job
scontrol release <job_id>
# Suspend a running job (pause)
scontrol suspend <job_id>
# Resume a suspended job
scontrol resume <job_id>
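Pending jobs can also be modified in place rather than cancelled and resubmitted, for example to inspect the full job record or change the time limit:
# Show the full record for a job
scontrol show job <job_id>
# Change the time limit of a pending job (raising it requires admin privileges)
sudo scontrol update JobId=<job_id> TimeLimit=00:30:00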
View Job Output
# Tail output file while job is running
tail -f /tmp/test-3min-<job_id>.out
# View completed job output
cat /tmp/test-3min-<job_id>.out
Troubleshooting
Services Not Starting
Check service status:
sudo systemctl status slurmctld
sudo systemctl status slurmd
sudo systemctl status slurmdbd
Check logs:
sudo tail -50 /var/log/slurm/slurmctld.log
sudo tail -50 /var/log/slurm/slurmd.log
sudo tail -50 /var/log/slurm/slurmdbd.log
Node in DRAIN or DOWN State
# Check node status
sinfo
scontrol show node virtual-machine
# Resume node
sudo scontrol update nodename=virtual-machine state=resume
# Clear error reason
sudo scontrol update nodename=virtual-machine reason="Operational"
Jobs Not Showing in sacct
By default, sacct only shows jobs from the current day. Use --starttime to see older jobs:
sacct --starttime=2026-01-01
GPU Not Detected
# Verify GPUs are visible
nvidia-smi
# Check GRES configuration
scontrol show node virtual-machine | grep Gres
# Restart slurmd
sudo systemctl restart slurmd
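If nvidia-smi sees the GPUs but Slurm does not, check what GRES slurmd itself detects; with AutoDetect=nvml this should list both H100s:
# Print the GRES configuration slurmd detects and exit
sudo slurmd -G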
Memory Mismatch Errors
If you see errors about RealMemory mismatch:
1. Check actual memory (in MB):
free -m | grep Mem | awk '{print $2}'
2. Update slurm.conf:
sudo vi /etc/slurm/slurm.conf
# Adjust RealMemory to match or be slightly less than the actual value
3. Restart services:
sudo systemctl restart slurmctld slurmd
Summary
You now have a fully functional Slurm cluster on a single Ori VM with:
- Controller and compute functionality
- GPU scheduling (2x H100)
- Job accounting with MariaDB
- Job dependency support
- Historical job tracking
Quick Command Reference
# Submit job
sbatch myjob.sbatch
# Monitor jobs
squeue
# View history
sacct --starttime=today
# Check cluster
sinfo
# Cancel job
scancel <job_id>
# Node status
scontrol show node