ML Pipeline Guide for Slurm

This is a guide for building and running ML pipelines with Slurm + Grafana + Prometheus.

Overview

This guide explains how to build and run ML training pipelines on Slurm-managed GPU clusters. The pipeline fine-tunes a large language model (20B parameters) using QLoRA and is structured as a chain of dependent Slurm jobs.

Code Repository: https://github.com/ori-edge/slurm-ml-pipelines

Key Features:

  • Modular 4-stage pipeline (Setup → Preprocessing → Training → Evaluation)
  • Automatic job sequencing via Slurm dependencies
  • GPU resource allocation per stage
  • Real-time monitoring with Prometheus + Grafana

Assumptions

This guide assumes the following are already in place:

Infrastructure

  • VM provisioned with SSH access (IP address known)
  • Slurm installed and configured (controller + compute node running)
  • GPUs configured in Slurm (2x H100 80GB with GRES setup)
  • CUDA drivers installed (nvidia-smi works)

Software

  • Python 3 installed with python3-venv package
  • Docker and Docker Compose installed
  • Monitoring stack deployed (Prometheus, Grafana, DCGM exporter containers)
  • Slurm exporter configured as a systemd service
  • Pipeline code already deployed at ~/ml-pipeline/ on the VM

Network & Access

  • SSH key configured at ~/.ssh/id_rsa for VM access
  • HuggingFace account with API token for model access

For initial Slurm setup, refer to the Slurm Setup Guide.


Pipeline Architecture

┌─────────────────────────────────────────────────────────────────────────────────────┐
│                                ML PIPELINE WORKFLOW                                 │
├─────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                     │
│  ┌────────────────┐   ┌────────────────┐   ┌────────────────┐   ┌────────────────┐  │
│  │     Job 1      │   │     Job 2      │   │     Job 3      │   │     Job 4      │  │
│  │     SETUP      │──▶│   PREPROCESS   │──▶│    TRAINING    │──▶│   EVALUATION   │  │
│  │                │   │                │   │                │   │                │  │
│  │  01_setup.py   │   │ 02_preproc.py  │   │ 03_training.py │   │   04_eval.py   │  │
│  └────────────────┘   └────────────────┘   └────────────────┘   └────────────────┘  │
│          │                    │                    │                    │           │
│          ▼                    ▼                    ▼                    ▼           │
│  ┌────────────────┐   ┌────────────────┐   ┌────────────────┐   ┌────────────────┐  │
│  │ • Validate     │   │ • Load data    │   │ • Load model   │   │ • Load model   │  │
│  │   environment  │   │ • Tokenize     │   │ • Apply LoRA   │   │ • Generate     │  │
│  │ • Cache model  │   │ • Pack seqs    │   │ • Train DDP    │   │ • Calc metrics │  │
│  │ • Cache data   │   │ • Mask prompts │   │ • Save ckpts   │   │ • Save results │  │
│  └────────────────┘   └────────────────┘   └────────────────┘   └────────────────┘  │
│                                                                                     │
│  Resources:           Resources:           Resources:           Resources:          │
│  CPU: 8 cores         CPU: 16 cores        CPU: 32 cores        CPU: 8 cores        │
│  GPU: 1               GPU: 0 (CPU-only)    GPU: 2               GPU: 1              │
│  Time: 30 min         Time: 30 min         Time: 2 hours        Time: 30 min        │
│                                                                                     │
└─────────────────────────────────────────────────────────────────────────────────────┘

Core Concepts

Job Dependencies

Slurm job dependencies ensure jobs execute in the correct order. A dependent job remains in PENDING state until its prerequisite completes successfully.

For ML pipelines, use afterok - this ensures each stage only runs if the previous stage succeeded.

Creating a Job Chain:

# Submit Job 1 (no dependencies)
JOB_ID_1=$(sbatch --parsable jobs/01_setup.sbatch)

# Submit Job 2 (depends on Job 1)
JOB_ID_2=$(sbatch --parsable --dependency=afterok:$JOB_ID_1 jobs/02_preprocessing.sbatch)

# Submit Job 3 (depends on Job 2)
JOB_ID_3=$(sbatch --parsable --dependency=afterok:$JOB_ID_2 jobs/03_training.sbatch)

# Submit Job 4 (depends on Job 3)
JOB_ID_4=$(sbatch --parsable --dependency=afterok:$JOB_ID_3 jobs/04_evaluation.sbatch)

What happens:

  1. All 4 jobs are submitted immediately to Slurm
  2. Job 1 starts running (state: RUNNING)
  3. Jobs 2, 3, 4 wait in queue (state: PENDING with reason: Dependency)
  4. When Job 1 completes successfully → Job 2 starts
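  5. If any job fails → its dependents never start: by default they stay PENDING with reason DependencyNeverSatisfied until removed with scancel (they are cancelled automatically only if submitted with --kill-on-invalid-dep=yes or if the scheduler is configured to do so)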
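
The same chain can also be driven from Python. The sketch below is illustrative and is not the repository's own submission script: sbatch_submit and submit_chain are hypothetical helper names, and the only Slurm features assumed are the --parsable and --dependency=afterok options already used above.

```python
import subprocess
from typing import Callable, List, Optional


def sbatch_submit(script: str, after_ok: Optional[str] = None) -> str:
    """Submit one batch script with sbatch and return its job ID."""
    cmd = ["sbatch", "--parsable"]
    if after_ok is not None:
        # Only start this job if the prerequisite exits successfully.
        cmd.append(f"--dependency=afterok:{after_ok}")
    cmd.append(script)
    out = subprocess.run(cmd, check=True, capture_output=True, text=True).stdout
    # --parsable prints "jobid" or "jobid;cluster"; keep only the job ID.
    return out.strip().split(";")[0]


def submit_chain(
    scripts: List[str],
    submit: Callable[[str, Optional[str]], str] = sbatch_submit,
) -> List[str]:
    """Submit scripts in order, each depending on the one before it."""
    job_ids: List[str] = []
    previous: Optional[str] = None
    for script in scripts:
        previous = submit(script, previous)
        job_ids.append(previous)
    return job_ids
```

Calling submit_chain with the four jobs/*.sbatch paths would submit them in order. The submit parameter exists so the chaining logic can be exercised with a stand-in function on a machine without Slurm.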

Resource Allocation

Each pipeline stage has different resource requirements. Slurm allocates what each job requests, so each stage should request only what it needs.

Resource Directives in sbatch:

#SBATCH --cpus-per-task=32    # Number of CPU cores
#SBATCH --mem=128G            # Memory limit
#SBATCH --gres=gpu:h100:2     # GPU type and count
#SBATCH --time=02:00:00       # Maximum runtime (HH:MM:SS)

Resource Planning by Stage:

Stage          CPUs  Memory  GPUs  Time    Rationale
Setup          8     32 GB   1     30 min  Model download, validation
Preprocessing  16    64 GB   0     30 min  CPU-bound tokenization
Training       32    128 GB  2     2 h     Multi-GPU training
Evaluation     8     32 GB   1     30 min  Single-GPU inference

Why allocate different resources?

  • Efficiency: Don't waste GPUs on CPU-bound tasks (preprocessing)
  • Throughput: Other jobs can use freed resources
  • Cost: GPU time is consumed only by the stages that actually need a GPU
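
As a rough worked example of the efficiency argument, the sketch below compares the GPU-hours reserved by per-stage requests with a hypothetical single job that holds the peak GPU count for the whole pipeline. It uses the time limits from the resource table as upper bounds, so real usage would normally be lower. The names STAGES, gpu_hours_per_stage and gpu_hours_single_job are illustrative.

```python
# (stage, GPUs requested, time limit in hours), taken from the resource table
STAGES = [
    ("Setup", 1, 0.5),
    ("Preprocessing", 0, 0.5),
    ("Training", 2, 2.0),
    ("Evaluation", 1, 0.5),
]


def gpu_hours_per_stage(stages):
    """GPU-hours reserved when each stage requests only its own GPUs."""
    return sum(gpus * hours for _, gpus, hours in stages)


def gpu_hours_single_job(stages):
    """GPU-hours reserved if one job held the peak GPU count throughout."""
    peak_gpus = max(gpus for _, gpus, _ in stages)
    total_hours = sum(hours for _, _, hours in stages)
    return peak_gpus * total_hours


print(gpu_hours_per_stage(STAGES))   # 5.0
print(gpu_hours_single_job(STAGES))  # 7.0
```

Under these assumptions the staged pipeline reserves at most 5 GPU-hours, against 7 for a monolithic 2-GPU job.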

Slurm Batch Scripts

Each pipeline stage has a corresponding .sbatch file that defines:

  1. Resource requirements (SBATCH directives)
  2. Environment setup (activate venv, set paths)
  3. Script execution

Anatomy of a Batch Script:

#!/bin/bash
#SBATCH --job-name=03_training # Job identifier
#SBATCH --output=logs/03_training_%j.out # Stdout (%j = job ID)
#SBATCH --error=logs/03_training_%j.err # Stderr
#SBATCH --partition=gpu # Which partition to use
#SBATCH --account=default # Accounting/billing account

# Resource allocation
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
#SBATCH --gres=gpu:h100:2
#SBATCH --time=02:00:00

# Environment setup
cd ~/ml-pipeline
source venv/bin/activate

# Execute the Python script
python scripts/03_training.py

Key Points:

  • %j in output path → replaced with job ID (e.g., 03_training_12345.out)
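  • Activate the environment inside the script: the job runs in a new non-interactive shell, and although sbatch exports your submission-time environment variables by default, relying on a venv activated in your login shell is fragile
  • The batch script exits with the status of its last command, so the Python script's exit code becomes the job's exit code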
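
Because afterok dependencies key off the job's exit code, each stage script should exit non-zero on failure. An uncaught Python exception already does this (exit status 1). The sketch below only makes that contract explicit. run_stage and train are hypothetical names and are not taken from the repository.

```python
import traceback
from typing import Callable


def run_stage(stage: Callable[[], None]) -> int:
    """Run one pipeline stage and translate the outcome into an exit code."""
    try:
        stage()
    except Exception:
        # Log the full traceback to stderr (the job's .err file), then
        # report failure so afterok dependents are not started.
        traceback.print_exc()
        return 1
    return 0


def train() -> None:
    """Placeholder for the real training logic."""


# At the bottom of a stage script one would typically write:
#     import sys
#     sys.exit(run_stage(train))
```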

Pipeline Stages

Stage 1: Environment Setup

Purpose: Validate environment, download and cache model/dataset.

Script: scripts/01_setup.py
Batch Job: jobs/01_setup.sbatch

What it does:

  1. Validates CUDA/GPU availability
  2. Installs Python dependencies (if needed)
  3. Downloads model from HuggingFace Hub → caches locally
  4. Downloads dataset → caches locally
  5. Saves configuration for subsequent stages

Resource Requirements:

#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:h100:1 # Need 1 GPU to validate CUDA
#SBATCH --time=00:30:00

Why run this as a separate job?

  • Downloads can fail (network issues) → retry without re-running training
  • Model caching saves time on subsequent pipeline runs
  • Validates environment before committing to long training

Stage 2: Data Preprocessing

Purpose: Transform raw data into training-ready format.

Script: scripts/02_preprocessing.py
Batch Job: jobs/02_preprocessing.sbatch

What it does:

  1. Loads raw dataset (prompt-completion pairs)
  2. Applies tokenizer chat template
  3. Prompt masking: Sets labels to -100 for prompt tokens (model only learns to predict responses)
  4. Packing: Concatenates samples into fixed 1024-token blocks (maximizes GPU utilization)
  5. Saves processed data to disk
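
Steps 3 and 4 can be illustrated on plain lists of token IDs. This is a minimal sketch and not the code in scripts/02_preprocessing.py: mask_prompt and pack are hypothetical names, the trailing tokens that do not fill a whole block are simply dropped, and -100 is used because it is the default ignore index of PyTorch's cross-entropy loss.

```python
from typing import Dict, List, Tuple

IGNORE_INDEX = -100  # label value skipped by the loss


def mask_prompt(
    prompt_ids: List[int], completion_ids: List[int]
) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and completion; hide prompt tokens from the loss."""
    input_ids = prompt_ids + completion_ids
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels


def pack(
    examples: List[Tuple[List[int], List[int]]], block_size: int
) -> List[Dict[str, List[int]]]:
    """Concatenate all examples and cut them into fixed-size blocks."""
    all_ids: List[int] = []
    all_labels: List[int] = []
    for input_ids, labels in examples:
        all_ids.extend(input_ids)
        all_labels.extend(labels)
    # Drop the tail that does not fill a complete block (a simplification).
    usable = (len(all_ids) // block_size) * block_size
    return [
        {
            "input_ids": all_ids[i : i + block_size],
            "labels": all_labels[i : i + block_size],
        }
        for i in range(0, usable, block_size)
    ]
```

With block_size=1024 every training sequence has the same length, so no compute is spent on padding tokens.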

Resource Requirements:

#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:0 # No GPU needed - CPU-bound
#SBATCH --time=00:30:00

Why no GPU?

  • Tokenization is CPU-bound
  • Frees GPUs for other users while preprocessing runs

Stage 3: Model Training

Purpose: Fine-tune the model using QLoRA.

Script: scripts/03_training.py
Batch Job: jobs/03_training.sbatch

What it does:

  1. Loads base model with 4-bit quantization (QLoRA)
  2. Applies LoRA adapters (trainable parameters)
  3. Loads preprocessed data
  4. Runs distributed training across GPUs (DDP)
  5. Saves checkpoints and final model

Resource Requirements:

#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:h100:2 # Both GPUs for DDP
#SBATCH --time=02:00:00

Training Configuration:

  • Batch size: 1 per GPU
  • Gradient accumulation: 8 steps
  • Effective batch size: 1 × 2 GPUs × 8 = 16
  • Learning rate: 2e-4
  • Max steps: 200
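
The arithmetic behind these settings, spelled out. The token count assumes every training sequence is a full 1024-token packed block from Stage 2, which is an assumption of this sketch and not a figure reported by the pipeline.

```python
per_device_batch_size = 1
num_gpus = 2
grad_accum_steps = 8
max_steps = 200
block_size = 1024  # packed block length from Stage 2 (assumed for every sequence)

# Sequences contributing to each optimizer step
effective_batch_size = per_device_batch_size * num_gpus * grad_accum_steps

# Totals over the whole run
sequences_seen = effective_batch_size * max_steps
tokens_seen = sequences_seen * block_size

print(effective_batch_size)  # 16
print(sequences_seen)        # 3200
print(tokens_seen)           # 3276800
```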

Stage 4: Model Evaluation

Purpose: Evaluate fine-tuned model quality.

Script: scripts/04_evaluation.py
Batch Job: jobs/04_evaluation.sbatch

What it does:

  1. Loads base model + LoRA adapters
  2. Generates responses for test samples
  3. Calculates perplexity (model confidence)
  4. Calculates BERTScore (semantic similarity)
  5. Saves metrics and sample outputs

Resource Requirements:

#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:h100:1 # Single GPU for inference
#SBATCH --time=00:30:00

Metrics Explained:

Metric        What it Measures                   Good Value
Perplexity    Model confidence (lower = better)  < 10
BERTScore F1  Semantic similarity to expected    > 0.7
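
Perplexity is conventionally the exponential of the mean negative log-likelihood per token. The sketch below shows that definition on a plain list of per-token log-probabilities. It is not the evaluation script's implementation, and the function name perplexity is illustrative.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp(mean negative log-likelihood) over the scored tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)
```

For intuition: a model that assigns probability 0.1 to every reference token has a perplexity of about 10, which is the threshold in the table above.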

Running the Pipeline

All pipeline commands must be executed from the VM where Slurm is installed. The pipeline code is already deployed at ~/ml-pipeline/.

Full Pipeline Execution

The recommended way to run the pipeline is using the submission script that chains all jobs.

Step 1: SSH into the VM (from your local machine)

ssh ubuntu@<your-vm-ip> -i ~/.ssh/id_rsa

Step 2: Navigate to pipeline directory (on the VM)

cd ~/ml-pipeline

Step 3: Ensure environment is set up (on the VM)

# Create .env with your HuggingFace token (if not already done)
echo "HF_TOKEN=your_token_here" > .env
chmod 600 .env

# Create logs directory (if not exists)
mkdir -p logs

# Create Python virtual environment (first time only)
# python3 -m venv venv
# source venv/bin/activate
# pip install transformers datasets accelerate peft bitsandbytes bert-score torch

# Activate Python virtual environment
source venv/bin/activate
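
This guide does not show how the pipeline scripts read .env. One minimal way to do it with only the standard library is sketched below. parse_env_file is a hypothetical helper and handles only simple KEY=value lines, not the full dotenv syntax.

```python
from typing import Dict


def parse_env_file(text: str) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Tolerate optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values
```

A script could then read the file with open(".env") and copy values["HF_TOKEN"] into os.environ before contacting the HuggingFace Hub.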

Step 4: Run the full pipeline (on the VM)

# Make script executable (only needed once)
chmod +x full_ml_pipeline.sh

# Run the pipeline
./full_ml_pipeline.sh

Expected Output:

==============================================
SUBMITTING ML PIPELINE (DEPENDENT CHAIN)
==============================================
Submitted Job 1 (Setup): 100
Submitted Job 2 (Preprocessing): 101 (depends on 100)
Submitted Job 3 (Training): 102 (depends on 101)
Submitted Job 4 (Evaluation): 103 (depends on 102)
==============================================
Pipeline submitted successfully!
Use 'squeue' to monitor the progress.
==============================================

What happens next:

  • All 4 jobs are in the queue
  • Job 1 starts immediately
  • Jobs 2-4 wait with (Dependency) reason
  • As each job completes, the next one starts

Individual Job Execution

For debugging or re-running specific stages (all commands on the VM):

# On the VM
cd ~/ml-pipeline
source venv/bin/activate

# Run only setup
sbatch jobs/01_setup.sbatch

# Run only preprocessing (after setup completed)
sbatch jobs/02_preprocessing.sbatch

# Run only training (after preprocessing completed)
sbatch jobs/03_training.sbatch

# Run only evaluation (after training completed)
sbatch jobs/04_evaluation.sbatch

Re-running a failed stage:

# Check which job failed
sacct --format=JobID,JobName,State,ExitCode -X

# If the original evaluation job is still pending (DependencyNeverSatisfied),
# cancel it first: scancel <job-id>

# Example: training failed. Resubmit it once and capture the new job ID
JOB_ID=$(sbatch --parsable jobs/03_training.sbatch)

# Then re-run evaluation with a dependency on the new training job
sbatch --dependency=afterok:$JOB_ID jobs/04_evaluation.sbatch
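
Finding the failed stage can also be scripted. The sketch below parses text produced by sacct -X -n -P --format=JobID,JobName,State,ExitCode, where -n drops the header and -P selects pipe-delimited output. failed_jobs is a hypothetical helper, and treating every state other than COMPLETED, RUNNING and PENDING as a failure is a simplifying assumption.

```python
from typing import List, Tuple

OK_STATES = ("COMPLETED", "RUNNING", "PENDING")


def failed_jobs(sacct_output: str) -> List[Tuple[str, str, str, str]]:
    """Return (job_id, name, state, exit_code) for jobs in a failure state."""
    failed = []
    for line in sacct_output.strip().splitlines():
        job_id, name, state, exit_code = line.split("|")
        # States such as FAILED, TIMEOUT or "CANCELLED by <uid>" land here.
        if not state.startswith(OK_STATES):
            failed.append((job_id, name, state, exit_code))
    return failed
```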

Monitoring

CLI Monitoring (on the VM)

These commands are run on the VM via SSH.

Watch job queue in real-time:

# On the VM
watch -n 5 squeue -u $USER

Example output during pipeline run:

JOBID  PARTITION  NAME           STATE      TIME     NODELIST   REASON
100    gpu        01_setup       COMPLETED  0:02:15  virtual-m
101    gpu        02_preprocess  RUNNING    0:05:32  virtual-m
102    gpu        03_training    PENDING    0:00                (Dependency)
103    gpu        04_evaluation  PENDING    0:00                (Dependency)

View job logs while running (on the VM):

# Stream current output
tail -f ~/ml-pipeline/logs/03_training_102.out

# View completed job output
cat ~/ml-pipeline/logs/03_training_102.out

Check job history (on the VM):

# Today's jobs
sacct --format=JobID,JobName,State,Elapsed,ExitCode -X

# Specific job details
sacct -j 102 --format=JobID,JobName,State,AllocCPUS,AllocTRES,Elapsed

Grafana Dashboard (via SSH Tunnel)

Grafana runs on the VM but is not directly accessible from the internet due to firewall restrictions. Use SSH tunneling to securely access it from your local machine.

Prerequisites: Ensure monitoring services are running (on the VM)

# Check if services are running
sudo systemctl status slurm-exporter
sudo docker ps # Should show prometheus, grafana, dcgm-exporter

# Start if needed
sudo systemctl start slurm-exporter
cd ~/monitoring && sudo docker compose up -d

Why SSH Tunneling?

  • The VM's ports 3000 (Grafana) and 9090 (Prometheus) are blocked by the cloud firewall
  • SSH tunneling forwards these ports through your existing SSH connection
  • Your browser connects to localhost:3000 which is tunneled to the VM's Grafana

Step 1: Create SSH tunnel (run on your LOCAL machine, not the VM)

Open a new terminal on your local machine and run:

ssh -L 3000:localhost:3000 -L 9090:localhost:9090 ubuntu@<your-vm-ip> -i ~/.ssh/id_rsa -N

Important: Keep this terminal open for the duration of your monitoring session. The -N flag means no remote command is executed - it just forwards ports.

Step 2: Access Grafana (on your LOCAL machine)

Open your browser and go to: http://localhost:3000

  • Username: admin
  • Password: slurm123

Step 3: View "Slurm + GPU Monitoring" dashboard

Available Panels:

Panel                   Description
GPU Utilization Gauges  Real-time % usage per GPU
GPU Memory Used         Memory consumption over time
Jobs Running/Pending    Current queue status
Current Job Runtimes    Bar chart of active job durations
Job Wait Times          How long jobs waited in queue

What to expect during training:

GPU Utilization Timeline:
─────────────────────────────────────────────────▶ time

GPU 0: ░░░░▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░
       Setup │   Training (50-60%)    │ Eval

GPU 1: ░░░░▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░
       idle  │   Training (50-60%)    │ idle
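
The same utilisation figures can be read programmatically from Prometheus through the tunnel on port 9090. The sketch below uses Prometheus's instant-query HTTP endpoint (/api/v1/query). The metric name DCGM_FI_DEV_GPU_UTIL and the gpu label are what the DCGM exporter normally publishes, but they are assumptions about this particular deployment, and the function names are illustrative.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict

PROMETHEUS_URL = "http://localhost:9090"  # local end of the SSH tunnel


def build_query_url(promql: str, base: str = PROMETHEUS_URL) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(payload: dict) -> Dict[str, float]:
    """Map each series' 'gpu' label to its current sample value."""
    values: Dict[str, float] = {}
    for series in payload["data"]["result"]:
        gpu = series["metric"].get("gpu", "unknown")
        # Each sample is [unix_timestamp, "value-as-string"].
        values[gpu] = float(series["value"][1])
    return values


def gpu_utilization() -> Dict[str, float]:
    """Query current GPU utilisation (requires the tunnel to be open)."""
    url = build_query_url("DCGM_FI_DEV_GPU_UTIL")
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_instant_vector(json.load(response))
```

During the training stage, gpu_utilization() would be expected to report values in the 50-60% range for both GPUs, matching the timeline above.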