ML Pipeline Guide for Slurm

This is a guide for building and running ML pipelines with Slurm + Grafana + Prometheus.

Overview
Assumptions
Pipeline Architecture
Core Concepts
Pipeline Stages
Running the Pipeline
- Full Pipeline Execution
- Individual Job Execution
Monitoring
- CLI Monitoring
- Grafana Dashboard

Overview

This guide explains how to build and run ML training pipelines on Slurm-managed GPU clusters. The pipeline fine-tunes a large language model (20B parameters) using QLoRA and is structured as a chain of dependent Slurm jobs.

Code Repository: https://github.com/ori-edge/slurm-ml-pipelines

Key Features:

Modular 4-stage pipeline (Setup → Preprocessing → Training → Evaluation)
Automatic job sequencing via Slurm dependencies
GPU resource allocation per stage
Real-time monitoring with Prometheus + Grafana

Assumptions

This guide assumes the following are already in place:

Infrastructure

VM provisioned with SSH access (IP address known)
Slurm installed and configured (controller + compute node running)
GPUs configured in Slurm (2x H100 80GB with GRES setup)
CUDA drivers installed (nvidia-smi works)

Software

Python 3 installed with python3-venv package
Docker and Docker Compose installed
Monitoring stack deployed (Prometheus, Grafana, DCGM exporter containers)
Slurm exporter configured as a systemd service
Pipeline code already deployed at ~/ml-pipeline/ on the VM

Network & Access

SSH key configured at ~/.ssh/id_rsa for VM access
HuggingFace account with API token for model access

For initial Slurm setup, refer to the Slurm Setup Guide.

Pipeline Architecture

┌─────────────────────────────────────────────────────────────────────────────────────┐
│                                ML PIPELINE WORKFLOW                                 │
├─────────────────────────────────────────────────────────────────────────────────────┤
│                                                                                     │
│  ┌────────────────┐   ┌────────────────┐   ┌────────────────┐   ┌────────────────┐  │
│  │     Job 1      │   │     Job 2      │   │     Job 3      │   │     Job 4      │  │
│  │     SETUP      │──▶│   PREPROCESS   │──▶│    TRAINING    │──▶│   EVALUATION   │  │
│  │                │   │                │   │                │   │                │  │
│  │  01_setup.py   │   │ 02_preproc.py  │   │ 03_training.py │   │   04_eval.py   │  │
│  └────────────────┘   └────────────────┘   └────────────────┘   └────────────────┘  │
│          │                    │                    │                    │           │
│          ▼                    ▼                    ▼                    ▼           │
│  ┌────────────────┐   ┌────────────────┐   ┌────────────────┐   ┌────────────────┐  │
│  │ • Validate     │   │ • Load data    │   │ • Load model   │   │ • Load model   │  │
│  │   environment  │   │ • Tokenize     │   │ • Apply LoRA   │   │ • Generate     │  │
│  │ • Cache model  │   │ • Pack seqs    │   │ • Train DDP    │   │ • Calc metrics │  │
│  │ • Cache data   │   │ • Mask prompts │   │ • Save ckpts   │   │ • Save results │  │
│  └────────────────┘   └────────────────┘   └────────────────┘   └────────────────┘  │
│                                                                                     │
│  Resources:           Resources:           Resources:           Resources:          │
│  CPU: 8 cores         CPU: 16 cores        CPU: 32 cores        CPU: 8 cores        │
│  GPU: 1               GPU: 0 (CPU-only)    GPU: 2               GPU: 1              │
│  Time: 30 min         Time: 30 min         Time: 2 hours        Time: 30 min        │
│                                                                                     │
└─────────────────────────────────────────────────────────────────────────────────────┘

Core Concepts

Job Dependencies

Slurm job dependencies ensure jobs execute in the correct order. A dependent job remains in PENDING state until its prerequisite completes successfully.

For ML pipelines, use afterok - this ensures each stage only runs if the previous stage succeeded.

Creating a Job Chain:

# Submit Job 1 (no dependencies)
JOB_ID_1=$(sbatch --parsable jobs/01_setup.sbatch)

# Submit Job 2 (depends on Job 1)
JOB_ID_2=$(sbatch --parsable --dependency=afterok:$JOB_ID_1 jobs/02_preprocessing.sbatch)

# Submit Job 3 (depends on Job 2)
JOB_ID_3=$(sbatch --parsable --dependency=afterok:$JOB_ID_2 jobs/03_training.sbatch)

# Submit Job 4 (depends on Job 3)
JOB_ID_4=$(sbatch --parsable --dependency=afterok:$JOB_ID_3 jobs/04_evaluation.sbatch)

What happens:

All 4 jobs are submitted immediately to Slurm
Job 1 starts running (state: RUNNING)
Jobs 2, 3, 4 wait in queue (state: PENDING with reason: Dependency)
When Job 1 completes successfully → Job 2 starts
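If any job fails → its dependents never start. By default they stay PENDING with reason DependencyNeverSatisfied until you remove them with scancel; they are cancelled automatically only if the cluster sets kill_invalid_depend in SchedulerParameters or the jobs are submitted with --kill-on-invalid-dep=yes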
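The chaining logic above can also be sketched in Python, which is convenient when the list of stages is built programmatically. This is a minimal sketch, not the repository's submission script: `sbatch_submit` and `submit_chain` are hypothetical names, and the only Slurm features assumed are `sbatch --parsable` and `--dependency=afterok:<id>`, both used in the shell commands above.

```python
import subprocess
from typing import Callable, Sequence


def sbatch_submit(args: Sequence[str]) -> str:
    """Run sbatch with --parsable and return the job ID it prints."""
    result = subprocess.run(
        ["sbatch", "--parsable", *args],
        check=True, capture_output=True, text=True,
    )
    # --parsable prints "jobid" or "jobid;cluster"; keep only the ID.
    return result.stdout.strip().split(";")[0]


def submit_chain(
    scripts: Sequence[str],
    submit: Callable[[Sequence[str]], str] = sbatch_submit,
) -> list[str]:
    """Submit scripts in order; each job waits (afterok) on the previous one."""
    job_ids: list[str] = []
    for script in scripts:
        args = [script]
        if job_ids:
            # Every job after the first depends on the job submitted just before it.
            args.insert(0, f"--dependency=afterok:{job_ids[-1]}")
        job_ids.append(submit(args))
    return job_ids
```

On the VM, `submit_chain(["jobs/01_setup.sbatch", "jobs/02_preprocessing.sbatch", "jobs/03_training.sbatch", "jobs/04_evaluation.sbatch"])` would return the four job IDs in submission order. The `submit` parameter exists only so the chaining logic can be exercised without a Slurm installation.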

Resource Allocation

Each pipeline stage has different resource requirements. Slurm allocates what each job requests in its batch script, so every stage should request only what it actually uses.

Resource Directives in sbatch:

#SBATCH --cpus-per-task=32    # Number of CPU cores
#SBATCH --mem=64G             # Memory limit
#SBATCH --gres=gpu:h100:2     # GPU type and count
#SBATCH --time=02:00:00       # Maximum runtime (HH:MM:SS)

Resource Planning by Stage:

Stage	CPUs	Memory	GPUs	Time	Rationale
Setup	8	32GB	1	30min	Model download, validation
Preprocessing	16	64GB	0	30min	CPU-bound tokenization
Training	32	128GB	2	2h	Multi-GPU training
Evaluation	8	32GB	1	30min	Single-GPU inference

Why allocate different resources?

Efficiency: Don't waste GPUs on CPU-bound tasks (preprocessing)
Throughput: Other jobs can use freed resources
Cost: GPUs are held only by the stages that use them, so less GPU time is spent idle
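To make the trade-off concrete, the sketch below encodes the resource table as data, renders the matching #SBATCH lines, and compares the GPU-hours requested per stage with what a single job holding both GPUs for the whole pipeline would request. The names `STAGE_RESOURCES`, `sbatch_directives` and `requested_gpu_hours` are hypothetical, and omitting the `--gres` line for CPU-only stages is an assumption of this sketch rather than what the repository's batch files do.

```python
# Values copied from the resource planning table above.
STAGE_RESOURCES = {
    "setup":         {"cpus": 8,  "mem_gb": 32,  "gpus": 1, "time": "00:30:00"},
    "preprocessing": {"cpus": 16, "mem_gb": 64,  "gpus": 0, "time": "00:30:00"},
    "training":      {"cpus": 32, "mem_gb": 128, "gpus": 2, "time": "02:00:00"},
    "evaluation":    {"cpus": 8,  "mem_gb": 32,  "gpus": 1, "time": "00:30:00"},
}


def sbatch_directives(stage: str, gpu_type: str = "h100") -> list[str]:
    """Render the resource directives for one stage."""
    res = STAGE_RESOURCES[stage]
    lines = [
        f"#SBATCH --cpus-per-task={res['cpus']}",
        f"#SBATCH --mem={res['mem_gb']}G",
    ]
    if res["gpus"] > 0:
        # Assumption: CPU-only stages simply omit the GRES request.
        lines.append(f"#SBATCH --gres=gpu:{gpu_type}:{res['gpus']}")
    lines.append(f"#SBATCH --time={res['time']}")
    return lines


def hours(hhmmss: str) -> float:
    """Convert an HH:MM:SS time limit to hours."""
    h, m, s = (int(part) for part in hhmmss.split(":"))
    return h + m / 60 + s / 3600


def requested_gpu_hours() -> float:
    """Upper bound on GPU-hours when each stage requests only its own GPUs."""
    return sum(r["gpus"] * hours(r["time"]) for r in STAGE_RESOURCES.values())


total_hours = sum(hours(r["time"]) for r in STAGE_RESOURCES.values())
print(requested_gpu_hours())  # 5.0 GPU-hours with per-stage requests
print(2 * total_hours)        # 7.0 GPU-hours if one job held both GPUs throughout
```

These figures are upper bounds derived from the time limits, not measured runtimes.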

Slurm Batch Scripts

Each pipeline stage has a corresponding .sbatch file that defines:

Resource requirements (SBATCH directives)
Environment setup (activate venv, set paths)
Script execution

Anatomy of a Batch Script:

#!/bin/bash
#SBATCH --job-name=03_training        # Job identifier
#SBATCH --output=logs/03_training_%j.out   # Stdout (%j = job ID)
#SBATCH --error=logs/03_training_%j.err    # Stderr
#SBATCH --partition=gpu               # Which partition to use
#SBATCH --account=default             # Accounting/billing account

# Resource allocation
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
#SBATCH --gres=gpu:h100:2
#SBATCH --time=02:00:00

# Environment setup
cd ~/ml-pipeline
source venv/bin/activate

# Execute the Python script
python scripts/03_training.py

Key Points:

%j in output path → replaced with job ID (e.g., 03_training_12345.out)
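Environment must be activated in the script: sbatch passes your environment variables to the job by default, but the script runs in a new non-interactive shell on the compute node, so activating the venv explicitly keeps the job independent of the shell it was submitted from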
Exit code of the Python script becomes the job's exit code
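Because the exit code decides whether afterok dependents start, it helps if each stage script makes its exit status explicit. The following is a minimal sketch under the assumption that a stage is a plain function; `run_stage` is a hypothetical helper and is not taken from the repository. An uncaught Python exception already produces a non-zero exit code, so the wrapper mainly guarantees that the traceback reaches the job's .err log before the process exits.

```python
import traceback
from typing import Callable


def run_stage(stage: Callable[[], None]) -> int:
    """Run one pipeline stage and translate the outcome into an exit code."""
    try:
        stage()
    except Exception:
        # Printed to stderr, which Slurm writes to the file named by --error.
        traceback.print_exc()
        return 1  # non-zero: Slurm records the job as FAILED
    return 0      # zero: Slurm records COMPLETED and afterok dependents may start


# In a stage script this would typically end with:
#     sys.exit(run_stage(main))
# where main is the stage's entry point.
```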

Pipeline Stages

Stage 1: Environment Setup

Purpose: Validate environment, download and cache model/dataset.

Script: scripts/01_setup.py
Batch Job: jobs/01_setup.sbatch

What it does:

Validates CUDA/GPU availability
Installs Python dependencies (if needed)
Downloads model from HuggingFace Hub → caches locally
Downloads dataset → caches locally
Saves configuration for subsequent stages

Resource Requirements:

#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:h100:1     # Need 1 GPU to validate CUDA
#SBATCH --time=00:30:00

Why run this as a separate job?

Downloads can fail (network issues) → retry this stage on its own, before any time is spent on preprocessing or training
Model caching saves time on subsequent pipeline runs
Validates environment before committing to long training

Stage 2: Data Preprocessing

Purpose: Transform raw data into training-ready format.

Script: scripts/02_preprocessing.py
Batch Job: jobs/02_preprocessing.sbatch

What it does:

Loads raw dataset (prompt-completion pairs)
Applies tokenizer chat template
Prompt masking: Sets labels to -100 for prompt tokens (model only learns to predict responses)
Packing: Concatenates samples into fixed 1024-token blocks (avoids padding, so every position in a batch holds a real token)
Saves processed data to disk
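Prompt masking and packing are easier to follow on plain lists of token IDs. The sketch below is illustrative only: `mask_prompt` and `pack` are hypothetical helpers, the token IDs in the usage example are made up, and dropping the trailing partial block is an assumption (the repository's script may pad it or insert separator tokens between samples instead).

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss ignores by default


def mask_prompt(prompt_ids, response_ids):
    """Build input_ids and labels so that only response tokens contribute to the loss."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    return input_ids, labels


def pack(examples, block_size=1024):
    """Concatenate (input_ids, labels) pairs and cut them into full blocks.

    Assumption of this sketch: the trailing partial block is dropped.
    """
    all_ids, all_labels = [], []
    for input_ids, labels in examples:
        all_ids.extend(input_ids)
        all_labels.extend(labels)
    n_full = len(all_ids) // block_size * block_size
    return [
        {
            "input_ids": all_ids[i:i + block_size],
            "labels": all_labels[i:i + block_size],
        }
        for i in range(0, n_full, block_size)
    ]


# Tiny example with block_size=3 so the effect is visible.
ids, labels = mask_prompt([1, 2, 3], [4, 5])
blocks = pack([(ids, labels), ([6, 7], [IGNORE_INDEX, 7])], block_size=3)
print(blocks[1])  # {'input_ids': [4, 5, 6], 'labels': [4, 5, -100]}
```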

Resource Requirements:

#SBATCH --cpus-per-task=16
#SBATCH --gres=gpu:0          # No GPU needed - CPU-bound
#SBATCH --time=00:30:00

Why no GPU?

Tokenization is CPU-bound
Frees GPUs for other users while preprocessing runs

Stage 3: Model Training

Purpose: Fine-tune the model using QLoRA.

Script: scripts/03_training.py
Batch Job: jobs/03_training.sbatch

What it does:

Loads base model with 4-bit quantization (QLoRA)
Applies LoRA adapters (trainable parameters)
Loads preprocessed data
Runs distributed training across GPUs (DDP)
Saves checkpoints and final model

Resource Requirements:

#SBATCH --cpus-per-task=32
#SBATCH --gres=gpu:h100:2     # Both GPUs for DDP
#SBATCH --time=02:00:00

Training Configuration:

Batch size: 1 per GPU
Gradient accumulation: 8 steps
Effective batch size: 1 × 2 GPUs × 8 = 16
Learning rate: 2e-4
Max steps: 200
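The configuration above also fixes how much data the run consumes. The arithmetic below is a sketch that assumes "max steps" counts optimizer updates and that every training sequence is a full 1024-token packed block from the preprocessing stage.

```python
per_gpu_batch = 1
num_gpus = 2
grad_accum_steps = 8
max_steps = 200
block_size = 1024  # packed sequence length from the preprocessing stage

# Sequences contributing to each optimizer update.
effective_batch = per_gpu_batch * num_gpus * grad_accum_steps
# Sequences and tokens processed over the whole run.
sequences_seen = effective_batch * max_steps
tokens_seen = sequences_seen * block_size

print(effective_batch)  # 16
print(sequences_seen)   # 3200
print(tokens_seen)      # 3276800
```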

Stage 4: Model Evaluation

Purpose: Evaluate fine-tuned model quality.

Script: scripts/04_evaluation.py
Batch Job: jobs/04_evaluation.sbatch

What it does:

Loads base model + LoRA adapters
Generates responses for test samples
Calculates perplexity (model confidence)
Calculates BERTScore (semantic similarity)
Saves metrics and sample outputs

Resource Requirements:

#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:h100:1     # Single GPU for inference
#SBATCH --time=00:30:00

Metrics Explained:

Metric	What it Measures	Good Value
Perplexity	Model confidence (lower = better)	< 10
BERTScore F1	Semantic similarity to expected	> 0.7
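Perplexity is the exponential of the mean negative log-likelihood per token, so a value of 10 roughly means the model is as uncertain as if it were choosing uniformly among 10 tokens at each position. A minimal sketch of the calculation follows; the `perplexity` helper is hypothetical, and how the evaluation script obtains the per-token log-probabilities is not shown here.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) over the scored tokens."""
    if not token_log_probs:
        raise ValueError("need at least one scored token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every correct token
# has a perplexity of approximately 4.
uniform = [math.log(0.25)] * 10
print(perplexity(uniform))
```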

Running the Pipeline

All pipeline commands must be executed from the VM where Slurm is installed. The pipeline code is already deployed at ~/ml-pipeline/.

Full Pipeline Execution

The recommended way to run the pipeline is using the submission script that chains all jobs.

Step 1: SSH into the VM (from your local machine)

ssh ubuntu@<your-vm-ip> -i ~/.ssh/id_rsa

Step 2: Navigate to pipeline directory (on the VM)

cd ~/ml-pipeline

Step 3: Ensure environment is set up (on the VM)

# Create .env with your HuggingFace token (if not already done)
echo "HF_TOKEN=your_token_here" > .env
chmod 600 .env

# Create logs directory (if not exists)
mkdir -p logs

# Create Python virtual environment (first time only)
# python3 -m venv venv
# source venv/bin/activate
# pip install transformers datasets accelerate peft bitsandbytes bert-score torch

# Activate Python virtual environment
source venv/bin/activate
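How the stage scripts read the token from .env is not described in this guide. The sketch below assumes the file holds simple KEY=VALUE lines, as created by the echo command above; `parse_env` and `load_hf_token` are hypothetical helpers that fail early if the placeholder value was never replaced.

```python
from pathlib import Path


def parse_env(text: str) -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Tolerate optional surrounding quotes around the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env


def load_hf_token(path: str = ".env") -> str:
    """Return HF_TOKEN from the file, or raise if it is missing or a placeholder."""
    token = parse_env(Path(path).read_text()).get("HF_TOKEN", "")
    if not token or token == "your_token_here":
        raise RuntimeError("HF_TOKEN is missing or still set to the placeholder")
    return token
```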

Step 4: Run the full pipeline (on the VM)

# Make script executable (only needed once)
chmod +x full_ml_pipeline.sh

# Run the pipeline
./full_ml_pipeline.sh

Expected Output:

==============================================
SUBMITTING ML PIPELINE (DEPENDENT CHAIN)
==============================================
Submitted Job 1 (Setup): 100
Submitted Job 2 (Preprocessing): 101 (depends on 100)
Submitted Job 3 (Training): 102 (depends on 101)
Submitted Job 4 (Evaluation): 103 (depends on 102)
==============================================
Pipeline submitted successfully!
Use 'squeue' to monitor the progress.
==============================================

What happens next:

All 4 jobs are in the queue
Job 1 starts immediately
Jobs 2-4 wait with (Dependency) reason
As each job completes, the next one starts

Individual Job Execution

For debugging or re-running specific stages (all commands on the VM):

# On the VM
cd ~/ml-pipeline
source venv/bin/activate

# Run only setup
sbatch jobs/01_setup.sbatch

# Run only preprocessing (after setup completed)
sbatch jobs/02_preprocessing.sbatch

# Run only training (after preprocessing completed)
sbatch jobs/03_training.sbatch

# Run only evaluation (after training completed)
sbatch jobs/04_evaluation.sbatch

Re-running a failed stage:

# Check which job failed
sacct --format=JobID,JobName,State,ExitCode -X

# Example: training failed. Resubmit it once and capture the new job ID
JOB_ID=$(sbatch --parsable jobs/03_training.sbatch)

# Resubmit evaluation so that it waits for the new training job
sbatch --dependency=afterok:$JOB_ID jobs/04_evaluation.sbatch
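Finding the failed stage can be automated by parsing sacct output. The sketch below assumes the text comes from `sacct -X --noheader --parsable2 --format=JobID,JobName,State,ExitCode`, which prints one pipe-separated line per job; `failed_jobs` is a hypothetical helper and the sample text is illustrative, not real output.

```python
def failed_jobs(sacct_output: str) -> list[tuple[str, str, str]]:
    """Return (job_id, job_name, state) for every job that ended unsuccessfully."""
    failed = []
    for line in sacct_output.strip().splitlines():
        job_id, name, state, _exit_code = line.split("|")
        # States such as "CANCELLED by 1000" carry extra text; keep the first word.
        state = state.split()[0]
        if state not in {"COMPLETED", "RUNNING", "PENDING"}:
            failed.append((job_id, name, state))
    return failed


# Illustrative sample in the assumed format.
sample = """100|01_setup|COMPLETED|0:0
101|02_preprocessing|COMPLETED|0:0
102|03_training|FAILED|1:0
103|04_evaluation|PENDING|0:0"""
print(failed_jobs(sample))  # [('102', '03_training', 'FAILED')]
```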

Monitoring

CLI Monitoring (on the VM)

These commands are run on the VM via SSH.

Watch job queue in real-time:

# On the VM
watch -n 5 squeue -u $USER

Example output during pipeline run:

JOBID  PARTITION  NAME           STATE     TIME    NODELIST    REASON
  gpu        01_setup       COMPLETED 0:02:15 virtual-m   
  gpu        02_preprocess  RUNNING   0:05:32 virtual-m   
  gpu        03_training    PENDING   0:00    (Dependency)
  gpu        04_evaluation  PENDING   0:00    (Dependency)

View job logs while running (on the VM):

# Stream current output
tail -f ~/ml-pipeline/logs/03_training_102.out

# View completed job output
cat ~/ml-pipeline/logs/03_training_102.out

Check job history (on the VM):

# Today's jobs
sacct --format=JobID,JobName,State,Elapsed,ExitCode -X

# Specific job details
sacct -j 102 --format=JobID,JobName,State,AllocCPUS,AllocTRES,Elapsed

Grafana Dashboard (via SSH Tunnel)

Grafana runs on the VM but is not directly accessible from the internet due to firewall restrictions. Use SSH tunneling to securely access it from your local machine.

Prerequisites: Ensure monitoring services are running (on the VM)

# Check if services are running
sudo systemctl status slurm-exporter
sudo docker ps  # Should show prometheus, grafana, dcgm-exporter

# Start if needed
sudo systemctl start slurm-exporter
cd ~/monitoring && sudo docker compose up -d

Why SSH Tunneling?

The VM's ports 3000 (Grafana) and 9090 (Prometheus) are blocked by the cloud firewall
SSH tunneling forwards these ports through your existing SSH connection
Your browser connects to localhost:3000 which is tunneled to the VM's Grafana

Step 1: Create SSH tunnel (run on your LOCAL machine, not the VM)

Open a new terminal on your local machine and run:

ssh -L 3000:localhost:3000 -L 9090:localhost:9090 ubuntu@<your-vm-ip> -i ~/.ssh/id_rsa -N

Important: Keep this terminal open for the duration of your monitoring session. The -N flag means no remote command is executed - it just forwards ports.
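If the browser cannot connect, it helps to check from the local machine whether the forwarded ports are actually open. This is a small sketch using only the standard library; `port_open` is a hypothetical helper, and ports 3000 and 9090 are the ones forwarded by the ssh command above.

```python
import socket


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution errors.
        return False


# Example check on the local machine while the tunnel is running:
#     port_open("localhost", 3000)   # Grafana
#     port_open("localhost", 9090)   # Prometheus
```

A True result only shows that something accepts connections on the local port. If the service on the VM is down, the tunnel still listens locally and the failure appears later in the browser.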

Step 2: Access Grafana (on your LOCAL machine)

Open your browser and go to: http://localhost:3000

Username: admin
Password: slurm123

Step 3: View "Slurm + GPU Monitoring" dashboard

Available Panels:

Panel	Description
GPU Utilization Gauges	Real-time % usage per GPU
GPU Memory Used	Memory consumption over time
Jobs Running/Pending	Current queue status
Current Job Runtimes	Bar chart of active job durations
Job Wait Times	How long jobs waited in queue
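The GPU panels are backed by metrics that Prometheus scrapes from the DCGM exporter, and the same numbers can be read through the Prometheus HTTP API over the tunnel on port 9090. The sketch below assumes the exporter publishes the `DCGM_FI_DEV_GPU_UTIL` metric with a `gpu` label, which is the usual default but depends on how the exporter is configured; `query_prometheus` and `gpu_utilization` are hypothetical helpers.

```python
import json
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # reachable through the SSH tunnel


def query_prometheus(expr: str) -> dict:
    """Run an instant query against the Prometheus HTTP API."""
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def gpu_utilization(response: dict) -> dict[str, float]:
    """Map GPU index to utilization (%) from an instant-query response."""
    return {
        sample["metric"].get("gpu", "?"): float(sample["value"][1])
        for sample in response["data"]["result"]
    }


# With the tunnel open:
#     gpu_utilization(query_prometheus("DCGM_FI_DEV_GPU_UTIL"))
```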

What to expect during training:

GPU Utilization Timeline:
─────────────────────────────────────────────────▶ time

GPU 0: ░░░░▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░
       Setup │      Training (50-60%)     │ Eval

GPU 1: ░░░░▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓░░░░░░░░░░
       idle  │      Training (50-60%)     │ idle
