This document describes the supported and recommended method for running multiple PyTorch training processes on a single GPU, including memory partitioning, batch sizing rules, checkpointing, and safe job restart behavior under Slurm.
This approach makes better use of shared cluster resources by running multiple workloads concurrently within one GPU allocation instead of leaving large portions of GPU memory idle. By explicitly limiting per-process GPU memory and sizing batches correctly, users can increase overall throughput, reduce queue wait times, and improve cluster efficiency without risking CUDA out-of-memory (OOM) failures.
- When and why to do this
Use multiple processes on one GPU when:
- You have many small / medium models
- Each model does not need the full GPU
- You want better GPU utilization
- You run under Slurm with 1 GPU allocation
Do NOT do this for:
- Very large models
- DistributedDataParallel (DDP)
- Latency-critical inference
- Core principles (read this first)
GPU memory is NOT partitioned automatically
- Slurm allocates whole GPUs, not slices of GPU memory
- Without explicit limits, a single process can consume all GPU memory and OOM every other process on the device
PyTorch memory fraction is mandatory
To safely share one GPU:
torch.cuda.set_per_process_memory_fraction(fraction, device)
This enforces a hard upper bound on the memory each process can allocate through PyTorch's caching allocator: allocations beyond the cap fail inside that process instead of exhausting the whole GPU.
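A minimal sketch of what the cap means in bytes; the three-way split and the 0.33 fraction are illustrative values, not fixed settings:

```python
import torch

# Illustrative values: three processes sharing one GPU, each capped at ~1/3 of VRAM.
fraction = 0.33
device = 0

total_bytes = torch.cuda.get_device_properties(device).total_memory
print(f"Per-process cap: {fraction * total_bytes / 1024**3:.1f} GiB "
      f"of {total_bytes / 1024**3:.1f} GiB")

# Allocations beyond this cap raise an OOM in this process only.
torch.cuda.set_per_process_memory_fraction(fraction, device)
```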
- Recommended architecture
Slurm job (1 GPU)
└── Python launcher
├── Train process #1 (≈ ⅓ VRAM)
├── Train process #2 (≈ ⅓ VRAM)
└── Train process #3 (≈ ⅓ VRAM)
Required settings
- multiprocessing start method: spawn
- CUDA_VISIBLE_DEVICES=0
- Explicit batch sizing
- Minimal PyTorch worker template
import torch

torch.cuda.set_device(0)
torch.cuda.set_per_process_memory_fraction(0.33, 0)
⚠️ Must be called before any CUDA tensor is created.
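A runnable sketch of the launcher-plus-workers layout described above, assuming a placeholder train_worker body; the 0.33 fraction and three processes mirror the template and architecture diagram and are illustrative:

```python
import os
import torch
import torch.multiprocessing as mp


def train_worker(rank: int, mem_fraction: float) -> None:
    # Each worker claims the single visible GPU and caps its own allocator
    # before creating any CUDA tensors.
    torch.cuda.set_device(0)
    torch.cuda.set_per_process_memory_fraction(mem_fraction, 0)

    # ... build model, optimizer, dataloader and run the training loop here ...
    print(f"worker {rank}: capped at {mem_fraction:.0%} of GPU 0")


if __name__ == "__main__":
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")  # the single GPU Slurm granted

    mp.set_start_method("spawn", force=True)  # required: do not fork a CUDA-initialized parent

    procs = [mp.Process(target=train_worker, args=(rank, 0.33)) for rank in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

The spawn start method gives every worker a fresh interpreter, so each one initializes its own CUDA context instead of inheriting a forked one, which CUDA does not support.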
- Batch size sizing rule (most important part)
The memory formula
For training, GPU memory usage per process is approximately:
GPU_mem_per_process ≈ Model_params + Gradients + Optimizer_state + Activations(batch_size) + CUDA workspace
Rule-of-thumb formula
Let:
- M_gpu = total GPU memory (GiB)
- N_proc = number of processes
- F = safety factor (0.85 recommended)
- M_static = model + optimizer memory (GiB)
- M_sample = activation memory per sample (GiB)
Then:
Available_per_process = (M_gpu / N_proc) × F
Max_batch_size ≈ (Available_per_process − M_static) / M_sample
- Practical example (realistic numbers)
GPU: 48 GiB
Processes: 3
Safety factor: 0.85
Available_per_process ≈ (48 / 3) × 0.85 ≈ 13.6 GiB
Model:
- Parameters + optimizer: 4.0 GiB
- Activation memory per sample: ≈ 40 MiB ≈ 0.04 GiB
Max_batch ≈ (13.6 − 4.0) / 0.04 ≈ 240 samples
👉 Start with batch_size = 192 (about 80% of the estimate)
👉 Increase gradually if needed
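The same arithmetic as a small helper; max_batch_size is a name chosen here for illustration, not a library call, and the inputs reproduce the worked example above:

```python
def max_batch_size(m_gpu: float, n_proc: int, m_static: float,
                   m_sample: float, safety: float = 0.85) -> int:
    """Estimate the largest safe batch size per process; all sizes in GiB."""
    available = (m_gpu / n_proc) * safety
    return int((available - m_static) / m_sample)


# Worked example: 48 GiB GPU, 3 processes, 4.0 GiB static, 0.04 GiB per sample.
estimate = max_batch_size(m_gpu=48, n_proc=3, m_static=4.0, m_sample=0.04)
print(estimate)              # ≈ 240
print(int(estimate * 0.8))   # conservative starting point ≈ 192
```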
- How to measure M_sample correctly
Use a probe batch:
# model and x are assumed to already be on the GPU
torch.cuda.reset_peak_memory_stats()
loss = model(x[:probe_bs]).sum()
loss.backward()
peak = torch.cuda.max_memory_allocated() / 1024**3   # peak usage in GiB
Then:
M_sample ≈ (peak − M_static) / probe_bs
This gives accurate per-sample cost.
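A self-contained version of the probe, with a toy model, synthetic data, and probe_bs = 8 standing in for the real training setup (all assumptions); the warm-up step makes gradients and optimizer state resident before M_static is read:

```python
import torch
import torch.nn as nn

probe_bs = 8                                   # small probe batch (assumption)
device = torch.device("cuda", 0)

# Toy model + optimizer standing in for the real training setup.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters())

# Warm-up step so gradients and optimizer state are materialized before measuring.
x = torch.randn(probe_bs, 1024, device=device)
model(x).sum().backward()
optimizer.step()
optimizer.zero_grad(set_to_none=False)         # keep gradient buffers allocated

# M_static: parameters + gradients + optimizer state currently resident (GiB).
m_static = torch.cuda.memory_allocated(device) / 1024**3

# Peak during one forward/backward pass on the probe batch (GiB).
torch.cuda.reset_peak_memory_stats(device)
loss = model(x).sum()
loss.backward()
peak = torch.cuda.max_memory_allocated(device) / 1024**3

m_sample = (peak - m_static) / probe_bs
print(f"M_static ≈ {m_static:.3f} GiB, M_sample ≈ {m_sample * 1024:.2f} MiB per sample")
```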
- Common causes of OOM (even with memory fraction)
| Cause | Fix |
| --- | --- |
| Batch size too large | Reduce the batch size |
| Activation checkpointing disabled | Enable activation checkpointing |
| Memory fragmentation | Set max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF |
| CUDA initialized before spawn | Create CUDA tensors only inside the worker processes |
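For the fragmentation row, one way to apply the allocator setting is via the environment before the first CUDA allocation in each worker; the 128 MiB value below is only an illustrative starting point:

```python
import os

# Must be set before the first CUDA allocation in the process,
# e.g. at the top of the worker function; tune the value per workload.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```

Alternatively, export the variable in the Slurm batch script so every spawned worker inherits it.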
