MULTIPLE TRAINING PROCESSES ON A SINGLE GPU

If you measure or estimate that your training workload occupies less than 40% of the available GPU memory, you can safely run several independent training processes on a single GPU.
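
A quick way to check this (a sketch; it reports the peak memory reserved by PyTorch for your process, which slightly understates the true footprint because the CUDA context is not included):

    import torch

    torch.cuda.reset_peak_memory_stats()
    # ... run a few representative training steps here ...
    peak = torch.cuda.max_memory_reserved()
    total = torch.cuda.get_device_properties(0).total_memory
    print(f"Peak GPU memory: {peak / total:.0%} of the device")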

This approach improves utilization of shared cluster resources by running multiple workloads concurrently within one GPU allocation instead of leaving large portions of GPU memory idle. By explicitly capping per-process GPU memory and sizing batches accordingly, you can increase overall throughput, reduce queue wait times, and improve cluster efficiency without risking CUDA out-of-memory (OOM) failures.

This document describes the supported and recommended method for running multiple PyTorch training processes on a single GPU, including memory partitioning, batch sizing rules, checkpointing, and safe job restart behavior under Slurm.

  1. When and why to do this

Use multiple processes on one GPU when:

  • You have many small / medium models
  • Each model does not need the full GPU
  • You want better GPU utilization
  • You run under Slurm with 1 GPU allocation

Do NOT do this for:

  • Very large models
  • DistributedDataParallel (DDP)
  • Latency-critical inference

  2. Core principles (read this first)

GPU memory is NOT shared automatically

  • Slurm allocates GPUs, not memory
  • CUDA allows one process to OOM the entire GPU

PyTorch memory fraction is mandatory

To safely share one GPU:

torch.cuda.set_per_process_memory_fraction(fraction, device)

This enforces a hard upper bound per process.
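
A small sketch of what the cap means in practice (the 0.25 fraction and the 30% allocation are arbitrary illustration values):

    import torch

    torch.cuda.set_per_process_memory_fraction(0.25, 0)   # cap this process at ~25% of VRAM
    total_bytes = torch.cuda.get_device_properties(0).total_memory

    try:
        # Attempt a single allocation of ~30% of the device (float32 = 4 bytes per element).
        x = torch.empty(int(0.30 * total_bytes / 4), dtype=torch.float32, device="cuda:0")
    except torch.cuda.OutOfMemoryError:
        # Raised even though the rest of the GPU is free: the fraction is a hard bound.
        print("Allocation above the per-process fraction was rejected.")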

  3. Recommended architecture

Slurm job (1 GPU)
└── Python launcher
    ├── Train process #1 (≈ ⅓ VRAM)
    ├── Train process #2 (≈ ⅓ VRAM)
    └── Train process #3 (≈ ⅓ VRAM)

Required settings

  • multiprocessing start method: spawn
  • CUDA_VISIBLE_DEVICES=0
  • Explicit batch sizing
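
A minimal launcher sketch under these settings (the function name train_worker and the per-process fraction are illustrative; the worker body is shown in the next section):

    import torch.multiprocessing as mp

    def train_worker(rank: int, mem_fraction: float) -> None:
        import torch  # touch CUDA only inside the child process
        torch.cuda.set_device(0)
        torch.cuda.set_per_process_memory_fraction(mem_fraction, 0)
        # ... build the model, data loader, and training loop here ...

    if __name__ == "__main__":
        n_proc = 3
        mp.set_start_method("spawn")   # required: CUDA contexts are not fork-safe
        procs = [mp.Process(target=train_worker, args=(rank, 1.0 / n_proc))
                 for rank in range(n_proc)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()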

  4. Minimal PyTorch worker template

    import torch

    torch.cuda.set_device(0)
    torch.cuda.set_per_process_memory_fraction(0.33, 0)   # ≈ ⅓ of VRAM for 3 processes

⚠️ Both calls must run before any CUDA tensor is created in the process.
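
A slightly fuller worker sketch built on this template (the toy model, random data, and hyperparameters are placeholders for illustration only):

    import torch
    from torch import nn

    def train_worker(rank: int, mem_fraction: float) -> None:
        # Cap this process's share of VRAM before any CUDA allocation happens.
        torch.cuda.set_device(0)
        torch.cuda.set_per_process_memory_fraction(mem_fraction, 0)

        device = torch.device("cuda:0")
        model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10)).to(device)
        optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

        for step in range(100):                       # toy loop on random data
            x = torch.randn(64, 512, device=device)
            y = torch.randint(0, 10, (64,), device=device)
            loss = nn.functional.cross_entropy(model(x), y)
            optimizer.zero_grad(set_to_none=True)
            loss.backward()
            optimizer.step()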

  5. Batch sizing rule (most important part)

The memory formula

For training, GPU memory usage per process is approximately:

GPU_mem_per_process ≈ Model_params + Gradients + Optimizer_state + Activations(batch_size) + CUDA_workspace

Rule-of-thumb formula

Let:

  • M_gpu = total GPU memory (GiB)
  • N_proc = number of processes
  • F = safety factor (0.85 recommended)
  • M_static = model + optimizer memory (GiB)
  • M_sample = activation memory per sample (GiB)

Then:

Available_per_process = (M_gpu / N_proc) × F

 

Max_batch_size ≈ (Available_per_process − M_static) / M_sample
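
The same rule of thumb as a small helper (a sketch; the function name is illustrative and all quantities are in GiB, as defined above):

    def max_batch_size(m_gpu: float, n_proc: int, m_static: float,
                       m_sample: float, safety: float = 0.85) -> int:
        # Split the GPU evenly, keep a safety margin, subtract the static cost,
        # and see how many samples' worth of activations fit in the remainder.
        available_per_process = (m_gpu / n_proc) * safety
        return int((available_per_process - m_static) / m_sample)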

  6. Practical example (realistic numbers)

GPU: 48 GiB
Processes: 3
Safety factor: 0.85

Available_per_process ≈ (48 / 3) × 0.85 ≈ 13.6 GiB

Model:

  • Parameters + optimizer: 4.0 GiB
  • Activation per sample: 40 MB = 0.04 GiB

Max_batch ≈ (13.6 − 4.0) / 0.04 ≈ 240 samples

👉 Start below the computed maximum, e.g. batch_size = 192 (≈ 80% of 240)
👉 Increase gradually while monitoring peak memory
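
The same arithmetic, spelled out with the numbers from this example:

    available = (48 / 3) * 0.85            # 13.6 GiB per process
    max_batch = (available - 4.0) / 0.04   # ≈ 240 samples
    print(round(max_batch))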

  7. How to measure M_sample correctly

Use a probe batch:

    import torch

    torch.cuda.reset_peak_memory_stats()
    loss = model(x[:probe_bs]).sum()       # forward pass on a small probe batch
    loss.backward()                        # backward pass allocates gradients too
    peak = torch.cuda.max_memory_allocated() / 1024**3   # peak in GiB

Then:

M_sample ≈ (peak − M_static) / probe_bs

This gives an accurate per-sample activation cost.
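
M_static itself can be read from the allocator just before the probe pass (a sketch; it assumes the model and optimizer already live on the GPU and that one training step has run, so gradients and optimizer state are materialized):

    import torch

    # Memory currently held by parameters, gradients, and optimizer state, in GiB.
    M_static = torch.cuda.memory_allocated() / 1024**3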

  8. Common causes of OOM (even with memory fraction)

Cause                           Fix
Batch too large                 Reduce the batch size
Activation checkpointing off    Enable activation checkpointing
Memory fragmentation            Set max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF
CUDA initialized before spawn   Initialize CUDA only inside the spawned workers
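
Two of these fixes as a sketch (the allocator setting must be in place before the first CUDA allocation; the wrapper function below is illustrative):

    import os

    # Reduce fragmentation by limiting how large cached blocks may be split.
    os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

    import torch
    from torch.utils.checkpoint import checkpoint

    def forward_with_checkpointing(block: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
        # Recompute this block's activations during backward instead of storing them.
        return checkpoint(block, x, use_reentrant=False)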