This document describes the supported and recommended method for running multiple PyTorch training processes on a single GPU, including memory partitioning, batch sizing rules, checkpointing, and safe job restart behavior under Slurm.
This approach makes better use of shared cluster resources by running multiple workloads concurrently within one GPU allocation instead of leaving large portions of GPU memory idle. By explicitly limiting per-process GPU memory and sizing batches correctly, users can increase overall throughput, reduce queue wait times, and improve cluster efficiency without risking CUDA out-of-memory (OOM) failures.
- When and why to do this
Use multiple processes on one GPU when:
- You have many small / medium models
- Each model does not need the full GPU
- You want better GPU utilization
- You run under Slurm with 1 GPU allocation
Do NOT do this for:
- Very large models
- DistributedDataParallel (DDP)
- Latency-critical inference
- Core principles (read this first)
GPU memory is NOT partitioned automatically
- Slurm allocates whole GPUs, not slices of GPU memory
- Without explicit limits, a single process can consume all GPU memory and OOM every other process on the device
PyTorch memory fraction is mandatory
To safely share one GPU:
torch.cuda.set_per_process_memory_fraction(fraction, device)
This enforces a hard upper bound on the memory each process can allocate through PyTorch's caching allocator: allocations beyond the cap fail inside that process instead of exhausting the whole GPU.
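A minimal sketch of what the cap means in bytes; the three-way split and the 0.33 fraction are illustrative values, not fixed settings:

```python
import torch

# Illustrative values: three processes sharing one GPU, each capped at ~1/3 of VRAM.
fraction = 0.33
device = 0

total_bytes = torch.cuda.get_device_properties(device).total_memory
print(f"Per-process cap: {fraction * total_bytes / 1024**3:.1f} GiB "
      f"of {total_bytes / 1024**3:.1f} GiB")

# Allocations beyond this cap raise an OOM in this process only.
torch.cuda.set_per_process_memory_fraction(fraction, device)
```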
- Recommended architecture
Slurm job (1 GPU)
└── Python launcher
├── Train process #1 (≈ ⅓ VRAM)
├── Train process #2 (≈ ⅓ VRAM)
└── Train process #3 (≈ ⅓ VRAM)
Required settings
- multiprocessing start method: spawn
- CUDA_VISIBLE_DEVICES=0
- Explicit batch sizing
- Minimal PyTorch worker template
import torch

torch.cuda.set_device(0)
torch.cuda.set_per_process_memory_fraction(0.33, 0)
⚠️ Must be called before any CUDA tensor is created.
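A runnable sketch of the launcher-plus-workers layout described above, assuming a placeholder train_worker body; the 0.33 fraction and three processes mirror the template and architecture diagram and are illustrative:

```python
import os
import torch
import torch.multiprocessing as mp


def train_worker(rank: int, mem_fraction: float) -> None:
    # Each worker claims the single visible GPU and caps its own allocator
    # before creating any CUDA tensors.
    torch.cuda.set_device(0)
    torch.cuda.set_per_process_memory_fraction(mem_fraction, 0)

    # ... build model, optimizer, dataloader and run the training loop here ...
    print(f"worker {rank}: capped at {mem_fraction:.0%} of GPU 0")


if __name__ == "__main__":
    os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")  # the single GPU Slurm granted

    mp.set_start_method("spawn", force=True)  # required: do not fork a CUDA-initialized parent

    procs = [mp.Process(target=train_worker, args=(rank, 0.33)) for rank in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

The spawn start method gives every worker a fresh interpreter, so each one initializes its own CUDA context instead of inheriting a forked one, which CUDA does not support.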
- Batch size sizing rule (most important part)
The memory formula
For training, GPU memory usage per process is approximately:
GPU_mem_per_process ≈ Model_params + Gradients + Optimizer_state + Activations(batch_size) + CUDA workspace
Rule-of-thumb formula
Let:
- M_gpu = total GPU memory (GiB)
- N_proc = number of processes
- F = safety factor (0.85 recommended)
- M_static = model + optimizer memory (GiB)
- M_sample = activation memory per sample (GiB)
Then:
Available_per_process = (M_gpu / N_proc) × F
Max_batch_size ≈ (Available_per_process − M_static) / M_sample
- Practical example (realistic numbers)
GPU: 48 GiB
Processes: 3
Safety factor: 0.85
Available_per_process ≈ (48 / 3) × 0.85 ≈ 13.6 GiB
Model:
- Parameters + optimizer: 4.0 GiB
- Activation memory per sample: ≈ 40 MiB ≈ 0.04 GiB
Max_batch ≈ (13.6 − 4.0) / 0.04 ≈ 240 samples
👉 Start with batch_size = 192 (about 80% of the estimate)
👉 Increase gradually if needed
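The same arithmetic as a small helper; max_batch_size is a name chosen here for illustration, not a library call, and the inputs reproduce the worked example above:

```python
def max_batch_size(m_gpu: float, n_proc: int, m_static: float,
                   m_sample: float, safety: float = 0.85) -> int:
    """Estimate the largest safe batch size per process; all sizes in GiB."""
    available = (m_gpu / n_proc) * safety
    return int((available - m_static) / m_sample)


# Worked example: 48 GiB GPU, 3 processes, 4.0 GiB static, 0.04 GiB per sample.
estimate = max_batch_size(m_gpu=48, n_proc=3, m_static=4.0, m_sample=0.04)
print(estimate)              # ≈ 240
print(int(estimate * 0.8))   # conservative starting point ≈ 192
```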
- How to measure M_sample correctly
Use a probe batch:
# model and x are assumed to already be on the GPU
torch.cuda.reset_peak_memory_stats()
loss = model(x[:probe_bs]).sum()
loss.backward()
peak = torch.cuda.max_memory_allocated() / 1024**3   # peak usage in GiB
Then:
M_sample ≈ (peak − M_static) / probe_bs
This gives accurate per-sample cost.
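A self-contained version of the probe, with a toy model, synthetic data, and probe_bs = 8 standing in for the real training setup (all assumptions); the warm-up step makes gradients and optimizer state resident before M_static is read:

```python
import torch
import torch.nn as nn

probe_bs = 8                                   # small probe batch (assumption)
device = torch.device("cuda", 0)

# Toy model + optimizer standing in for the real training setup.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters())

# Warm-up step so gradients and optimizer state are materialized before measuring.
x = torch.randn(probe_bs, 1024, device=device)
model(x).sum().backward()
optimizer.step()
optimizer.zero_grad(set_to_none=False)         # keep gradient buffers allocated

# M_static: parameters + gradients + optimizer state currently resident (GiB).
m_static = torch.cuda.memory_allocated(device) / 1024**3

# Peak during one forward/backward pass on the probe batch (GiB).
torch.cuda.reset_peak_memory_stats(device)
loss = model(x).sum()
loss.backward()
peak = torch.cuda.max_memory_allocated(device) / 1024**3

m_sample = (peak - m_static) / probe_bs
print(f"M_static ≈ {m_static:.3f} GiB, M_sample ≈ {m_sample * 1024:.2f} MiB per sample")
```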
- Common causes of OOM (even with memory fraction)
| Cause | Fix |
| --- | --- |
| Batch size too large | Reduce the batch size |
| Activation checkpointing disabled | Enable activation checkpointing |
| Memory fragmentation | Set max_split_size_mb via PYTORCH_CUDA_ALLOC_CONF |
| CUDA initialized before spawn | Create CUDA tensors only inside the worker processes |
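For the fragmentation row, one way to apply the allocator setting is via the environment before the first CUDA allocation in each worker; the 128 MiB value below is only an illustrative starting point:

```python
import os

# Must be set before the first CUDA allocation in the process,
# e.g. at the top of the worker function; tune the value per workload.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")
```

Alternatively, export the variable in the Slurm batch script so every spawned worker inherits it.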
