GIPDEEP – How to launch jobs

Slurm has two commands for launching jobs: srun and sbatch.
srun is straightforward: srun <arguments> <program>
You can use various arguments to control the resource allocation of your job, as well as other settings your job requires.
You can find the full list of available arguments here. Some of the most common arguments are:
-c # – allocate # CPUs per task.
--gres=gpu:# – allocate # GPUs.
--gres=gpu:<type>:# – allocate # GPUs of type <type>.
--mpi=<mpi_type> – select the type of MPI to use. The default, when none is explicitly specified, is pmi2.
--pty – run the job in pseudo-terminal mode. Useful when running bash.
-p <partition> – run job on the selected partition instead of the default one.
-w <node> – run job on a specific node.

Examples:
1. To get a shell with two GPUs, run:
srun -c 4 --gres=gpu:2 --pty bash
Run ‘nvidia-smi’ to verify the job received two GPUs.
2. Run the script ‘script.py’ using python3:
srun python3 script.py

sbatch lets you use a Slurm script to run jobs.
Examples:


#!/bin/bash
#SBATCH -c 8                      # number of cores (threads)
#SBATCH --gres=gpu:2080ti:1       # request 1 GPU of type 2080ti
#SBATCH --mail-user=[user]@cs.technion.ac.il
#SBATCH --mail-type=ALL           # Valid values are NONE, BEGIN, END, FAIL, REQUEUE, ALL
#SBATCH --job-name="JobName"
#SBATCH -o ./out_job%j.txt        # stdout goes to out_job<jobid>.txt (%j is the job ID)
#SBATCH -e ./err_job%j.txt        # stderr goes to err_job<jobid>.txt
module purge                      # clean active modules list
module load matlab/R2023a         # activate matlab 2023 module
conda activate [your miniconda environment]
python main.py                    # run your script

To specify several GPU types, list them in the “--gres” flag:


#SBATCH --gres={gpu:A40:1,gpu:L40:1}

You can find more information here.
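
The page shows what goes into a Slurm script but not the submit step itself, so here is a minimal sketch of it. The file name job.sh, the resource values and the echo payload are placeholders chosen for illustration, not site defaults. The sbatch call is guarded so the snippet does nothing on a machine without Slurm.

```shell
#!/bin/bash
# Hypothetical file name; pick whatever suits your project.
job_script="job.sh"

# Write a minimal Slurm script. Resource values are illustrative only.
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH -c 2
#SBATCH --gres=gpu:1
#SBATCH --job-name="demo"
#SBATCH -o ./out_job%j.txt
#SBATCH -e ./err_job%j.txt
echo "Running on $(hostname)"
EOF

# Check the script for shell syntax errors before submitting.
bash -n "$job_script"

# Submit only where Slurm is installed.
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$job_script"      # prints "Submitted batch job <jobid>"
    squeue -u "$USER"         # list your pending and running jobs
else
    echo "sbatch not found; skipping submission"
fi
```

Once a job is submitted, scancel <jobid> cancels it, and the out_job and err_job files appear in the directory the job was submitted from.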

Selecting specific resources

It is possible to select specific nodes and GPU types when launching jobs:

Nodes:
Use the argument -w to select a specific node to run on. You can also specify a list of nodes. For example:
srun -w gipdeep4,gipdeep5 …

Get the list of available nodes and their state using the command sinfo -N.

Run the command snode to list the number of allocated, available and total CPUs and GPUs for every node in the cluster.
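
To pick a node for -w without reading the whole sinfo table, the output can be filtered. The sketch below is one possible approach, not a site-provided tool: idle_nodes is a hypothetical helper name, and it assumes the input has the form "<node> <state>", which is what sinfo prints with the format string "%N %t".

```shell
#!/bin/bash
# Read "<node> <state>" lines on stdin and print the names of idle nodes.
# sinfo -N lists a node once per partition it belongs to, so duplicates
# are removed with sort -u.
idle_nodes() {
    awk '$2 == "idle" { print $1 }' | sort -u
}

# On the cluster: -h drops the header, -o "%N %t" prints the node name
# and its state in compact form.
if command -v sinfo >/dev/null 2>&1; then
    sinfo -N -h -o "%N %t" | idle_nodes
fi
```

Any name it prints can be passed to srun -w. Nodes in the "mix" state are only partly allocated and may also have free resources, so an empty result does not mean the cluster is full; snode gives the per-node counts.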

GPU type:
You can select a specific GPU type using the --gres argument. For example:
srun --gres=gpu:titanx:2 …

List of GPU types and their codenames:

TeslaP100 – Tesla P100-PCIE-12GB
1080 – GeForce GTX 1080 8GB
1080ti – GeForce GTX 1080 Ti 11GB
titanx – GeForce GTX TITAN X 12GB
2080ti – GeForce RTX 2080 Ti 11GB
3090 – GeForce RTX 3090 24GB
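
A mistyped codename is an easy mistake to make, so a small wrapper can check it before the job is submitted. The following is a sketch under stated assumptions: gpu_gres is a hypothetical helper, and it accepts only the codenames listed above, which may not be exhaustive (the sbatch example on this page uses A40 and L40, which are not in the list).

```shell
#!/bin/bash
# Print a --gres argument for a GPU codename from the list above.
# Usage: gpu_gres <codename> [count]   (count defaults to 1)
gpu_gres() {
    local type="$1" count="${2:-1}"
    # Accept only the codenames documented on this page.
    case "$type" in
        TeslaP100|1080|1080ti|titanx|2080ti|3090) ;;
        *) echo "unknown GPU type: $type" >&2; return 1 ;;
    esac
    # Reject empty, zero and non-numeric counts.
    case "$count" in
        ''|0|*[!0-9]*) echo "count must be a positive integer: $count" >&2; return 1 ;;
    esac
    printf -- '--gres=gpu:%s:%s\n' "$type" "$count"
}
```

For example, srun $(gpu_gres 2080ti 1) --pty bash expands to srun --gres=gpu:2080ti:1 --pty bash.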

You can check current GPU usage across all nodes using the command sgpu.
Use sgpu_stat <node> to view GPU usage for a specific node, e.g. sgpu_stat gipdeep1.
sgpu shows the processes running on GPUs across all nodes. A GPU with no running process is not necessarily available for scheduling: a job can allocate a GPU without using it.

You can check current RAM and Swap utilization using the command smem.
Use smem_stat <node> to view RAM and swap usage for a specific node, e.g. smem_stat gipdeep1.