A partition is a logical group of physical nodes. A node can be part of more than one partition. A Slurm job launches on a specific partition, indicated using the -p argument. If no partition is specified, the job launches on the default partition.
Accounts are groups of users. On Newton, every user belongs to one (and only one) account, which is selected by default. The account a user belongs to determines which partitions they can launch their jobs on. You can check your account by logging into Newton and running the command:
sacctmgr show user $USER withassoc format=user,account
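The output will look roughly like the following (the user and account names here are illustrative):
      User    Account
---------- ----------
  youruser   research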
Quality of Service (QoS) is another grouping entity; it feeds into the Fair Share queueing system and is used to define resource limits.
We use the term Golden Ticket to describe priority over resources for a specific group of users on specific nodes, and we use Slurm’s partition system to define golden tickets.
Preemption is the process of deferring a job in favor of a job with a higher priority over the resources. The preemption method Newton uses is Requeue –
#SBATCH --requeue
meaning preempted jobs will be returned to the queue instead of being canceled or paused.
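As a sketch, a minimal batch script that makes the requeue behaviour explicit could look like this (the job name, resource values and the final command are placeholders):
#!/bin/bash
#SBATCH --job-name=example_job   # placeholder name
#SBATCH -c 2                     # 2 CPU threads
#SBATCH --gres=gpu:1             # 1 GPU
#SBATCH --requeue                # if preempted, return this job to the queue
python train.py                  # placeholder workload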
Partitions:
all: This partition includes all the nodes in the cluster. Every user can launch jobs on it, and it is selected by default. Jobs on this partition can be preempted.
cath: This partition has the same priority as the public partition. Nodes: gipdeep[1-4,6]
cactus (2): private partition, only usable by account cactus (2). Highest priority. Node: gipdeep8
dekel (2): private partition, only usable by account dekel (2). Highest priority. Node: gipdeep9
gipmed (2): private partition, only usable by account gipmed (2). Highest priority. Node: gipdeep10
brosh: private partition, only usable by account brosh. Highest priority. Node: gipdeep11
Accounts:
research – public account. Can only launch jobs on the “all” partition.
cathalert – Account for the cathalert research group, for the gipdeep[1-4,6] servers.
cactus – Private account. Can launch jobs on the server gipdeep8 with high priority for this server.
cactus2 – Private account. Can launch jobs on the server gipdeep8 with highest priority for this server.
dekel – Private account. Can launch jobs on the server gipdeep9 with high priority for this server.
dekel2 – Private account. Can launch jobs on the server gipdeep9 with highest priority for this server.
gipmed – Private account. Can launch jobs on the server gipdeep10 with high priority for this server.
gipmed2 – Private account. Can launch jobs on the server gipdeep10 with highest priority for this server.
brosh – Private account. Can launch jobs on the server gipdeep11 with highest priority for this server.
Generally, research is the default account, all is the default partition, and normal is the default QoS, so you don’t need to include those when running jobs. Only specific research groups have separate accounts and partitions.
The bottom line(s):
Selecting specific partition:
If you’re on the research account, don’t specify a partition (all will be selected by default). Your job will enter the queue and be assigned resources when they become available. If your job has returned to the queue, it means that a higher-priority job required the resources your job had been allocated; your job has been requeued and will continue running when resources become available.
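For example, a research-account user can request an interactive session with just the resource arguments (the values here are illustrative):
srun -c 2 --gres=gpu:1 --pty bash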
If you’re on a private account, indicate the partition and account you want to run on using the -p (partition) and -A (account) arguments. Running on your private partition will give you priority over jobs from the public (research) account. Indicating the public partition (the default) will queue your job for any of the nodes, but the job could be preempted if a higher-priority job enters the queue.
Example run for cathalert account:
srun -p cath -A cathalert -c 2 --gres=gpu:1 --pty bash
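A non-interactive batch submission for the same account could look like this (the script name is a placeholder):
sbatch -p cath -A cathalert -c 2 --gres=gpu:1 my_script.sh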
Selecting specific nodes:
It is possible to specify nodes for a job. This is done using the -w argument and a comma-separated list of nodes. The node list must be a subset of the partition’s nodes (i.e. you can’t choose TeslaP100 when working on the cactus partition, for example).
Use case for choosing a specific node:
Each job must be assigned at least one CPU (a CPU, in this case, is a processor thread). When using GPUs, the CPU-GPU affinity must be considered. When all of a node’s GPUs are in use, sometimes not all of its CPUs are. If your job requires only CPUs and no GPUs, it’s worth asking for such a node – this could reduce the possibility that your job will be preempted.
Use the command snode to view the current resource usage of each node in the cluster.
Choose a node using -w <node>
e.g. srun -w gipdeep4 …
# or several machines
srun -w gipdeep[1,2],gipdeep6 ...
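For the CPU-only use case described above, a sketch of such a request might be (node name and thread count are illustrative):
srun -w gipdeep4 -c 8 --pty bash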
CPU-GPU affinity in a nutshell:
GPUs communicate with CPUs using the PCIe bus. On a single socket system, the CPU controls all the PCIe lanes. On a dual-socket system (Newton’s nodes are all dual-socket), each CPU controls (usually) half of the total PCIe lanes. Usually, CPU0 controls GPU0-3 while CPU1 controls GPU4-7 (on an 8 GPU system).
There may be situations where a node has free (unallocated) GPUs but not enough (or no) free CPU threads to fulfill a job’s requirements, giving the appearance that a job is waiting for resources that seem to be available.
Consider this case: a node with 2 CPUs, each with 20 threads, and 8 GPUs. The node runs 2 jobs, each requiring 12 threads and 2 GPUs. The first job is allocated threads 0-11 on CPU0 along with GPUs 0 and 1. For the second job, GPUs 2 and 3 are available, but CPU0 does not have enough free threads (only threads 12-19 are free). The job is therefore allocated threads 0-11 on CPU1 and GPUs 4 and 5 (remember, CPU1 controls GPUs 4-7). At this point, a third job is launched which requires the same amount of resources: 12 threads and 2 GPUs. Overall, the server has 16 free threads (8 on each CPU) and 4 GPUs (2 on each CPU), but no single CPU has enough free threads to fulfill the job’s resource requirements, forcing the job to stay in the queue until one of the two original jobs finishes.
Situations like these are one of the reasons that several smaller jobs are preferable to one big job – it allows for better resource allocation.
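As a sketch of this idea using the numbers from the example above, two submissions of 12 threads and 2 GPUs each (script names are placeholders):
sbatch -c 12 --gres=gpu:2 job_a.sh
sbatch -c 12 --gres=gpu:2 job_b.sh
are easier to place on a fragmented node than a single job requesting 24 threads and 4 GPUs.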
Job and resource limits
Users in the research account are limited to 4 GPUs and 40 CPUs, with a maximum run time of 10 days per job and no limit on the number of jobs.
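To inspect the limits attached to a QoS, a query along these lines can be used (the exact field names may differ between Slurm versions):
sacctmgr show qos format=name,maxwall,maxtrespu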