NEWTON – Partitions, accounts and QoS

A partition is a logical group of physical nodes. A node can be part of more than one partition. A Slurm job launches on a specific partition, indicated using the -p argument. If no partition is specified, the job launches on the default partition.
Accounts are groups of users. On Newton, every user belongs to one (and only one) account, which is selected by default. The account a user belongs to defines which partitions they can launch their jobs on. You can check your account by logging into Newton and running the command:

sacctmgr show user $USER withassoc format=user,account

Quality of Service (QoS) is another control group entity. It supports the Fair Share queueing system and can define resource limits. Newton currently has only one QoS for all users, and it imposes no resource limits. You can ignore QoS at this time.

Newton’s nodes belong to staff members and research groups in the faculty. As such, researchers always receive priority when running jobs on their own servers. Use of other resources, including by users who don’t belong to any research group, is subject to availability.
The rationale behind creating Newton as a community cluster is to maximize resource utilization while guaranteeing ‘private’ resource availability and providing computational resources to researchers who don’t have their own.

We use the term Golden Ticket to describe priority over resources for a specific group of users on specific nodes, and we use Slurm’s partition system to define golden tickets.

Preemption is the process by which a job with higher priority over resources takes them from a running lower-priority job, which is deferred. The preemption method Newton uses is Requeue:

#SBATCH --requeue

This means preempted jobs are returned to the queue instead of being canceled or paused.
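
Since a requeued job starts its batch script again from the top, long jobs benefit from recording their progress. The following is a minimal sketch of a requeue-aware batch script, not a Newton-specific recipe: the progress file name, the step count and the work loop are hypothetical placeholders, while SLURM_RESTART_COUNT is the variable Slurm sets once a job has been restarted.

```shell
#!/bin/bash
#SBATCH --requeue            # let Slurm return this job to the queue when preempted
#SBATCH -c 2                 # CPU threads; adjust to your job

# Hypothetical progress file and step count; replace with your own.
PROGRESS_FILE="${PROGRESS_FILE:-progress.txt}"
TOTAL_STEPS="${TOTAL_STEPS:-5}"

# Print the step to resume from: 0 on a fresh start, otherwise the saved step.
resume_step() {
    if [ -f "$PROGRESS_FILE" ]; then
        cat "$PROGRESS_FILE"
    else
        echo 0
    fi
}

# SLURM_RESTART_COUNT is unset on the first run, so default it to 0.
echo "restart count: ${SLURM_RESTART_COUNT:-0}"

step=$(resume_step)
while [ "$step" -lt "$TOTAL_STEPS" ]; do
    # Placeholder for the real work of one step.
    step=$((step + 1))
    echo "$step" > "$PROGRESS_FILE"   # record progress after every completed step
done
echo "finished at step $step"
```

If the job is preempted halfway through, the next run picks up at the last recorded step instead of repeating the earlier ones.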

Partitions:
public: This partition includes all the nodes in the cluster. Every user can launch jobs on it. It is selected by default. Jobs on this partition can be preempted.
newton: faculty partition for account cslab. Nodes: newton3, newton4, newton5.
nlp: private partition, only usable by account nlp. Highest priority. Nodes: nlp-2080-[1-2], nlp-A40-1, nlp-ada-[1-2], nlp-l40-[1-2], nlp-pro6000-1
nlp-amd: private partition, only usable by account nlp-amd. Highest priority. Nodes: nlp-mi300
bml: private partition, only usable by account bml. Highest priority. Nodes: plato[1-2], plotinus[1,2]
galileo: private partition, only usable by account galileo. Highest priority. Nodes: galileo1, galileo2, galileo4, galileo5
ran: private partition, only usable by account ran. Highest priority. Nodes: ran-mashawsha, entropy1, entropy2
dym: private partition, only usable by account dym-lab. Highest priority. Nodes: dym-lab, dym-lab2
espresso: private partition, only usable by account espresso. Highest priority. Nodes: bruno1, bruno2, bruno3, bruno4, bruno5
euler: private partition, only usable by account euler. Highest priority. Nodes: euler1, euler2
tdk: private partition, only usable by account tdk. Highest priority. Nodes: tdk-bm4
ash: private partition, only usable by account ash. Highest priority. Nodes: chuck1, chuck2
houdini: private partition, only usable by account houdini. Highest priority. Nodes: houdini

Accounts:
cs – public account. Can only launch jobs on the public partition.
cslab – public account for the newton3, newton4 and newton5 servers.
nlp – private account. Can launch jobs on the nlp and public partitions.
nlp-amd – private account. Can launch jobs on the nlp-amd partition.
ailon-lab – private account. Can launch jobs on the ailon-lab and public partitions.
bml – private account. Can launch jobs on the bml and public partitions.
galileo – private account. Can launch jobs on the galileo and public partitions.
ran – private account. Can launch jobs on the ran and public partitions.
dym-lab – private account. Can launch jobs on the dym-lab and public partitions.
espresso – private account. Can launch jobs on the espresso and public partitions.
euler – private account. Can launch jobs on the euler and public partitions.
ash – private account. Can launch jobs on the ash and public partitions.
tdk – private account. Can launch jobs on the tdk and public partitions.
houdini – private account. Can launch jobs on the houdini and public partitions.

The bottom line(s):
Selecting a specific partition:
If you’re on the cs account, don’t indicate the working partition (public will be selected by default). Your job will enter the queue and be assigned resources when they become available. If your job has returned to the queue, it means that a higher-priority job required the resources your job had been allocated. Your job has been requeued and will run again when resources become available.

If you’re on a private account, indicate the partition you want to run on using the -p (partition) and -A (account) arguments. Running on your private partition gives you priority over the cs account. Indicating the public partition (the default) queues your job for any node in the cluster, but the job could be preempted if a higher-priority job enters the queue.
Example run for nlp account:

 srun -p nlp -A nlp -c 2 --gres=gpu:1 --pty bash
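
The two cases above (cs account versus private account) can be sketched as a small shell helper. This is only an illustration: the function name newton_srun_cmd and its argument order are hypothetical, and it prints the command line instead of running it so that you can inspect it first. Note that Slurm's generic-resource syntax is --gres=gpu:N.

```shell
#!/bin/bash
# Print (do not run) an interactive srun command line for the given account.
# Usage: newton_srun_cmd ACCOUNT PARTITION CPUS GPUS
newton_srun_cmd() {
    account="$1"; partition="$2"; cpus="$3"; gpus="$4"
    cmd="srun"
    # cs users rely on the default (public) partition, so no -p/-A is added.
    if [ "$account" != "cs" ]; then
        cmd="$cmd -p $partition -A $account"
    fi
    cmd="$cmd -c $cpus"
    # Request GPUs only when at least one is asked for.
    if [ "$gpus" -gt 0 ]; then
        cmd="$cmd --gres=gpu:$gpus"
    fi
    echo "$cmd --pty bash"
}

newton_srun_cmd nlp nlp 2 1      # prints: srun -p nlp -A nlp -c 2 --gres=gpu:1 --pty bash
newton_srun_cmd cs public 4 0    # prints: srun -c 4 --pty bash
```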

Selecting specific nodes:
It is possible to specify nodes for a job. This is done using the argument -w and a comma-separated list of nodes. The node list must be a subset of the partition’s nodes (for example, you can’t choose isl-titan when working on the nlp partition).
Use case for choosing a specific node:
Each job must be assigned at least one CPU (a CPU, in this case, is a processor thread). When using GPUs, the CPU-GPU affinity must be considered. When all of a node’s GPUs are in use, some of its CPUs are often still free. If your job requires only CPUs and no GPUs, it’s worth asking for such a node, since this could reduce the chance that your job will be preempted.
Use the command snode to view the current resource usage of each node in the cluster.
Choose a node using -w <node>

e.g. srun -w nlp-2080-1 …
# or several machines
srun -w nlp-2080-[1,2],newton1 ...
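
Because the -w list must be a subset of the partition's nodes, it can help to check a list before submitting. The sketch below is a hypothetical helper (nodes_in_partition is not a Slurm or Newton command) that works on already-expanded node names; the nlp node names are copied from the partition list above. On the cluster, the allowed list could presumably be taken from Slurm itself, for example with sinfo -h -N -p nlp -o '%N'.

```shell
#!/bin/bash
# Succeed if every node in the comma-separated REQUESTED list appears in the
# whitespace-separated ALLOWED list (both given as plain, expanded names).
# Usage: nodes_in_partition REQUESTED ALLOWED
nodes_in_partition() {
    requested="$1"
    allowed=$(echo $2)          # unquoted on purpose: collapse newlines/spaces
    for node in $(echo "$requested" | tr ',' ' '); do
        case " $allowed " in
            *" $node "*) ;;     # node is part of the partition
            *) echo "not in partition: $node"; return 1 ;;
        esac
    done
    return 0
}

# Nodes of the nlp partition, written out from the partition list.
NLP_NODES="nlp-2080-1 nlp-2080-2 nlp-A40-1 nlp-ada-1 nlp-ada-2 nlp-l40-1 nlp-l40-2 nlp-pro6000-1"

nodes_in_partition "nlp-2080-1,nlp-ada-2" "$NLP_NODES" && echo "ok to use with -p nlp"
nodes_in_partition "nlp-2080-1,newton3" "$NLP_NODES" || echo "pick another partition or node"
```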

CPU-GPU affinity in a nutshell:
GPUs communicate with CPUs using the PCIe bus. On a single-socket system, the CPU controls all the PCIe lanes. On a dual-socket system (Newton’s nodes are all dual-socket), each CPU usually controls half of the total PCIe lanes. Typically, CPU0 controls GPU0-3 while CPU1 controls GPU4-7 (on an 8-GPU system).
There may be situations where a node has free (unallocated) GPUs but too few free CPU threads (or none at all) to fulfill a job’s requirements. In that case a job appears to be waiting for resources even though resources seem to be available.

Consider this case: A node has 2 CPUs, each with 20 threads, and 8 GPUs. The node runs 2 jobs, each requiring 12 threads and 2 GPUs. The first job is allocated threads 0-11 on CPU0 and GPUs 0 and 1. The second job also requires 12 threads and 2 GPUs. GPUs 2 and 3 are available, but CPU0 does not have enough free threads (only 12-19 are free). The job is therefore allocated threads 0-11 on CPU1 and GPUs 4 and 5 (remember, CPU1 controls GPUs 4-7). At this point, a third job is launched which requires the same amount of resources: 12 threads and 2 GPUs. Overall, the server has 16 free threads (8 on each CPU) and 4 free GPUs (2 on each CPU), but no single CPU has enough free threads to fulfill the job’s resource requirements, forcing the job to stay in the queue until one of the two original jobs finishes.
Situations like these are one of the reasons that several smaller jobs are preferable to one big job: they allow for better resource allocation.
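
The arithmetic of this case can be replayed with a toy model. The sketch below is not Slurm's scheduling algorithm. It only tracks free threads per CPU socket (20 each, as in the example) and assumes a job must fit entirely on one socket. GPUs are left out because they are not the bottleneck in this scenario, and the name place_job is hypothetical.

```shell
#!/bin/bash
THREADS_PER_SOCKET=20
free0=$THREADS_PER_SOCKET     # free threads on CPU0
free1=$THREADS_PER_SOCKET     # free threads on CPU1

# Try to place a job needing $1 threads on a single socket.
# Sets `placed` to socket0, socket1 or pending, and updates the free counters.
place_job() {
    need="$1"
    if [ "$free0" -ge "$need" ]; then
        free0=$((free0 - need)); placed="socket0"
    elif [ "$free1" -ge "$need" ]; then
        free1=$((free1 - need)); placed="socket1"
    else
        placed="pending"      # no single socket has enough free threads
    fi
    echo "job needing $need threads -> $placed (free: socket0=$free0, socket1=$free1)"
}

place_job 12    # first job lands on socket0
place_job 12    # second job lands on socket1
place_job 12    # third job stays pending: 16 threads are free, but only 8 per socket
```

In the same model, a job asking for 8 threads or fewer would still fit on either socket, which is the point about smaller jobs made above.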

Job and resource limits

The cs account on the public partition has a limit of 8 GPUs and 300 CPUs per user.
The cslab account on the newton partition has a limit of 8 GPUs and a maximum run time of 20 days per job.