DARWIN – Partitions, accounts and QoS

A partition is a logical group of physical nodes. A node can belong to more than one partition. A Slurm job launches on a specific partition, selected with the -p argument. If no partition is specified, the job launches on the default partition.
Accounts are groups of users. On Newton, every user belongs to exactly one account, which is selected by default. The account a user belongs to determines which partitions they can launch jobs on. You can check your account by logging into Newton and running the command:

sacctmgr show user $USER withassoc format=user,account
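If the account name is needed inside a script, the same query can be wrapped in a small helper. This is only a sketch: the function name current_account is hypothetical, and it assumes the sacctmgr options -n (no header) and -P (unpadded, pipe-delimited output) are available on the cluster.

```shell
# Print the Slurm account of a user (defaults to the current user).
# -n drops the header line, -P prints unpadded, pipe-delimited fields.
# head keeps only the first line, since each user has a single account here.
current_account() {
    sacctmgr -n -P show user "${1:-$USER}" withassoc format=account | head -n 1
}
```

Calling current_account with no argument should print one account name, such as cs or bml, for the logged-in user.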

Quality of Service (QoS) is another control group entity that supports the Fair Share queueing system and defines resource limits. Newton currently has a single QoS for all users, with no resource limits, so you can ignore QoS at this time.

We use the term Golden Ticket to describe priority over resources for a specific group of users on specific nodes, and we use Slurm’s partition system to define golden tickets.

Preemption is the process of deferring a job in favor of a job with higher priority over the same resources. The preemption method used on Darwin is Requeue:

#SBATCH --requeue

meaning preempted jobs will be returned to the queue instead of being canceled or paused.
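A minimal sketch of a batch script that allows requeueing and reports whether it is running after a requeue. The job name, function name and messages are hypothetical. It relies on SLURM_RESTART_COUNT, which Slurm sets for jobs that have been requeued or restarted, and on the optional --open-mode=append setting so that output from earlier attempts is kept in the same log file.

```shell
#!/bin/bash
#SBATCH --job-name=requeue-demo   # hypothetical job name
#SBATCH --requeue                 # let Slurm return the job to the queue if preempted
#SBATCH --open-mode=append        # keep output from earlier attempts in the log

# SLURM_RESTART_COUNT is unset on the first attempt and holds the number
# of restarts once the job has been requeued.
describe_attempt() {
    local restarts="${SLURM_RESTART_COUNT:-0}"
    if [ "$restarts" -gt 0 ]; then
        echo "restart number $restarts"
    else
        echo "first attempt"
    fi
}

describe_attempt
# ... the actual workload would go here ...
```

A requeued batch job runs its script again from the top, so long-running workloads benefit from saving their progress and checking for it at startup.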

Partitions:
public: This partition includes all the nodes in the cluster. Every user can launch jobs on it. It is selected by default. Jobs on this partition can be preempted.
galileo: private partition, only usable by account galileo. Highest priority. Nodes: galileo3
bml: private partition, only usable by account bml. Highest priority. Nodes: socrates, protagoras
Accounts:
cs – public account. Can only launch jobs on the public partition.
bml – private account. Can launch jobs on the bml and public partitions.
galileo – private account. Can launch jobs on the galileo and public partitions.
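
The account-to-partition rules listed above can be written as a small lookup. This is an illustrative sketch only: allowed_partitions is a hypothetical helper, not a Slurm command, and the real access rules are enforced by the cluster configuration.

```shell
# Print the partitions an account may submit to, per the lists above.
# Returns a non-zero status for an account that is not listed.
allowed_partitions() {
    case "$1" in
        cs)      echo "public" ;;
        bml)     echo "bml public" ;;
        galileo) echo "galileo public" ;;
        *)       return 1 ;;
    esac
}
```

For example, allowed_partitions bml prints the two partitions open to the bml account.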

The bottom line(s):
Selecting specific partition:
If you’re on the cs account, don’t specify a partition (public will be selected by default). Your job will enter the queue and be assigned resources when they become available. If your job has returned to the queue, it means that a higher priority job required the resources your job had been allocated. Your job has been requeued and will run again when resources become available.

If you’re on a private account, specify the partition and account you want to run with using the -p (partition) and -A (account) arguments. Running on your private partition gives you priority over the cs account. Specifying the public partition (the default) queues your job for any node in the cluster, but the job could be preempted if a higher priority job enters the queue.
Example run for the bml account:

 srun -p bml -A bml -c 2 --pty bash 
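
The same request can also be written as a batch script. The sketch below mirrors the interactive srun example above; the function name report_allocation is hypothetical. It assumes Slurm exports SLURM_JOB_PARTITION and SLURM_CPUS_PER_TASK inside the job, the latter because -c is given.

```shell
#!/bin/bash
#SBATCH -p bml        # partition: the private bml partition
#SBATCH -A bml        # account: bml
#SBATCH -c 2          # two CPU cores per task, as in the srun example

# Report where the job landed. Slurm sets these variables inside a job,
# so the fallbacks only appear when the script is run outside Slurm.
report_allocation() {
    echo "partition=${SLURM_JOB_PARTITION:-none} cpus=${SLURM_CPUS_PER_TASK:-none}"
}

report_allocation
```

Saved under a name of your choice, for example job.sh, the script would be submitted with sbatch job.sh.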

 
