Running Jobs
Jobs on Baskerville are under the control of the Slurm scheduling system. The scheduling system is configured to offer an equitable distribution of resources over time to all users. The key means by which this is achieved are:
- Jobs are scheduled according to the resources that are requested.
- Jobs are not necessarily run in the order in which they are submitted.
- Jobs requiring a large number of cores and/or a long walltime will have to queue until the requested resources become available. The system will run smaller jobs that can fit in the available gaps until all of the resources requested for the larger job become available; this is known as backfill. It is therefore beneficial to specify a realistic walltime for a job so that it can be fitted into these gaps.
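As an illustration (the figure here is hypothetical, not a recommendation), a job expected to finish within about two hours might request a walltime of:
#SBATCH --time=2:30:0
Requesting a walltime close to the job's real duration makes it easier for the scheduler to backfill the job into gaps left by larger requests.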
Slurm Jobs
Here we give a quick introduction to Slurm commands. Those requiring finer-grained control should consult the relevant Slurm documentation. Jobs move through a (simplified!) lifecycle as follows:
---
title: Slurm job lifecycle
---
%%{init: {"flowchart": {"htmlLabels": false}} }%%
flowchart TD
cf["`**CONFIGURING (CF)** - job has been allocated resources, waiting for them to come ready`"]
pd["`**PENDING (PD)** - job is waiting for resources`"]
running["`**RUNNING (R)** - job is processing`"]
cg["`**COMPLETING (CG)** - job has finished and is cleaning up`"]
cd["`**COMPLETED (CD)** - job has completed`"]
f["`**FAILED (F)** - job has failed, either due to a Slurm constraint, e.g. a timeout, or due to a failure in the running code`"]
pd --> cf
cf --> running
running --> cg
cg --> cd
running --> f
Submitting a job
The command to submit a job is sbatch. For example, to submit the set of commands contained in the file myscript.sh, use the command:
sbatch myscript.sh
The system will return a job number, for example:
Submitted batch job 55260
Slurm is aware of your current working directory when submitting the job so there is no need to manually specify it in the script.
Upon completion of the job, there will be two output files in the directory from which you submitted the job. These files, for job ID 55260, are:
- slurm-55260.out - standard out and standard error output
- slurm-55260.stats - information about the job from Slurm
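A minimal sketch of what myscript.sh might contain is shown below (the project code, QoS name and requested time are placeholders; the --account and --qos options are explained under "Associate Jobs with Projects and QoS" below):
#!/bin/bash
#SBATCH --account=_projectname_
#SBATCH --qos=_qosname_
#SBATCH --time=10:0
#SBATCH --ntasks=1
set -e
module purge
module load baskerville
echo "Running on ${SLURM_JOB_NODELIST}"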
Cancelling a job
To cancel a queued or running job, use the scancel command and supply it with the job ID that is to be cancelled. For example, to cancel the previous job:
scancel 55260
Monitoring Your Jobs
There are a number of ways to monitor the current status of your job. You can view what’s going on by issuing any one of the following commands:
- squeue is Slurm's command for viewing the status of your jobs. It shows information such as the job's ID and name, the QoS used (the "partition", which will tell you the node type), the user who submitted the job, the time elapsed and the number of nodes being used.
- scontrol is a powerful interface that provides a greater level of detail about the status of your job. The show command within scontrol can be used to view details of a specific job.
For example:
squeue
squeue -j 55620
scontrol show job 55620
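By default squeue typically lists jobs from all users; to restrict the output to your own jobs, pass your username:
squeue -u $USER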
Associate Jobs with Projects and QoS
Every job has to be associated with a project to ensure the equitable distribution of resources. Project owners and members will have been issued a project code for each registered project, and only usernames authorised by the project owner will be able to run jobs using that project code. Additionally, every job has to be associated with a QoS.
You can see what projects you are a member of, and what QoS are available to you, by running the command:
my_baskerville
If you are registered on more than one project then the project should be specified using the --account option followed by the project code. For example, if your project code is project-name then add the following line to your job script:
#SBATCH --account=_projectname_
The QoS is specified using the --qos option followed by the QoS name. For example, if the QoS is qos-name then add the following line to your job script:
#SBATCH --qos=_qosname_
Array Jobs
Array jobs are an efficient way of submitting many similar jobs that perform the same work using the same script but on different data. Sub-jobs are the jobs created by an array job and are identified by an array job ID and an index. For example, if 55620_1 is an identifier, the number 55620 is the job array ID and 1 is the index of the sub-job.
Example array job
#!/bin/bash
#SBATCH --account=_projectname_
#SBATCH --qos=_qosname_
#SBATCH --time=5:0
#SBATCH --array=2-5%2
set -e
module purge
module load baskerville
echo "${SLURM_JOB_ID}: Job ${SLURM_ARRAY_TASK_ID} of ${SLURM_ARRAY_TASK_MAX} in the array"
In Slurm, there are different environment variables that can be used to dynamically keep track of these identifiers.
- #SBATCH --array=2-5%2 tells Slurm that this job is an array job and that it should run 4 sub-jobs (with IDs 2, 3, 4, 5). You can specify up to 4,096 array tasks in a single job (e.g. --array=1-4096). The % separator indicates the maximum number of sub-jobs able to run simultaneously; in this case only 2 will run at a time.
- SLURM_ARRAY_TASK_COUNT will be set to the number of tasks in the job array, so in the example this will be 4.
- SLURM_ARRAY_TASK_ID will be set to the job array index value, so in the example there will be 4 sub-jobs, each with a different value (from 2 to 5).
- SLURM_ARRAY_TASK_MIN will be set to the lowest job array index value, so in the example this will be 2.
- SLURM_ARRAY_TASK_MAX will be set to the highest job array index value, so in the example this will be 5.
- SLURM_ARRAY_JOB_ID will be set to the job ID provided by running the sbatch command.
Visit the Job Array Support section of the Slurm documentation for more details on how to carry out an array job.
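A common pattern is to use SLURM_ARRAY_TASK_ID to select a different input for each sub-job. The sketch below assumes a hypothetical file inputs.txt containing one input path per line; each sub-job reads the line matching its index:
#!/bin/bash
#SBATCH --account=_projectname_
#SBATCH --qos=_qosname_
#SBATCH --time=5:0
#SBATCH --array=1-4
set -e
module purge
module load baskerville
# Pick the line of inputs.txt corresponding to this sub-job's index
INPUT=$(sed -n "${SLURM_ARRAY_TASK_ID}p" inputs.txt)
echo "Sub-job ${SLURM_ARRAY_TASK_ID} processing ${INPUT}"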
Requesting GPUs
There are many methods for requesting GPUs for your job.
CUDA_VISIBLE_DEVICES and Slurm Jobs
In a Slurm job, CUDA_VISIBLE_DEVICES will index from 0 no matter which GPUs you have been allocated on that node. So, if you request 2 GPUs on a node then you will always see CUDA_VISIBLE_DEVICES=0,1 and these will be mapped to the GPUs allocated to your job.
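For example (a minimal sketch, showing only the GPU-related options as in the examples further below), a job requesting two GPUs on a single node can confirm this with:
#!/bin/bash
#SBATCH --gpus-per-task 2
#SBATCH --tasks-per-node 1
#SBATCH --nodes 1
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}"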
Available GPUs
We have provided a helper script, called baskstatus, that provides information on current GPU availability on Baskerville.
$ baskstatus
Current Baskerville GPU availability:
* 1 node with 1 x A100-40 available
* 2 nodes with 4 x A100-80 available
The information listed is a snapshot at the time the command is run; the GPUs shown may be allocated to other jobs shortly afterwards.
GPU Type
Baskerville has both A100-40GB and A100-80GB GPUs available. To request a specific GPU type for a job you should add a constraint to your job submission script:
#SBATCH --constraint=_feature_
where _feature_ is:
- a100_40 for the A100-40GB GPU nodes
- a100_80 for the A100-80GB GPU nodes
If a job does not specify a GPU type, then the system will select the most appropriate. This means that a job may span GPU types.
GPU Type
If your job requires all GPUs to have the same amount of memory (either all A100-40s or all A100-80s) then you must specify the appropriate feature.
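For example (a sketch, with the GPU counts illustrative and the account and QoS options omitted), a job that must run entirely on A100-80GB GPUs might request:
#SBATCH --constraint=a100_80
#SBATCH --gpus-per-task 2
#SBATCH --tasks-per-node 1
#SBATCH --nodes 1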
Multi-GPU, Multi-Task, or Multi-Node Jobs
In the examples below we will only show the SBATCH options related to requesting GPUs, tasks, and nodes. In each case we report output showing the GPU (PCI bus address) and process mapping (using gpus_for_tasks.cpp from NERSC), and the value of CUDA_VISIBLE_DEVICES for each task. This was done using the following script, with the relevant Slurm headers added in the blank line:
#!/bin/bash

module purge
module load baskerville
module load fosscuda/2020b
# Build the NERSC mapping test, linking against MPI and CUDA
g++ -o gpus gpus_for_tasks.cpp -lmpi -lcuda -lcudart
# Show the GPUs visible to each task, then run the mapping test
srun env | grep CUDA_VISIBLE_DEVICES
srun ./gpus
GPU Visibility to Tasks
By default, each task on a node will see all the GPUs allocated on that node to your job.
Further GPU information is available by adding srun nvidia-smi -L to the above script.
All these examples use srun to launch the individual processes. The behaviour of mpirun is different and you should confirm that it works as you expect.
Single GPU, Single Task, Single Node
#SBATCH --gpus-per-task 1
#SBATCH --tasks-per-node 1
#SBATCH --nodes 1
Rank 0 out of 1 processes: I see 1 GPU(s).
0 for rank 0: 0000:31:00.0
CUDA_VISIBLE_DEVICES=0
Multi GPU, Single Task, Single Node
#SBATCH --gpus-per-task 3
#SBATCH --tasks-per-node 1
#SBATCH --nodes 1
Rank 0 out of 1 processes: I see 3 GPU(s).
0 for rank 0: 0000:31:00.0
1 for rank 0: 0000:4B:00.0
2 for rank 0: 0000:CA:00.0
CUDA_VISIBLE_DEVICES=0,1,2
Single GPU, Multi Task, Single Node
#SBATCH --gpus-per-task 1
#SBATCH --tasks-per-node 2
#SBATCH --nodes 1
Rank 0 out of 2 processes: I see 1 GPU(s).
0 for rank 0: 0000:31:00.0
Rank 1 out of 2 processes: I see 1 GPU(s).
1 for rank 1: 0000:4B:00.0
CUDA_VISIBLE_DEVICES=0
CUDA_VISIBLE_DEVICES=0
Multi GPU, Multi Task, Single Node
#SBATCH --gpus-per-task 2
#SBATCH --tasks-per-node 2
#SBATCH --nodes 1
Rank 0 out of 2 processes: I see 2 GPU(s).
0 for rank 0: 0000:31:00.0
1 for rank 0: 0000:4B:00.0
Rank 1 out of 2 processes: I see 2 GPU(s).
2 for rank 1: 0000:CA:00.0
3 for rank 1: 0000:E3:00.0
CUDA_VISIBLE_DEVICES=0,1
CUDA_VISIBLE_DEVICES=0,1
The --gpu-bind option can be used to restrict the visibility of GPUs to the tasks.
Multi GPU, Single Task, Multi Node with GPU/Task Binding
#SBATCH --gpus-per-task 2
#SBATCH --gpu-bind=map_gpu:0,1,2,3
#SBATCH --tasks-per-node 1
#SBATCH --nodes 2
Rank 0 out of 2 processes: I see 2 GPU(s).
0 for rank 0: 0000:31:00.0
1 for rank 0: 0000:4B:00.0
Rank 1 out of 2 processes: I see 2 GPU(s).
0 for rank 1: 0000:31:00.0
1 for rank 1: 0000:4B:00.0
CUDA_VISIBLE_DEVICES=0,1
CUDA_VISIBLE_DEVICES=0,1
Multiple GPUs, Multiple Tasks Per Node, and --gpu-bind
Requesting multiple GPUs per task, multiple tasks per node, and using --gpu-bind is not supported by Slurm. Instead you will need to programmatically map the correct GPUs to tasks.
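One way to do this is a small wrapper script that sets CUDA_VISIBLE_DEVICES for each task before launching the real program. The sketch below is illustrative only: the wrapper name gpu_wrap.sh is hypothetical, and it assumes two GPUs per task with tasks numbered consecutively on each node via SLURM_LOCALID:
#!/bin/bash
# gpu_wrap.sh (hypothetical): give each task its own pair of GPUs based on
# its node-local task number, then run the supplied command.
GPUS_PER_TASK=2
FIRST=$(( SLURM_LOCALID * GPUS_PER_TASK ))
SECOND=$(( FIRST + 1 ))
export CUDA_VISIBLE_DEVICES="${FIRST},${SECOND}"
exec "$@"
This would then be launched with, for example, srun ./gpu_wrap.sh ./gpus instead of srun ./gpus.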
Job and resource limits
When submitting a job there are some limits imposed on what you can request:
- The maximum duration you can request is 10 days.
- You are limited to 8 nodes (32 GPUs) for a single job.
If you submit a request that exceeds these constraints the job will be rejected immediately upon submission. Please contact us if you would like to run jobs of this kind.