Job Submission

Further information on job submission can be found in the Baskerville job documentation.

Slurm

Baskerville uses the Slurm scheduler to submit and manage jobs. Slurm has a wide array of features; this section looks at a few, and the Slurm website has more information.

The primary method for submitting jobs is by using a batch script. The various Slurm options are passed via header lines that begin with #SBATCH, for example:

#SBATCH --account _projectaccount_  # Only required if you are a member of more than one Baskerville project
#SBATCH --qos _qos_  # You are assigned a QOS when you sign up to Baskerville
#SBATCH --time days-hours:minutes:seconds  # Wall-time limit for the job
#SBATCH --nodes n  # Normally 1, unless your job requires multi-node, multi-GPU
#SBATCH --gpus n  # Resource allocation on Baskerville is primarily based on GPU requirement
#SBATCH --cpus-per-gpu 36  # Normally leave this at 36 so that system resources are used effectively
#SBATCH --job-name _jobname_  # Name for the job

The --time option

For convenience, the --time option can be expressed in multiple formats (in addition to the one detailed above):

35 – a single numerical value is treated as minutes.
1:30 – two colon-separated values are treated as minutes:seconds.
3:45:0 – three colon-separated values are treated as hours:minutes:seconds.
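
To check how a given --time value will be interpreted, the formats above can be converted to seconds. The following is a minimal sketch in bash; slurm_time_to_seconds is a hypothetical helper written for this page, not a Slurm command, and it covers only the day-prefixed and colon-separated forms described here.

```bash
# Convert a Slurm --time string to seconds (illustrative helper, not part of Slurm).
slurm_time_to_seconds() {
    local spec=$1 days=0 hours=0 minutes=0 seconds=0 total
    local -a parts
    if [[ $spec == *-* ]]; then
        # days-hours[:minutes[:seconds]]
        days=${spec%%-*}
        IFS=: read -r hours minutes seconds <<< "${spec#*-}"
    else
        IFS=: read -r -a parts <<< "$spec"
        case ${#parts[@]} in
            1) minutes=${parts[0]} ;;
            2) minutes=${parts[0]}; seconds=${parts[1]} ;;
            3) hours=${parts[0]}; minutes=${parts[1]}; seconds=${parts[2]} ;;
            *) echo "unrecognised time format: $spec" >&2; return 1 ;;
        esac
    fi
    # 10# forces base 10 so that values such as "08" are not read as octal.
    total=$(( 10#${days:-0} * 86400 + 10#${hours:-0} * 3600 ))
    total=$(( total + 10#${minutes:-0} * 60 + 10#${seconds:-0} ))
    echo "$total"
}

slurm_time_to_seconds 35        # 2100   (35 minutes)
slurm_time_to_seconds 1:30      # 90     (1 minute 30 seconds)
slurm_time_to_seconds 3:45:0    # 13500  (3 hours 45 minutes)
slurm_time_to_seconds 1-12:0:0  # 129600 (1 day 12 hours)
```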

Task

Build a submission script that requests 2 GPUs and a wall-time (time limit) of 2 days, 5 hours and 30 minutes. Please also specify the job name.

Solution

#!/bin/bash

#SBATCH --account _projectaccount_
#SBATCH --qos _qos_
#SBATCH --time 2-5:30:0
#SBATCH --nodes 1
#SBATCH --gpus 2
#SBATCH --cpus-per-gpu 36
#SBATCH --job-name _jobname_
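
A script like this is submitted with the sbatch command. As a minimal sketch, assuming the script has been saved as job.sh (a hypothetical filename), the helper below submits it and keeps the job ID for later use. submit_job is an illustrative name; it relies on sbatch's --parsable option, which prints the job ID, optionally followed by a semicolon and the cluster name.

```bash
# Submit a batch script and print only the numeric job ID.
submit_job() {
    local script=$1 job_id
    job_id=$(sbatch --parsable "$script") || return 1
    # --parsable may print "jobid;cluster"; keep the part before the semicolon.
    echo "${job_id%%;*}"
}

# Only attempt a submission where sbatch and the (hypothetical) job.sh exist.
if command -v sbatch >/dev/null 2>&1 && [ -f job.sh ]; then
    job_id=$(submit_job job.sh) && echo "Submitted job ${job_id}"
fi
```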

Monitoring Jobs

Slurm provides several commands that you can use to inspect and adjust your running jobs. For example, you can:

Find out how many jobs you have running
See the nodes your jobs are running on
See how long a job has been running for
See how much time a job has left
Cancel running jobs

Details of these commands can be found on their help pages:

squeue – https://slurm.schedmd.com/squeue.html
scontrol – https://slurm.schedmd.com/scontrol.html
scancel – https://slurm.schedmd.com/scancel.html

Questions

You have submitted 2 jobs, which have Slurm IDs 7123 and 7235.

1. What command would you use to see all jobs for your user account?
2. What command would you use to inspect only job 7123?
3. What command would you use to cancel job 7235?

Answers

1. squeue (or squeue -u $USER to filter explicitly by user)
2. squeue -j 7123 or scontrol show job 7123
3. scancel 7235
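
The other items in the list at the start of this section (number of running jobs, nodes, elapsed time and time left) can be read from squeue using its output-format option. The functions below are a sketch only: their names are hypothetical, while the squeue options (-u, -t, -h, -j, -o) and the format fields (%i job ID, %j name, %T state, %M time used, %L time left, %N node list) are standard squeue features.

```bash
# Current user name, falling back to id(1) if $USER is unset.
me=${USER:-$(id -un)}

# Number of jobs currently in the RUNNING state for the current user.
count_running_jobs() {
    squeue -u "$me" -t RUNNING -h | wc -l
}

# Job ID, name, state, elapsed time, time left and node list for each job.
list_my_jobs() {
    squeue -u "$me" -o "%i %j %T %M %L %N"
}

# Time remaining for a single job, e.g. time_left 7123
time_left() {
    squeue -j "$1" -h -o "%L"
}

# Only query the scheduler where squeue is actually available.
if command -v squeue >/dev/null 2>&1; then
    echo "Running jobs: $(count_running_jobs)"
    list_my_jobs
fi
```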

Watching job progress

When you submit a job to Slurm it creates a file titled slurm-xxxxxx.stats (where “xxxxxx” represents the job’s ID); when the job starts running it creates a further file titled slurm-xxxxxx.out. You can watch the progress of a running job by executing the following command:

tail -f slurm-xxxxxx.out

Alternatively you can view the entire contents of these files in your terminal by using, for example, the cat command:

cat slurm-xxxxxx.stats

N.B. whilst the stats file shows useful information such as the amount of CPU, memory and time consumed by a job, it doesn’t show how much GPU resource was used beyond what was requested/allocated.
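
If you need a record of GPU usage, one option is to sample it yourself from within the batch script. The sketch below is an assumption on our part rather than a Baskerville-provided facility: it uses nvidia-smi's query mode to append utilisation and memory figures to a CSV file every 10 seconds while the job runs. The function name start_gpu_log and the log filename are hypothetical; SLURM_JOB_ID is set by Slurm inside a job.

```bash
# Start sampling GPU utilisation and memory use in the background.
# The PID of the sampler is left in gpu_log_pid so it can be stopped later.
start_gpu_log() {
    local log_file=$1
    nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used \
               --format=csv --loop=10 > "$log_file" &
    gpu_log_pid=$!
}

# Only start the sampler where nvidia-smi is available (i.e. on a GPU node).
if command -v nvidia-smi >/dev/null 2>&1; then
    start_gpu_log "gpu-usage-${SLURM_JOB_ID:-manual}.csv"
    # ... run your GPU application here ...
    kill "$gpu_log_pid"
fi
```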

QOSes

Slurm uses QOSes (also referred to as “queues”) to ensure the equitable distribution of resources across the system. Further details can be found in the main documentation on Baskerville projects and QOSes.

Question

How can you find information on your available QOSes?

Answer

The following are valid methods for querying your available QOSes:

Look at the confirmation email that you received when your Baskerville account was created.
See your project(s) and their associated QOSes at https://admin.baskerville.ac.uk
Execute the my_baskerville command in a terminal shell on Baskerville.

GPUs

General information on Baskerville’s GPUs (along with CPUs, storage, etc.) can be found on our system architecture page. Specific information on the A100 GPUs can be found on the relevant NVIDIA pages.

Quick GPU walk-through

This walk-through uses the cudaOpenMP sample program to show how the resources requested for a job affect what an application sees. Further information on the CUDA samples can be found on the Getting CUDA Samples page of NVIDIA’s website.

To run cudaOpenMP you first need to retrieve and compile the CUDA samples. Each version of the CUDA module on Baskerville provides a command that unpacks the sample sources into a specified directory, after which the make command builds them. Please follow the preparatory steps below before starting the tasks:

Use module spider or query https://apps.baskerville.ac.uk to determine the available versions of the fosscuda module.

Write a batch script (named, e.g., samples.sh) to accomplish the following tasks:

Load the required fosscuda (and therefore CUDA) module.
Run the cuda-install-samples-11.1.sh command, passing an argument to specify the installation directory.
Change directory (cd) to where the samples were unpacked.
Run the make command to build the necessary tools.

Example script:

#!/bin/bash

#SBATCH --account _projectaccount_
#SBATCH --qos _userqos_
#SBATCH --time 0-0:60:0
#SBATCH --nodes 1
#SBATCH --gpus 1
#SBATCH --cpus-per-gpu 36

set -x

module purge; module load baskerville
module load bask-apps/live
module load fosscuda/2020b

# Run the unpack command to extract the sources into the current working directory
cuda-install-samples-11.1.sh .
# Navigate to the sources directory and make using the available resource
cd NVIDIA_CUDA-11.1_Samples && make -j ${SLURM_CPUS_ON_NODE}

Submit the above script to Slurm using the sbatch command. Once the job has completed, you will have the cudaOpenMP binary required for the tasks below. Relative to the NVIDIA_CUDA-11.1_Samples directory, its path is ./bin/x86_64/linux/release/cudaOpenMP.

Tasks

Summary

Write and submit a batch file to run the cudaOpenMP command. It should specify the following details:

an appropriate account
an appropriate QOS
a job-name
wall-time of 10 minutes
1 node
2 GPUs

What is your output file?
Change GPUs to 4. What happens and why?
Change GPUs to 8. What happens and why?
Change nodes to 2 (whilst retaining 8 GPUs). What happens and why?

Refer to the Monitoring Jobs section for info on Watching job progress and how to read the .out and .stats files.

Solutions

The submission script for 2 GPUs, followed by its output file:

#!/bin/bash

#SBATCH --account _projectaccount_
#SBATCH --qos _userqos_
#SBATCH --time 0:10:0
#SBATCH --nodes 1
#SBATCH --gpus 2
#SBATCH --cpus-per-gpu 36
#SBATCH --job-name _jobname_

module purge; module load baskerville
module load bask-apps/live 
module load fosscuda/2020b

./cudaOpenMP

number of host CPUs:    72
number of CUDA devices: 2
0: A100-SXM4-40GB
1: A100-SXM4-40GB
---------------------------
CPU thread 0 (of 2) uses CUDA device 0
CPU thread 1 (of 2) uses CUDA device 1
---------------------------

With 4 GPUs the output changes as follows, reflecting the increase in reported GPUs and the proportional increase in host CPUs (36 per GPU).

number of host CPUs:    144
number of CUDA devices: 4
0: A100-SXM4-40GB
1: A100-SXM4-40GB
2: A100-SXM4-40GB
3: A100-SXM4-40GB
---------------------------
CPU thread 0 (of 4) uses CUDA device 0
CPU thread 1 (of 4) uses CUDA device 1
CPU thread 2 (of 4) uses CUDA device 2
CPU thread 3 (of 4) uses CUDA device 3
---------------------------

With 8 GPUs on a single node, the sbatch command rejects the job with the following error:
```
sbatch: error: Batch job submission failed: Requested node configuration is not available
```
Baskerville’s compute nodes each have 4 NVIDIA A100 GPUs. Requesting 8 GPUs whilst still restricting the job to a single node (with --nodes 1) asks for something that Baskerville’s configuration cannot provide. You should always keep in mind the architecture you are using and the maximum number of GPUs and CPUs available on a node.
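
The arithmetic behind these limits can be sketched as follows. The two constants come from this page (4 GPUs per node, and the recommended --cpus-per-gpu value of 36); the function names are hypothetical and the snippet is illustrative only.

```bash
GPUS_PER_NODE=4   # GPUs in each Baskerville compute node (stated above)
CPUS_PER_GPU=36   # the recommended --cpus-per-gpu value

# Smallest number of nodes that can hold the requested GPUs (ceiling division).
nodes_needed() {
    echo $(( ($1 + GPUS_PER_NODE - 1) / GPUS_PER_NODE ))
}

# Total CPUs allocated for the requested GPUs.
cpus_requested() {
    echo $(( $1 * CPUS_PER_GPU ))
}

for gpus in 2 4 8; do
    echo "gpus=${gpus} nodes=$(nodes_needed "$gpus") cpus=$(cpus_requested "$gpus")"
done
# gpus=2 nodes=1 cpus=72
# gpus=4 nodes=1 cpus=144
# gpus=8 nodes=2 cpus=288
```

A request for 8 GPUs therefore needs --nodes 2, which is the change made in the final task.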

With 2 nodes and 8 GPUs the job runs; its .out and .stats files are shown below, in that order. The output from cudaOpenMP shows far less resource than was requested and recorded in the stats file (4 GPUs and 144 CPUs, versus 8 GPUs, 288 CPUs and 2 nodes). The resource was divided equally between the two nodes, but cudaOpenMP is designed to operate on a single node only and therefore reports half of the allocated resource. This demonstrates the importance of requesting an amount of resource that is appropriate for the application you are running, so that allocated resource does not sit idle.

number of host CPUs:    144
number of CUDA devices: 4
0: A100-SXM4-40GB
1: A100-SXM4-40GB
2: A100-SXM4-40GB
3: A100-SXM4-40GB
---------------------------
CPU thread 0 (of 4) uses CUDA device 0
CPU thread 3 (of 4) uses CUDA device 3
CPU thread 2 (of 4) uses CUDA device 2
CPU thread 1 (of 4) uses CUDA device 1
---------------------------

+--------------------------------------------------------------------------+
| Job on the Baskerville cluster:
| Starting at Tue Sep 28 15:00:30 2021 for auser(123456)
| Identity jobid 12345 jobname cudaopenmp.sh
| Running against project ace-project and in partition baskerville-shared
| Requested cpu=288,mem=864G,node=2,billing=288,gres/gpu=8 - 00:10:00 walltime
| Assigned to nodes bask-pg0308u30a,bask-pg0308u31a
| Command /bask/projects/a/ace-project/cudaopenmp.sh
| WorkDir /bask/projects/a/ace-project
+--------------------------------------------------------------------------+
+--------------------------------------------------------------------------+
| Finished at Tue Sep 28 15:00:35 2021 for auser(123456) on the Baskerville Cluster
| Required (00:01.942 cputime, 3580K memory used) - 00:00:05 walltime
| JobState COMPLETING - Reason None
| Exitcode 0:0
+--------------------------------------------------------------------------+
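
The mismatch between what was requested and what the application saw can also be checked from the two files directly. The sketch below assumes the file layouts shown above (a "Requested ... gres/gpu=N" line in the stats file and a "number of CUDA devices: N" line in the output); the function names and the slurm-12345 filenames are hypothetical.

```bash
# Number of GPUs requested, taken from the "Requested" line of a .stats file.
requested_gpus() {
    sed -n 's|.*gres/gpu=\([0-9][0-9]*\).*|\1|p' "$1" | head -n 1
}

# Number of GPUs the application reported, taken from cudaOpenMP's output.
reported_gpus() {
    sed -n 's/^number of CUDA devices: *\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

out_file=slurm-12345.out
stats_file=slurm-12345.stats
if [ -f "$out_file" ] && [ -f "$stats_file" ]; then
    echo "requested GPUs: $(requested_gpus "$stats_file")"
    echo "reported GPUs:  $(reported_gpus "$out_file")"
fi
```

For the files shown above this would give 8 requested and 4 reported.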

CUDA routines and samples

Refer to NVIDIA’s Samples Reference documentation for further info on the CUDA routines and samples.

Interactive Jobs and Baskerville Portal

Baskerville Portal is the recommended method for running interactive jobs on Baskerville. For information on other methods of interactive jobs, please refer to the Interactive Jobs section of the documentation.

Transferring Data

Our recommended command-line tools for transferring data to/from the Baskerville cluster are as follows:

rsync – see the rsync man page for usage instructions; for general guidance please see the following webpage: https://linuxize.com/post/how-to-use-rsync-for-local-and-remote-data-transfer-and-synchronization
scp – see the scp man page for usage instructions; for general guidance please see the following webpage: https://linuxize.com/post/how-to-use-scp-command-to-securely-transfer-files
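
As a rough sketch of typical usage, the wrappers below copy a directory with rsync and a single file with scp. The function names are hypothetical, and _username_, _loginhost_ and the paths in the usage comments are placeholders that you need to replace with your own details. Setting RSYNC=echo or SCP=echo prints the arguments instead of transferring anything, which is a convenient way to preview a command.

```bash
# Copy the contents of a local directory to a destination with rsync.
# -a preserves permissions and timestamps, -v is verbose, --partial keeps
# partially transferred files so that an interrupted copy can be resumed.
push_dir() {
    ${RSYNC:-rsync} -av --partial --progress "${1%/}/" "$2"
}

# Copy a single file (in either direction) with scp.
copy_file() {
    ${SCP:-scp} "$1" "$2"
}

# Usage (placeholders, not real hosts or paths):
#   push_dir results _username_@_loginhost_:/path/to/project/results/
#   copy_file _username_@_loginhost_:/path/to/project/slurm-12345.out .
```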

Alternatively, Baskerville Portal includes a file management web-interface that can be used to upload and download content from directories to which you have access.