Baskerville System

The Baskerville system is a Lenovo® cluster solution supplied by OCF. The system was deployed and integrated, and is managed, by Advanced Research Computing at the University of Birmingham.

Compute nodes

The system comprises 57 SD650-N V2 liquid cooled compute trays with

  • 2x Intel® Xeon® Platinum 8360Y CPUs, each with 36 cores at 2.4GHz (with boost to 3.5GHz)
  • 512GB RAM (16x 32GB DDR4)
  • 1TB NVMe M.2 device (used for OS and available as /scratch-local)
  • 1x 25GbE NVIDIA® Mellanox® ConnectX-4 port (on-planar)
  • 1x HDR (200Gbps) NVIDIA Mellanox Infiniband port (ConnectX-6 PCIe gen4 adapter)
  • NVIDIA HGX-100 GPU planar
  • 4x NVIDIA A100 GPUs

The GPUs on 11 nodes have 80GB RAM; those on the remaining 46 nodes have 40GB RAM.

The GPUs are meshed using NVIDIA NVLink. Full details of the architecture are provided in the Lenovo documentation.

The compute trays are all direct liquid cooled using Lenovo Neptune™ to provide dense computing.

Benchmarking nodes

There are 2 nodes with:

  • 2x AMD EPYC® 9554 CPUs, each with 64 cores at 3.1GHz (with boost to 3.75GHz)
  • 768GB RAM
  • 4x NVIDIA H100 80G GPUs

These nodes are available for existing Baskerville users. Please contact us if you wish to access these nodes.

Installed software

The H100 nodes do not have the centrally installed, module-based software stack available. Anyone accessing these nodes will need to install their own software.

To submit a job that will use the H100 GPUs, you will need to add both a reservation and a constraint:

#SBATCH --reservation=NvidiaH100
#SBATCH --constraint=h100_80
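
As a minimal sketch, these directives might sit in a job script along the following lines. The account, time limit and resource requests are illustrative placeholders to adapt to your project, and any software environment must be one you have installed yourself:

#!/bin/bash
#SBATCH --account=<your-project>          # placeholder: your Baskerville project account
#SBATCH --time=1:0:0                      # illustrative time limit
#SBATCH --nodes=1
#SBATCH --gpus=1
#SBATCH --reservation=NvidiaH100
#SBATCH --constraint=h100_80

# No central module stack on these nodes; activate your own installation instead,
# e.g. a conda or virtualenv environment you have built.
nvidia-smi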

Large memory jobs

The default allocation of 3GB of memory per task will not grant you the node's maximum memory, even if you assign all the threads on the node; you have to set --mem separately.
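
For example, a job that wants most of a node's memory could request it explicitly alongside its CPU request; the memory figure below is an illustrative value, not a site-enforced limit:

#SBATCH --ntasks=1
#SBATCH --cpus-per-task=72                # every core on the node
#SBATCH --mem=488G                        # memory must be requested explicitly (illustrative value)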

Global storage

The system is equipped with Lenovo DSS-G storage systems running IBM® Spectrum Scale™:

  • 1x DSS-G250 equipped with 418x 16TB HDD
  • 1x DSS-G204 equipped with 48x 7.68TB SSD

Two file-systems are deployed:

  • /bask - general storage for home directories and project bulk data storage
  • /scratch-global - transient storage on SSD enclosures, available on all compute systems (see the example below)
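
For illustration, a job might stage its working data onto the transient scratch area and copy results back to project space afterwards. This is a minimal sketch, not a prescribed workflow; the project path and application name are placeholders:

# Stage data onto the SSD-backed scratch area, run, then copy results back to /bask
SCRATCH_DIR=/scratch-global/$USER/$SLURM_JOB_ID
mkdir -p "$SCRATCH_DIR"
cp /bask/projects/<your-project>/input.dat "$SCRATCH_DIR"/
cd "$SCRATCH_DIR"
./run_analysis input.dat                  # placeholder application
cp results.out /bask/projects/<your-project>/
rm -rf "$SCRATCH_DIR"                     # scratch is transient; clean up after the job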

Quota

User home directories have a hard limit of 20GB. Home directory space is provided for login-type scripts; it is expected that all code and data will be placed in a project space.

Project space is allocated at 1TB by default; additional quota for projects can be requested and is allocated based on justified need.

Network

Baskerville uses three networks:

  • isolated 1GbE management network (not user accessible)
  • 25GbE high speed Ethernet network
  • HDR fat tree Infiniband network

Infiniband network

The HDR fat tree Infiniband network is built using NVIDIA Mellanox Quantum HDR switches (QM8790).

All compute systems use ConnectX-6 PCIe gen4 adapters, which provide a full 200Gbps network connection. Architecturally, the adapter is connected to Socket 1 on the system planar.

Login, management and storage systems are PCIe gen3 attached and provide HDR-100 connectivity to the fabric. Storage nodes all have multiple HDR-100 ports, which are enabled via the Spectrum Scale verbsPorts option. RDMA is also enabled to the Spectrum Scale storage systems.
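
For reference, enabling RDMA and selecting the Infiniband ports in Spectrum Scale is typically done with configuration settings along these lines; the port names and node class below are illustrative placeholders, not the actual Baskerville configuration:

# Illustrative Spectrum Scale settings (applied by administrators; placeholders only)
mmchconfig verbsRdma=enable -N nsdNodes
mmchconfig verbsPorts="mlx5_0/1 mlx5_1/1" -N nsdNodes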

IPoIB is also deployed on the Infiniband network.

25GbE high speed Ethernet network

The high speed network is built using NVIDIA Mellanox Spectrum®-2 switches running NVIDIA Cumulus® Linux.

Trademarks and registered trademarks are owned by their respective companies.