Cluster hardware overview

Denvr Cloud clusters are hosted on Equinix Fabric, which provides internet-based access as well as secure private access over direct connect and VPN.

MSC1 (Calgary)

The MSC1 environment (Modular SuperCluster 1) contains more than 500 NVIDIA A100 GPUs connected by non-blocking InfiniBand for demanding HPC, machine learning, and scientific workloads. User application containers run directly on bare-metal hosts.

Features

  1. AMD EPYC 3rd Generation Zen3 processors (7003 series)
  2. Up to 8x NVIDIA A100 (40GB) Tensor Core GPUs
  3. Support for Multi-Instance GPU (MIG) with 5GB and 20GB partition sizes
  4. 800 Gbps non-blocking InfiniBand for multi-node training and distributed compute
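
As a rough illustration of what the MIG partition sizes mean for capacity, the sketch below estimates how many same-size MIG instances a node could expose. It assumes the 5GB and 20GB sizes correspond to NVIDIA's `1g.5gb` and `3g.20gb` profiles, which NVIDIA documents as allowing at most 7 and 2 instances per A100 40GB GPU respectively. The function name and the mapping are illustrative, not part of any Denvr Cloud API.

```python
# Maximum same-profile MIG instances per A100 40GB GPU, per NVIDIA's MIG
# documentation. ASSUMPTION: the "5GB" and "20GB" partition sizes listed
# above map onto the 1g.5gb and 3g.20gb profiles.
MAX_INSTANCES_PER_GPU = {"1g.5gb": 7, "3g.20gb": 2}


def mig_instances_per_node(profile: str, gpus_per_node: int = 8) -> int:
    """Return the upper bound on MIG instances of one profile on one node."""
    if profile not in MAX_INSTANCES_PER_GPU:
        raise ValueError(f"unknown MIG profile: {profile}")
    if gpus_per_node < 0:
        raise ValueError("gpus_per_node must be non-negative")
    return MAX_INSTANCES_PER_GPU[profile] * gpus_per_node


if __name__ == "__main__":
    for profile in MAX_INSTANCES_PER_GPU:
        print(profile, mig_instances_per_node(profile))
```

Note that the 5GB profile yields 7 instances per GPU rather than 8: the limit comes from the GPU's compute slices, not from dividing 40GB by 5GB.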

On-Demand and Reserved nodes

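| Node Type                   | GPUs | vCPUs | Memory (GB) | GPU-to-GPU Bandwidth | Network Bandwidth | Local Storage    |
|-----------------------------|------|-------|-------------|----------------------|-------------------|------------------|
| NVIDIA A100 (40GB) - NVLink | 8    | 128   | 1,024       | 600 GB/s             | 800 Gbps          | 15.4 TB NVMe SSD |
| NVIDIA A100 (40GB) - PCIe   | 4    | 64    | 512         | 64 GB/s              | 10 Gbps           | 7.68 TB NVMe SSD |
| CPU Only                    | -    | 256   | 1,024       | -                    | 10 Gbps           | 7.68 TB NVMe SSD |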
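
When sizing a job, it can help to treat the MSC1 node types as data and pick the smallest node that meets the requirements. The following is a minimal sketch under that assumption: the figures are copied from the node specifications above, while the short labels, the `NodeType` class, and `pick_node_type` are hypothetical names for illustration and do not correspond to a Denvr Cloud API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeType:
    name: str          # hypothetical short label, not an official identifier
    gpus: int
    vcpus: int
    memory_gb: int
    network_gbps: int


# Values taken from the MSC1 node specifications in this article.
MSC1_NODE_TYPES = [
    NodeType("A100-NVLink", gpus=8, vcpus=128, memory_gb=1024, network_gbps=800),
    NodeType("A100-PCIe", gpus=4, vcpus=64, memory_gb=512, network_gbps=10),
    NodeType("CPU-only", gpus=0, vcpus=256, memory_gb=1024, network_gbps=10),
]


def pick_node_type(min_gpus: int = 0, min_vcpus: int = 0,
                   min_memory_gb: int = 0) -> Optional[NodeType]:
    """Return the node type with the fewest GPUs (then vCPUs) that fits."""
    candidates = [
        n for n in MSC1_NODE_TYPES
        if n.gpus >= min_gpus
        and n.vcpus >= min_vcpus
        and n.memory_gb >= min_memory_gb
    ]
    # Prefer fewer GPUs first so GPU nodes are not spent on CPU-bound jobs.
    return min(candidates, key=lambda n: (n.gpus, n.vcpus), default=None)


if __name__ == "__main__":
    print(pick_node_type(min_gpus=4))
    print(pick_node_type(min_vcpus=200))
```

The function returns `None` when no single node satisfies the request, for example when more than 8 GPUs are needed and the job has to span multiple nodes over InfiniBand.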

Storage tiers

  1. Local scratch - lowest latency for scratch, model checkpoints, and training data
  2. Performance - high-IOPS network-attached SSD for application storage, large working datasets, and model checkpoints
  3. Bulk/archive - Qumulo hybrid HDD/SSD for protected long-term file storage

LAB

The LAB environment is a small cluster for previewing new technology before it becomes generally available in Denvr Cloud MSC clusters.

Features

  1. AMD EPYC 3rd Generation Zen3 processors (7003 series)
  2. NVIDIA A100 (40GB), NVIDIA A40 (48GB), and AMD MI210 (64GB) GPUs
  3. Up to 800 Gbps non-blocking InfiniBand for multi-node training and distributed compute

Reserved nodes

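| Node Type                   | GPUs | vCPUs | Memory (GB) | GPU-to-GPU Bandwidth | Network Bandwidth | Local Storage    |
|-----------------------------|------|-------|-------------|----------------------|-------------------|------------------|
| NVIDIA A100 (40GB) - NVLink | 8    | 128   | 1,024       | 600 GB/s             | 800 Gbps          | 15.4 TB NVMe SSD |
| NVIDIA A100 (40GB) - PCIe   | 4    | 64    | 512         | 64 GB/s              | 10 Gbps           | 7.68 TB NVMe SSD |
| NVIDIA A40 (48GB)           | 4    | 64    | 256         | 64 GB/s              | 10 Gbps           | 7.68 TB NVMe SSD |
| AMD MI210 (64GB)            | 2    | 64    | 256         | 64 GB/s              | 10 Gbps           | 7.68 TB NVMe SSD |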
