Cluster hardware overview
Denvr Cloud clusters are hosted on Equinix Fabric, which provides internet-based access as well as secure private access over direct connect and VPN.
MSC1 (Calgary)
The MSC1 environment (Modular SuperCluster 1) contains 500+ NVIDIA A100 GPUs and non-blocking InfiniBand for demanding HPC, machine learning, and scientific workloads. User application containers run directly on bare-metal hosts.
Features
- AMD EPYC 3rd Generation Zen3 processors (7003 series)
- Up to 8x NVIDIA A100 (40GB) Tensor Core GPUs
- Support for Multi-Instance GPU (MIG) with 5GB and 20GB partition sizes
- 800 Gbps non-blocking InfiniBand for multi-node training and distributed compute
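For rough capacity planning, it can help to convert the 800 Gbps InfiniBand figure into bytes per second. The sketch below is back-of-envelope arithmetic only: the function names are illustrative, and the result is a theoretical lower bound on transfer time that ignores protocol overhead, topology, and contention.

```python
def gbps_to_gigabytes_per_second(gbps: float) -> float:
    """Convert a link rate in gigabits per second to gigabytes per second."""
    return gbps / 8.0


def ideal_transfer_seconds(payload_gb: float, link_gbps: float) -> float:
    """Lower bound on the time to move payload_gb gigabytes over one link.

    Assumes the link runs at its full line rate with no overhead,
    which real transfers will not achieve.
    """
    return payload_gb / gbps_to_gigabytes_per_second(link_gbps)


if __name__ == "__main__":
    # 800 Gbps is the InfiniBand figure quoted above; 40 GB is one A100's memory.
    print(gbps_to_gigabytes_per_second(800))  # 100.0
    print(ideal_transfer_seconds(40, 800))    # 0.4
```

Measured throughput for a real multi-node job will be lower, so treat these numbers as an upper bound on what the fabric can deliver.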
On-Demand and Reserved nodes
| Node Types | GPUs | vCPUs | Memory (GB) | GPU-to-GPU Bandwidth | Network Bandwidth | Local storage |
| --- | --- | --- | --- | --- | --- | --- |
| NVIDIA A100 (40GB) - NVLink | 8 | 128 | 1,024 | 600 GB/s | 800 Gbps | 15.4 TB NVMe SSD |
| NVIDIA A100 (40GB) - PCIe | 4 | 64 | 512 | 64 GB/s | 10 Gbps | 7.68 TB NVMe SSD |
| CPU Only | - | 256 | 1,024 | - | 10 Gbps | 7.68 TB NVMe SSD |
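When sizing a job, the per-GPU share of host resources is often more useful than the node totals. The following is a minimal sketch that encodes the MSC1 node table as data and derives those ratios; the `NodeType` class and `per_gpu_share` helper are hypothetical names for illustration, not part of any Denvr Cloud API.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class NodeType:
    name: str
    gpus: int        # 0 for CPU-only nodes
    vcpus: int
    memory_gb: int   # host memory, not GPU memory


# Values copied from the MSC1 node table; the structure itself is illustrative.
MSC1_NODES = [
    NodeType("NVIDIA A100 (40GB) - NVLink", gpus=8, vcpus=128, memory_gb=1024),
    NodeType("NVIDIA A100 (40GB) - PCIe", gpus=4, vcpus=64, memory_gb=512),
    NodeType("CPU Only", gpus=0, vcpus=256, memory_gb=1024),
]


def per_gpu_share(node: NodeType) -> Optional[Tuple[int, int]]:
    """Return (vCPUs per GPU, host memory GB per GPU), or None for CPU-only nodes."""
    if node.gpus == 0:
        return None
    return node.vcpus // node.gpus, node.memory_gb // node.gpus


if __name__ == "__main__":
    for node in MSC1_NODES:
        print(node.name, per_gpu_share(node))
```

Both GPU node types work out to 16 vCPUs and 128 GB of host memory per GPU, so the main differences between them are GPU count, interconnect, and local storage.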
Storage tiers
- Local scratch - lowest latency for scratch, model checkpoints, and training data
- Performance - high-IOPS network-attached SSD for application storage, large working datasets, and model checkpoints
- Bulk/archive - Qumulo hybrid HDD/SSD for protected long-term file storage
LAB
The LAB environment is a small cluster for technology preview before general availability in Denvr Cloud MSC clusters.
Features
- AMD EPYC 3rd Generation Zen3 processors (7003 series)
- NVIDIA A100 (40GB), NVIDIA A40 (48GB), and AMD MI210 (64GB)
- Up to 800 Gbps non-blocking InfiniBand for multi-node training and distributed compute
Reserved nodes
| Node Types | GPUs | vCPUs | Memory (GB) | GPU-to-GPU Bandwidth | Network Bandwidth | Local storage |
| --- | --- | --- | --- | --- | --- | --- |
| NVIDIA A100 (40GB) - NVLink | 8 | 128 | 1,024 | 600 GB/s | 800 Gbps | 15.4 TB NVMe SSD |
| NVIDIA A100 (40GB) - PCIe | 4 | 64 | 512 | 64 GB/s | 10 Gbps | 7.68 TB NVMe SSD |
| NVIDIA A40 (48GB) | 4 | 64 | 256 | 64 GB/s | 10 Gbps | 7.68 TB NVMe SSD |
| AMD MI210 (64GB) | 2 | 64 | 256 | 64 GB/s | 10 Gbps | 7.68 TB NVMe SSD |
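Because the LAB node types mix GPU models with different memory sizes, it can be useful to compare the aggregate GPU memory available on a single node. The sketch below multiplies GPU count by per-GPU memory using the figures from the LAB table; the helper names are hypothetical, and raw capacity alone does not guarantee that a given workload fits, since frameworks and drivers reserve some memory.

```python
# (node type, GPUs per node, memory per GPU in GB), copied from the LAB table.
LAB_GPU_NODES = [
    ("NVIDIA A100 (40GB) - NVLink", 8, 40),
    ("NVIDIA A100 (40GB) - PCIe", 4, 40),
    ("NVIDIA A40 (48GB)", 4, 48),
    ("AMD MI210 (64GB)", 2, 64),
]


def total_gpu_memory_gb(gpus: int, memory_per_gpu_gb: int) -> int:
    """Aggregate GPU memory on one node, in GB."""
    return gpus * memory_per_gpu_gb


def nodes_with_at_least(required_gb: int) -> list:
    """Names of node types whose aggregate GPU memory meets required_gb."""
    return [
        name
        for name, gpus, mem in LAB_GPU_NODES
        if total_gpu_memory_gb(gpus, mem) >= required_gb
    ]


if __name__ == "__main__":
    for name, gpus, mem in LAB_GPU_NODES:
        print(f"{name}: {total_gpu_memory_gb(gpus, mem)} GB")
    print(nodes_with_at_least(180))
```

With these figures the per-node totals are 320 GB, 160 GB, 192 GB, and 128 GB respectively, so a 180 GB requirement would narrow the choice to the A100 NVLink and A40 node types.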