Storage platform overview

The Denvr Cloud is integrated with a high-performance storage platform made up of three separate storage systems: User Storage, Performance Storage, and Cache Storage. Together they deliver high throughput, parallel I/O, and high IOPS, as well as fault-tolerant data protection.


User Storage

Features:
  1. Petabyte-scale RAID-protected storage, optimized for low cost, parallel I/O, and persistent storage
  2. Up to 1 GB/s of read throughput
  3. Encrypted at rest and delivered to GPUs over tenant-private networking
Primary uses:
  1. Home directory files, application code and configuration, and datasets
  2. Clients who prefer to keep data in their own on-prem storage systems can use Denvr for caching
  3. GPUs can process data directly from User Storage, but may benefit from Cache Storage depending on the workload

Performance Storage

Features:
  1. Petabyte-scale network-attached Flash SSD with configurable data protection
  2. Up to 5 GB/s of read throughput
  3. Encrypted at rest and delivered to GPUs over tenant-private networking
Primary uses:
  1. Working space for very large datasets that are copied in and out of local Cache Storage
  2. Application file systems, databases, and other workloads with large capacity/high IOPS requirements
  3. A faster alternative to User Storage, typically used for pre-processing and for streaming large datasets in and out of Cache Storage
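
To get a feel for what the quoted throughput figures mean when staging data, the short Python sketch below estimates the minimum time to read a dataset once from User Storage (up to 1 GB/s) and from Performance Storage (up to 5 GB/s). The 2 TB dataset size and the function name are illustrative assumptions, and because the figures are "up to" peaks, the results are lower bounds rather than expected times.

```python
# Peak read throughput figures quoted in this article, in GB/s.
# These are "up to" values, so real transfers may be slower.
PEAK_READ_GBPS = {"user": 1.0, "performance": 5.0}


def min_read_seconds(dataset_gb: float, tier: str) -> float:
    """Return a lower bound on the time to read dataset_gb once from a tier."""
    return dataset_gb / PEAK_READ_GBPS[tier]


if __name__ == "__main__":
    dataset_gb = 2000.0  # hypothetical 2 TB dataset, chosen for illustration
    for tier in PEAK_READ_GBPS:
        minutes = min_read_seconds(dataset_gb, tier) / 60
        print(f"{tier}: at least {minutes:.1f} minutes")
```

Under these assumptions, a single pass over the dataset takes at least about 33 minutes from User Storage and about 7 minutes from Performance Storage.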

Cache Storage

Features:
  1. Up to 16 TB of direct-attached NVMe SSD per bare metal node
  2. Lowest latency and up to 12 GB/s of read throughput
  3. Cache is non-persistent and is freed when applications are stopped
Primary uses:
  1. Volumes provide local storage for model training data, checkpoint files, and intermediate results
  2. Data loaders should copy training data into Cache Storage so that GPUs stay fully utilized and are not bottlenecked by slower network-attached storage
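
The copy-into-cache step described above can be sketched in a few lines of Python. This is a minimal illustration, not a Denvr-provided tool: the function name and the example paths in the comments are hypothetical, and the actual mount points for network-attached and Cache Storage depend on your application. The sketch copies into a temporary ".partial" directory first, so that an interrupted copy is never mistaken for a complete dataset.

```python
import shutil
from pathlib import Path


def stage_to_cache(source_dir, cache_root) -> Path:
    """Copy a dataset directory into local cache storage, once.

    Returns the path of the cached copy. If a complete copy already
    exists, it is reused and nothing is copied.
    """
    source = Path(source_dir)
    target = Path(cache_root) / source.name
    if target.exists():
        return target  # already staged by an earlier run

    # Copy under a temporary name, then rename, so that a half-finished
    # copy is never picked up as a complete dataset.
    partial = target.with_name(target.name + ".partial")
    if partial.exists():
        shutil.rmtree(partial)  # discard leftovers from an interrupted copy
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copytree(source, partial)
    partial.rename(target)
    return target


# Example usage (both paths are hypothetical placeholders):
#   train_dir = stage_to_cache("/data/datasets/my-dataset", "/cache")
#   ...then point the data loader at train_dir instead of the network path.
```

Because Cache Storage is non-persistent, a step like this would need to run each time the application starts, and checkpoints or results that must survive should be copied back to persistent storage before the application is stopped.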
