Azure Optimized Stack with DeepSpeed for hyperscale model training

Azure Machine Learning (AzureML) now provides an optimized stack that uses NVIDIA’s latest GPU technology with Quantum InfiniBand to efficiently train and tune large models such as Megatron-Turing and GPT-3.

In recent years, deep learning models based on large-scale transformers and trained on large amounts of data have been used for new products and a range of cognitive tasks. These models have grown steadily in size, and customers’ training and tuning needs have grown with them.

Training and fine-tuning these types of models requires a complex, distributed architecture, and configuring such an architecture involves multiple manual, error-prone steps. With this new optimized stack, AzureML improves both usability and performance by providing a simple-to-use training pipeline. The stack spans the hardware, operating system, VM image and Docker image (with optimized PyTorch, DeepSpeed, ONNX Runtime and other Python packages), delivering performance and scalability without the added complexity.

Optimized stack for scalable distributed training on Azure

One possible experimental setup uses NDm A100 v4-series VMs, each with two 64-core AMD EPYC 7V12 CPUs, 1.7 TB of main memory and eight 80 GB A100 GPUs. A balanced PCIe topology connects 4 GPUs to each CPU, and each GPU has its own topology-agnostic 200 Gb/s NVIDIA Mellanox HDR InfiniBand link. The 1.7 TB of main memory, combined with the offload capabilities of the DeepSpeed library, allows scaling to large models. This setup can be used in both AzureML Studio and Azure Virtual Machine Scale Sets (VMSS), but AzureML Studio is recommended because it is the easiest way to get a correct setup up and running.
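To make the memory argument concrete, here is a rough back-of-the-envelope sketch in Python. The hardware figures come from the setup described above; the 16 bytes per parameter is an assumption (a commonly used estimate for mixed-precision training with Adam) and not a number from this article. The function names are hypothetical, and activations and buffers are ignored.

```python
from typing import Tuple

# Assumption (not from the article): mixed-precision Adam needs about
# 16 bytes per parameter: 2 (fp16 weights) + 2 (fp16 gradients)
# + 12 (fp32 weights, momentum and variance).
BYTES_PER_PARAM = 16

# Hardware figures from the NDm A100 v4 description above.
GPUS_PER_NODE = 8
GPU_MEMORY_GB = 80
NODE_CPU_MEMORY_GB = 1700  # 1.7 TB of main memory per VM


def model_state_gb(num_params: float) -> float:
    """Approximate size of weights, gradients and optimizer state in GB."""
    return num_params * BYTES_PER_PARAM / 1e9


def cluster_memory_gb(num_gpus: int) -> Tuple[int, int]:
    """Return (aggregate GPU memory, aggregate CPU memory) in GB."""
    nodes = num_gpus // GPUS_PER_NODE
    return num_gpus * GPU_MEMORY_GB, nodes * NODE_CPU_MEMORY_GB


def needs_cpu_offload(num_params: float, num_gpus: int) -> bool:
    """True if the model state alone exceeds the aggregate GPU memory."""
    gpu_gb, _ = cluster_memory_gb(num_gpus)
    return model_state_gb(num_params) > gpu_gb


if __name__ == "__main__":
    for params, gpus in [(175e9, 1024), (2e12, 1024), (2e12, 256)]:
        gpu_gb, cpu_gb = cluster_memory_gb(gpus)
        print(
            f"{params:.3g} params on {gpus} GPUs: "
            f"state ~{model_state_gb(params):.0f} GB, "
            f"GPU {gpu_gb} GB, CPU {cpu_gb} GB, "
            f"offload needed: {needs_cpu_offload(params, gpus)}"
        )
```

Under these assumptions, a 2-trillion-parameter model carries roughly 32 TB of model state. That is more than the aggregate GPU memory of 256 GPUs but well within the main memory of the same 32 VMs, which is why the large host memory matters when offloading is enabled.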

Differences between the distributed architecture and the AzureML training configuration

AzureML’s proposed stack enables efficient training of 2x larger models (2 trillion vs. 1 trillion parameters), scaling to 2x more GPUs (1024 vs. 512), and up to 1.8x higher compute performance per GPU (150 TFLOPs vs. 81 TFLOPs). The stack also scales near-linearly as both the model size and the number of GPUs increase. Thanks to DeepSpeed ZeRO-3 with its CPU offloading capabilities and this new AzureML stack, an efficient 157 TFLOPs per GPU is maintained as the model scales from 175 billion to 2 trillion parameters. For a fixed model size (e.g. 175 billion parameters in the following graph), performance scales linearly as the number of GPUs increases.
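As an illustration of what ZeRO-3 with CPU offloading looks like in practice, the following is a minimal sketch of a DeepSpeed configuration built in Python. It is not the configuration used for the results above: the helper name `build_zero3_config` is hypothetical, the choice of fp16 and pinned memory is an assumption, and mapping the per-GPU batch size of 8 to `train_micro_batch_size_per_gpu` is also an assumption.

```python
import json


def build_zero3_config(micro_batch_per_gpu: int = 8) -> dict:
    """Build an illustrative DeepSpeed config with ZeRO stage 3 and CPU offload.

    Hypothetical helper: the values are assumptions, not the article's settings.
    """
    return {
        # Assumed to correspond to the per-GPU batch size quoted in the article.
        "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
        # Mixed precision is an assumption; bf16 could be used instead.
        "fp16": {"enabled": True},
        "zero_optimization": {
            # Stage 3 partitions parameters, gradients and optimizer state.
            "stage": 3,
            # Keep optimizer state in host (CPU) memory.
            "offload_optimizer": {"device": "cpu", "pin_memory": True},
            # Keep parameters in host memory when they are not in use.
            "offload_param": {"device": "cpu", "pin_memory": True},
            # Overlap communication with computation.
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
    }


if __name__ == "__main__":
    config = build_zero3_config()
    # Write the config to a JSON file that a training script could load.
    with open("ds_config.json", "w") as f:
        json.dump(config, f, indent=2)
    print(json.dumps(config["zero_optimization"], indent=2))
```

A training script would typically pass such a dictionary (or the path to the JSON file) to `deepspeed.initialize` through its `config` argument. The right offload settings depend on the model size and the number of GPUs, so treat these values as a starting point to tune.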

More detailed results are described in the extended DeepSpeed technical blog.

a. Performance/GPU vs. model size, from 175 billion to 2 trillion parameters (BS/GPU=8).

b. Performance scales linearly with an increasing number of GPU devices for the 175B model (BS/GPU=16).
