Microsoft AI Research Introduces DeepSpeed-MII, A New Open-Source Python Library From DeepSpeed That Speeds Up 20,000+ Widely Used Deep Learning Models

DeepSpeed is a user-friendly deep learning optimization software suite that enables unprecedented scale and speed for both DL training and inference.



Reshape Large Model Training Landscape

DeepSpeed provides a confluence of system innovations that have reshaped the landscape of large-scale DL training: they make far larger models feasible to train, dramatically improve ease of use, and make training both effective and efficient. The DeepSpeed-Training pillar includes technologies such as ZeRO, 3D parallelism, DeepSpeed-MoE, and ZeRO-Infinity.

Optimize Large Model Inference

To enable inference at an unprecedented scale while achieving unmatched latency, throughput, and cost reduction, DeepSpeed combines innovations in parallelism technology such as tensor, pipeline, expert, and ZeRO-parallelism with high-performance custom inference kernels, communication optimizations, and heterogeneous memory technologies. This systematic composition of system technologies for inference makes up the DeepSpeed-Inference pillar.

Speed Up Inference & Reduce Model Size

DeepSpeed provides simple, adaptable strategies for researchers and practitioners to compress their models, resulting in faster inference, smaller model size, and much lower compression cost. The DeepSpeed-Compression pillar also covers SoTA compression innovations such as ZeroQuant and XTC.

DeepSpeed Model Implementations for Inference (MII)

Instant acceleration for 24,000+ open-source DL models with up to 40x cheaper inference.

In the past several months, the Deep Learning (DL) open-source community has experienced rapid growth. Through platforms like Hugging Face, anyone with access to a few GPUs, or even a single one, can now use incredibly powerful text generation models like BLOOM 176B or image generation models like Stable Diffusion. Open-sourcing has made AI capabilities accessible to everyone, but two critical issues still limit their use: 1) inference latency and 2) cost.

System optimizations for DL model inference have made tremendous progress and can significantly lower both latency and cost, yet they remain difficult to access. This limited accessibility is largely due to the heterogeneity of the DL inference landscape, with models varying in size, architecture, system performance characteristics, hardware requirements, and so on. Determining the right system optimizations for a specific model and implementing them correctly often falls outside the expertise of most data scientists, leaving low-latency, low-cost inference largely out of reach.

A new open-source Python library from DeepSpeed called DeepSpeed-MII aims to make low-latency, low-cost inference of powerful models both practical and accessible.

  •  MII provides access to thousands of widely used DL models through highly optimized implementations.
  •  Compared to their original implementations, models supported by MII achieve significantly lower latency and cost.
  •  MII lowers the latency of the BigScience BLOOM 176B model by 5.7x while cutting the cost by nearly 40x.
  •  MII lowers the latency and cost of deploying Stable Diffusion by 1.9x.
  •  To enable low-latency/low-cost inference, MII leverages an extensive set of optimizations from DeepSpeed-Inference, including deep fusion for transformers, automated tensor-slicing for multi-GPU inference, on-the-fly quantization with ZeroQuant, and several more.
  •  MII enables low-cost deployment of these models, both on-premises and on Azure via AML, with state-of-the-art performance in just a few lines of code (see the sketch after this list).
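
To illustrate the "few lines of code" claim, here is a minimal sketch of deploying and querying a text generation model with MII. The model name, deployment name, and generation arguments are illustrative placeholders; consult the DeepSpeed-MII documentation for the exact API of your installed version.

```python
# Minimal sketch: deploy a Hugging Face text-generation model with MII
# and query it locally. Model and deployment names are illustrative.
import mii

mii.deploy(task="text-generation",
           model="bigscience/bloom-560m",       # any supported HF model id
           deployment_name="bloom560m_deploy")

# Obtain a handle to the running deployment and send a request.
generator = mii.mii_query_handle("bloom560m_deploy")
result = generator.query({"query": ["DeepSpeed is"]}, max_new_tokens=64)
print(result)
```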

How does MII work?

The MII architecture illustrates how MII uses DeepSpeed-Inference to automatically optimize open-source models, which are then deployed either on-premises via gRPC or on Microsoft Azure via AML Inference.

Under the hood, MII is powered by DeepSpeed-Inference. Based on the model type, model size, batch size, and available hardware resources, MII automatically applies the appropriate set of DeepSpeed-Inference system optimizations to minimize latency and maximize throughput. Using one of its many pre-defined model injection policies, MII and DeepSpeed-Inference recognize the underlying PyTorch model architecture and replace it with an optimized implementation (see Figure 1). In doing so, MII automatically exposes the broad set of DeepSpeed-Inference optimizations to the thousands of widely used models it supports.
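
For reference, the module-swap mechanism described above can be seen directly in DeepSpeed-Inference. The sketch below, using an illustrative GPT-2 checkpoint, asks DeepSpeed to inject its optimized kernels into a standard Hugging Face model; MII automates this step behind its deployment API, and the exact keyword arguments may differ across DeepSpeed versions.

```python
# Sketch: let DeepSpeed-Inference replace the PyTorch transformer blocks
# of a Hugging Face model with its optimized, kernel-injected versions.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").half().eval()

# replace_with_kernel_inject triggers the pre-defined injection policies
# that recognize the architecture and swap in fused inference kernels.
engine = deepspeed.init_inference(model,
                                  mp_size=1,                  # single GPU
                                  dtype=torch.half,
                                  replace_with_kernel_inject=True)

inputs = tokenizer("DeepSpeed-MII makes", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```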

Supported Models and Tasks

MII supports thousands of transformer models available from multiple open-source model repositories, such as Hugging Face, FairSeq, EleutherAI, etc., across a growing list of tasks including text generation, question answering, and text classification. It supports dense models based on the BERT, RoBERTa, GPT, OPT, and BLOOM architectures, ranging from a few hundred million to hundreds of billions of parameters, as well as recent image generation models such as Stable Diffusion.
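
The same deployment pattern covers the other supported tasks. The sketch below assumes a question-answering deployment whose query fields mirror the Hugging Face QA pipeline inputs; the model id and field names are illustrative rather than prescriptive.

```python
# Sketch: deploy a question-answering model with MII. The query fields
# ("question"/"context") mirror the Hugging Face QA pipeline inputs.
import mii

mii.deploy(task="question-answering",
           model="deepset/roberta-large-squad2",   # illustrative HF model id
           deployment_name="qa_deploy")

qa = mii.mii_query_handle("qa_deploy")
answer = qa.query({"question": "What does MII optimize?",
                   "context": "MII reduces the latency and cost of DL inference."})
print(answer)
```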

Inference Optimizations with MII

Here, we give a summary of the extensive collection of DeepSpeed-Inference optimizations that MII has made available.

DeepFusion for Transformers:

For transformer-based models such as BERT, RoBERTa, GPT-2, and GPT-J, MII uses the transformer kernels in DeepSpeed-Inference, which are optimized with DeepFusion to achieve low latency at small batch sizes and high throughput at large batch sizes.

Multi-GPU Inference with Tensor-Slicing:

MII automatically enables tensor-parallelism within a node to exploit the aggregate memory bandwidth and compute of multiple GPUs, achieving the lowest latency and highest throughput available today for large models such as BLOOM 176B.
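
Tensor-parallelism can be requested through MII's deployment configuration. This is a sketch under the assumption that the configuration accepts a `tensor_parallel` degree and a `dtype` field, as in the examples shipped with the library; the exact key names may vary by release.

```python
# Sketch: shard a large model across 2 GPUs in one node via tensor-slicing.
import mii

mii.deploy(task="text-generation",
           model="EleutherAI/gpt-neox-20b",        # illustrative large model
           deployment_name="gptneox_deploy",
           mii_config={"dtype": "fp16",
                       "tensor_parallel": 2})       # split across 2 GPUs
```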

INT8 Inference with ZeroQuant:

For massive models with tens or even hundreds of billions of parameters, MII supports INT8 inference with ZeroQuant. Using this feature not only reduces the memory footprint and the number of GPUs required for inference, but also increases inference throughput by allowing larger batch sizes and using INT8 compute, making it cheaper than FP16.
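
In DeepSpeed-Inference terms, INT8 inference is requested by changing the engine's data type. The following sketch assumes the INT8 path of `deepspeed.init_inference`; the exact quantization behavior and supported options depend on the DeepSpeed version.

```python
# Sketch: request INT8 inference instead of FP16 when initializing the
# DeepSpeed-Inference engine (behavior depends on the DeepSpeed version).
import torch
import deepspeed
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("facebook/opt-13b")
engine = deepspeed.init_inference(model,
                                  mp_size=2,              # shard over 2 GPUs
                                  dtype=torch.int8,       # INT8 weights/compute
                                  replace_with_kernel_inject=True)
```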

ZeRO-Inference for Resource-Constrained Systems:

Even with INT8 support, models like BLOOM 176B require more than 176 GB of memory just to fit. When the aggregate GPU memory across several GPUs needed to host such models is not available, MII enables ZeRO-Inference, which leverages system CPU memory to deploy these massive models on a single GPU with limited memory.
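
ZeRO-Inference is driven by a ZeRO stage-3 configuration that offloads parameters to CPU memory. Below is a minimal sketch, assuming the standard DeepSpeed config keys for stage-3 parameter offload and an illustrative model; it would typically be launched with the deepspeed launcher (e.g., `deepspeed --num_gpus 1 script.py`).

```python
# Sketch: host a model too large for one GPU by offloading its parameters
# to CPU memory with ZeRO stage 3 (ZeRO-Inference), then run generation.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

ds_config = {
    "fp16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,
        "offload_param": {"device": "cpu", "pin_memory": True},
    },
    "train_micro_batch_size_per_gpu": 1,   # required by the config schema
}

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-7b1")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-7b1")

# deepspeed.initialize returns (engine, optimizer, dataloader, lr_scheduler);
# only the engine is needed for inference.
engine = deepspeed.initialize(model=model, config=ds_config)[0]
engine.module.eval()

inputs = tokenizer("ZeRO-Inference lets a single GPU", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
```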

Compiler Optimizations:

Where applicable, MII automatically applies compiler-based optimizations via TorchScript, nvFuser, and CUDA graphs to further reduce latency and improve throughput.
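
These compiler paths are standard PyTorch features. For instance, TorchScript tracing, whose resulting graph nvFuser can then optimize at runtime, looks roughly like the sketch below, independent of MII; the model choice is illustrative.

```python
# Sketch: trace a Hugging Face model into TorchScript so that graph-level
# compiler optimizations (e.g., nvFuser fusion) can be applied at runtime.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", torchscript=True).eval().cuda()

inputs = tokenizer("DeepSpeed-MII example input", return_tensors="pt").to("cuda")
traced = torch.jit.trace(model, (inputs["input_ids"], inputs["attention_mask"]))

with torch.no_grad():
    outputs = traced(inputs["input_ids"], inputs["attention_mask"])
```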

Quantifying Latency and Cost Reduction

Inference workloads can be either latency-critical, where the primary goal is to minimize latency, or cost-sensitive, where the primary goal is to minimize cost. In this section, we quantify the benefits of using MII for both latency-critical and cost-sensitive scenarios.

MII works with two variants of DeepSpeed-Inference. The first, referred to as ds-public, contains most of the optimizations discussed above and is also available through our open-source DeepSpeed library. The second, ds-azure, offers tighter integration with Azure and is available through MII to all Microsoft Azure customers. We refer to MII running on these two DeepSpeed-Inference variants as MII-Public and MII-Azure, respectively.

Both MII-Public and MII-Azure offer significant latency and cost reduction compared to the open-source PyTorch implementation (Baseline); however, for certain generative workloads, their performance can differ. Here, we quantify the latency and cost reduction for both variants.
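
To run this kind of comparison on your own deployment, end-to-end latency can be measured directly at the query handle. This is a minimal sketch reusing the hypothetical text-generation deployment from the earlier example, with an assumed hourly GPU price for the cost estimate.

```python
# Sketch: measure end-to-end latency of an MII deployment and derive a
# rough per-query cost from an assumed hourly GPU price.
import time
import mii

generator = mii.mii_query_handle("bloom560m_deploy")   # from the earlier sketch

start = time.perf_counter()
generator.query({"query": ["DeepSpeed is"]}, max_new_tokens=64)
latency_s = time.perf_counter() - start

GPU_PRICE_PER_HOUR = 3.0   # assumed price; replace with your instance's rate
print(f"latency: {latency_s * 1000:.1f} ms, "
      f"approx. cost/query: ${latency_s * GPU_PRICE_PER_HOUR / 3600:.6f}")
```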

Usman Farooq
