Zachary Mueller’s Post

Zachary Mueller

Technical Lead for Accelerate at HuggingFace

Today is an extra-special release of Hugging Face Accelerate: we have now integrated PyTorch's revamped PiPPy library for a torch-native pipeline parallel inference solution!

What is pipeline parallelism (PP), and how does it differ from distributed data parallelism (DDP)? PP lets you shard a model across multiple devices while keeping all of the GPUs active during inference (rather than having each collection of GPUs process its own batch). We've done this while still preserving the `device_map="auto"` style: as long as your GPUs can collectively hold the sharded model, you can use this technique! This can increase your throughput by at least 40-50% when a model is loaded across multiple GPUs (we tested on 2x 4090s).

In true Accelerate fashion, the code to use this is minimal: wrap the model once, dictate the inputs to pass in (setup relies on tracing), and perform inference like you normally would.*

#pytorch #huggingface #deployment #largelanguagemodels

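For context, here is a minimal sketch of what that single-wrapper usage looks like, modeled on the Accelerate distributed-inference example. The `prepare_pippy` helper, its `split_points="auto"` argument, and the example model/input shapes below are illustrative assumptions and should be checked against the current Accelerate documentation:

```python
import torch
from transformers import AutoModelForSequenceClassification
from accelerate import PartialState
from accelerate.inference import prepare_pippy

# Load the model on CPU first; prepare_pippy shards it across the available GPUs.
model = AutoModelForSequenceClassification.from_pretrained("gpt2")
model.eval()

# Example input used for tracing; the pipeline schedule is built from these shapes.
example_input = torch.randint(
    low=0, high=model.config.vocab_size, size=(2, 512), dtype=torch.int64
)

# Single wrapper call: split the model automatically and set up pipeline parallelism.
model = prepare_pippy(model, split_points="auto", example_args=(example_input,))

# Inference looks like it normally would; each launched process drives one stage.
with torch.no_grad():
    output = model(example_input)

# By default only the last pipeline stage holds the final output.
if PartialState().is_last_process and output is not None:
    print(output)
```

A script like this would be launched with one process per GPU (for example via `accelerate launch script.py`), so every stage of the pipeline stays busy during inference.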
craig parsey

Heavy duty fitter and augmented intelligence embodied ai augmented reality

5mo

Can you shard a model and run it all on one GPU?

Uygar Hizal

📊 Microsoft Certified Azure Data Scientist | 👁️🗨️ Azure AI Engineer | ✏️ Microsoft Certified Trainer

5mo

Great news. A question: can we achieve this with 3090s too?

Awais Nawaz

NLP Engineer @ VisionTech360

5mo

This is great! Does this mean it explicitly implements FSDP?

Louis Dansette

Web Marketing and SEO Optimization Manager

5mo

Great job on the integration! How does pipeline parallelism impact inference latency compared to DDP?


Imagine doing this on 1.5 TB of MI300X. 🤯

