Today is an extra-special release of Hugging Face Accelerate: we've integrated PyTorch's revamped PiPPy library for a torch-native pipeline-parallel inference solution!

What is pipeline parallelism (PP), and how does it differ from distributed data parallelism (DDP)? PP shards a model across multiple devices and schedules all the GPUs to stay active during inference by pipelining micro-batches through the stages, rather than giving each GPU (or group of GPUs) a full copy of the model to process its own batch, as DDP does.

We've done this while keeping the familiar `device_map="auto"` experience. As long as your GPUs can collectively store the sharded model, you can use this technique! It can increase your throughput when a model is loaded across multiple GPUs by at least 40-50% (we tested on 2x 4090s).

In true Accelerate fashion, the code to use this is minimal: just call a single wrapper, specify the inputs to pass in (setup relies on tracing), and perform inference like you normally would.

#pytorch #huggingface #deployment #largelanguagemodels
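As a rough sketch of what this looks like in practice (modeled on the pattern in the Accelerate docs linked below; the model choice and input shape here are placeholders, not a prescription):

```python
import torch
from transformers import AutoModelForCausalLM
from accelerate.inference import prepare_pippy

# Load the model normally; prepare_pippy handles splitting it across GPUs.
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# An example input is required: the wrapper traces the model with it
# to determine where to split the pipeline stages.
input_ids = torch.randint(
    low=0, high=model.config.vocab_size, size=(2, 512), dtype=torch.int64
)

# The single wrapper call: shards the model and sets up the pipeline schedule.
model = prepare_pippy(model, example_args=(input_ids,))

# Inference looks the same as usual; micro-batches flow through the stages
# so all GPUs stay busy.
with torch.no_grad():
    output = model(input_ids)
```

The script is then launched across the GPUs with `accelerate launch script.py`; by default the output lands on the last pipeline stage's process (there is a `gather_output` option to broadcast it back to all ranks).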
Can you shard a model and run it all on one GPU?
Great news. A question: can we achieve this with 3090s too?
This is great! Does this mean it explicitly implements FSDP?
Great job on the integration! How does pipeline parallelism impact inference latency compared to DDP?
Imagine doing this on 1.5TB of MI300X. 🤯
Technical Lead for Accelerate at HuggingFace
To read more about this new interface, check out the documentation and example zoo: https://huggingface.co/docs/accelerate/usage_guides/distributed_inference#memory-efficient-pipeline-parallelism-experimental https://github.com/huggingface/accelerate/tree/main/examples/inference