Today is an extra-special release of Hugging Face Accelerate: we've integrated PyTorch's revamped PiPPy library for a torch-native pipeline-parallel inference solution!

What is pipeline parallelism (PP), and how does it differ from distributed data parallelism (DDP)? PP shards a model across multiple devices and schedules all the GPUs to stay active during inference by pipelining micro-batches through the stages, rather than giving each GPU (or group of GPUs) a full copy of the model to process its own batch, as DDP does.

We've done this while keeping the familiar `device_map="auto"` experience. As long as your GPUs can collectively store the sharded model, you can use this technique! It can increase your throughput when a model is loaded across multiple GPUs by at least 40-50% (we tested on 2x 4090s).

In true Accelerate fashion, the code to use this is minimal: just call a single wrapper, specify the inputs to pass in (setup relies on tracing), and perform inference like you normally would.

#pytorch #huggingface #deployment #largelanguagemodels
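As a rough sketch of what this looks like in practice (modeled on the pattern in the Accelerate docs linked below; the model choice and input shape here are placeholders, not a prescription):

```python
import torch
from transformers import AutoModelForCausalLM
from accelerate.inference import prepare_pippy

# Load the model normally; prepare_pippy handles splitting it across GPUs.
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

# An example input is required: the wrapper traces the model with it
# to determine where to split the pipeline stages.
input_ids = torch.randint(
    low=0, high=model.config.vocab_size, size=(2, 512), dtype=torch.int64
)

# The single wrapper call: shards the model and sets up the pipeline schedule.
model = prepare_pippy(model, example_args=(input_ids,))

# Inference looks the same as usual; micro-batches flow through the stages
# so all GPUs stay busy.
with torch.no_grad():
    output = model(input_ids)
```

The script is then launched across the GPUs with `accelerate launch script.py`; by default the output lands on the last pipeline stage's process (there is a `gather_output` option to broadcast it back to all ranks).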
Can you shard a model and run it all on one GPU?
Great news. A question: can we achieve this with 3090s too?
This is great! Does this mean it explicitly implements FSDP?
Great job on the integration! How does pipeline parallelism impact inference latency compared to DDP?
Imagine doing this on 1.5TB of MI300X. 🤯
Technical Lead for Accelerate at HuggingFace
To read more about this new interface, check out the documentation and example zoo: https://huggingface.co/docs/accelerate/usage_guides/distributed_inference#memory-efficient-pipeline-parallelism-experimental https://github.com/huggingface/accelerate/tree/main/examples/inference