Joshua Hartman’s Post

Joshua Hartman

Director of Engineering, Databricks ML Platform - I'm hiring!

Super excited to share some of the techniques we are using for efficient LLM inference! Many thanks to Daya Khudia and the team for this hard work. Some quick things to know:

1) Use a good inference server such as vLLM or TRT-LLM.
2) Good servers support continuous batching, which groups incoming requests at a per-token level to balance latency and throughput (a toy sketch of the idea follows below).
3) At low batch sizes, inference latency is dominated by memory bandwidth; at high batch sizes it is dominated by FLOPS (see the back-of-the-envelope calculation below).
4) Mixture of Experts models are awesome because only a fraction of the parameters is active for any given token, reducing demands on both memory bandwidth and FLOPS.
5) Quantization helps with little accuracy loss, since it cuts weight memory traffic in proportion to the reduction in bit width. FP8 on newer cards can also double the effective FLOPS!

https://lnkd.in/gszuGAjz
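
On point 2, here is a minimal toy sketch of the continuous-batching idea in Python. This is not vLLM's or TRT-LLM's actual scheduler; the request lengths, the max_batch limit, and the Request / continuous_batching names are all made up for illustration. The point is simply that requests join and leave the running batch at every token step instead of waiting for a whole batch to drain.

# Toy illustration of continuous (per-token) batching -- not a real
# serving engine's scheduler, just the core idea: new requests can be
# admitted into the running batch at every decode step, and finished
# requests free their slot immediately.
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    rid: int
    tokens_remaining: int                      # decode steps this request still needs
    output: list = field(default_factory=list)

def continuous_batching(waiting: deque, max_batch: int = 4) -> int:
    running: list[Request] = []
    step = 0
    while waiting or running:
        # Admit new requests at a token boundary, up to the batch limit.
        while waiting and len(running) < max_batch:
            running.append(waiting.popleft())
        # One "forward pass": every running request decodes one token.
        for req in running:
            req.output.append(f"tok{step}")
            req.tokens_remaining -= 1
        # Finished requests leave right away, freeing slots for waiting ones.
        running = [r for r in running if r.tokens_remaining > 0]
        step += 1
    return step

if __name__ == "__main__":
    reqs = deque(Request(rid=i, tokens_remaining=n)
                 for i, n in enumerate([3, 8, 2, 5, 4]))
    print("total decode steps:", continuous_batching(reqs))

Because finished requests release their slot at the next token boundary, short requests are not held hostage by long ones, which is where the latency/throughput win comes from.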
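On points 3 and 5, a back-of-the-envelope decode-step estimate shows where the memory-bandwidth and FLOPS regimes cross over. Every number below is an illustrative assumption (a dense ~7B-parameter model, roughly H100-class bandwidth and compute), and it ignores KV-cache traffic and attention FLOPs, so treat it as a sketch rather than a benchmark.

# Rough decode-step latency model (illustrative numbers only).
def decode_step_time(batch_size, params=7e9, bytes_per_param=2,
                     mem_bw=3.3e12, flops=0.5e15):
    # Memory term: every weight is streamed from HBM once per decode step,
    # roughly independent of batch size.
    t_memory = params * bytes_per_param / mem_bw
    # Compute term: ~2 FLOPs per parameter per sequence in the batch.
    t_compute = 2 * params * batch_size / flops
    # Roofline-style bound: the step takes at least as long as the slower term.
    return max(t_memory, t_compute)

for b in (1, 64, 512):
    fp16 = decode_step_time(b)                                   # 16-bit weights
    fp8  = decode_step_time(b, bytes_per_param=1, flops=1.0e15)  # half the bytes, ~2x tensor FLOPS
    print(f"batch {b:4d}: fp16 ~{fp16 * 1e3:.1f} ms/step   fp8 ~{fp8 * 1e3:.1f} ms/step")

At small batches the memory term dominates, so halving bytes per parameter (FP8 vs FP16) roughly halves time per token; at large batches the compute term takes over, where FP8's doubled tensor throughput helps instead. For a Mixture of Experts model you would plug in the active parameter count rather than the total, which is point 4.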

Fast, Secure and Reliable: Enterprise-grade LLM Inference

databricks.com

Hyunho Yeo

ML Infrastructure Engineer at Moloco | KAIST PhD

5mo

Thank you for the great article! Has your team also tried GPU-friendly attention kernels (e.g., FlashDecoding developed by TogetherAI)?
