Super excited to share some of the techniques we are using for efficient LLM inference! Many thanks to Daya Khudia and the team for this hard work. Some quick things to know:
1) Use a good inference server like vLLM or TRT-LLM.
2) Good servers support continuous batching, which groups incoming requests together at a per-token level to balance latency and throughput.
3) At low batch sizes, inference latency is determined primarily by memory bandwidth; at high batch sizes, it is determined primarily by available FLOPS (see the roofline sketch after this post).
4) Mixture-of-Experts models are awesome because only a fraction of the parameters is active for any given token, reducing demands on both memory bandwidth and FLOPS.
5) Quantization helps with little accuracy loss, since it cuts memory traffic in proportion to the reduction in precision. FP8 on newer cards can also double the effective FLOPS!
https://lnkd.in/gszuGAjz
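To make point 3 concrete, here is a back-of-the-envelope roofline sketch. The A100 peak numbers and the two-FLOPs-per-weight decode model are my own approximations, not from the linked post.

```python
# A rough roofline check for LLM decoding: at small batch sizes decoding is
# memory-bandwidth-bound, at large batch sizes it becomes FLOPS-bound.
# Hardware numbers are approximate figures for an A100-80GB.
PEAK_FLOPS = 312e12        # ~A100 BF16 tensor-core peak, FLOP/s
PEAK_BW = 2.0e12           # ~A100-80GB HBM bandwidth, bytes/s

# Machine balance: how many FLOPs the GPU can do per byte it moves.
balance = PEAK_FLOPS / PEAK_BW           # ~156 FLOPs/byte

# Per decode step (assuming weights are read once for the whole batch), a
# matmul does ~2 FLOPs per weight per sequence, so arithmetic intensity
# grows with batch size.
BYTES_PER_WEIGHT = 2                     # fp16/bf16
for batch in (1, 8, 64, 256):
    intensity = 2 * batch / BYTES_PER_WEIGHT
    bound = "memory-bandwidth" if intensity < balance else "FLOPS"
    print(f"batch={batch:4d}: ~{intensity:4.0f} FLOPs/byte -> {bound}-bound")
```

Quantization shifts the crossover: halving bytes per weight doubles the arithmetic intensity at a given batch size, which is exactly why it relieves the bandwidth-bound regime in point 5.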
Building the future of LLMs. Cofounder & CEO, Lamini. CS Faculty at Stanford. MIT Technology Review’s 35 Under 35. (Speaker).
Excited to announce Lamini Memory Tuning, a new research breakthrough! 🎉
◽ 95%+ accuracy, cutting hallucinations by 10x
◽ Turns any open LLM into a 1M-way adapter Mixture of Memory Experts (paper & Lamini-1 model weights on Hugging Face), e.g. one memory expert could recall facts from AMD's last earnings report, another from Nvidia's.
◽ Fortune 500 customer case study on how they memory-tuned their SQL agent to reach 95% accuracy, when they could only get to 50% with advanced RAG & instruction fine-tuning
👉 Drop us a note at [email protected] to Memory Tune on your data. We read every message. 🧱
Come to my Databricks talk at 11am today for a deep dive on the technical paper & customer case study.
All the resources! 📜📜📜
Blogpost: https://lnkd.in/gPTrKdBa
Case study: https://lnkd.in/gNVgBRXc
Paper: https://lnkd.in/grqTsaZE
Weights ✨: https://lnkd.in/g2ibWUm6
Contact us for a demo on your data: https://lnkd.in/gcBr9XnK
Introducing Lamini Memory Tuning: 95% LLM Accuracy, 10x Fewer Hallucinations | Lamini - Enterprise LLM Platform
lamini.ai
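As a loose illustration of the "many tiny memory experts" idea above, here is a toy sketch of routing each input to a few experts out of a large pool of small low-rank adapters. This is NOT Lamini's actual architecture; the routing rule, class name, and all sizes are illustrative assumptions.

```python
# Toy mixture-of-memory-experts: experts live in embedding tables, so only
# the selected rows are ever touched; in principle the pool could be scaled
# toward millions of experts. Purely illustrative, not Lamini's method.
import torch
import torch.nn as nn

class MemoryExperts(nn.Module):
    def __init__(self, d_model=64, n_experts=10_000, rank=4, top_k=2):
        super().__init__()
        self.keys = nn.Embedding(n_experts, d_model)          # routing keys
        self.down = nn.Embedding(n_experts, d_model * rank)   # per-expert A
        self.up = nn.Embedding(n_experts, rank * d_model)     # per-expert B
        self.d_model, self.rank, self.top_k = d_model, rank, top_k

    def forward(self, h):                                # h: (batch, d_model)
        # Select the top-k experts whose keys best match the hidden state.
        scores = h @ self.keys.weight.T                  # (batch, n_experts)
        top = scores.topk(self.top_k, dim=-1).indices    # (batch, top_k)
        out = h
        for k in range(self.top_k):                      # apply low-rank updates
            idx = top[:, k]
            down = self.down(idx).view(-1, self.d_model, self.rank)
            up = self.up(idx).view(-1, self.rank, self.d_model)
            out = out + (h.unsqueeze(1) @ down @ up).squeeze(1)
        return out

layer = MemoryExperts()
print(layer(torch.randn(3, 64)).shape)   # torch.Size([3, 64])
```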
Storing the KV cache for large language models (LLMs) has become a popular optimization during inference. However, as context length increases, the cache's memory footprint can grow substantially, consuming a significant amount of GPU memory. Enter Multi-Query Attention! It shrinks the cache and speeds up inference 12x. I've explained it in a blog.
TL;DR
- All the attention heads in a multi-head attention block share common K and V vectors.
- 12x speedup with minimal performance degradation
https://lnkd.in/gA-tmMkM
One Attention Head To Rule Them All
aimaximus.substack.com
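For intuition, here is a minimal sketch of MQA in PyTorch; the shapes are illustrative assumptions, not taken from the linked blog.

```python
# Multi-query attention (MQA): every query head is kept, but all heads share
# a single K and V head, so the KV cache shrinks by a factor of n_heads.
import torch
import torch.nn.functional as F

batch, seq, n_heads, head_dim = 2, 128, 12, 64

q = torch.randn(batch, n_heads, seq, head_dim)   # one query per head, as usual
k = torch.randn(batch, 1, seq, head_dim)         # single shared key head
v = torch.randn(batch, 1, seq, head_dim)         # single shared value head

scores = (q @ k.transpose(-2, -1)) / head_dim**0.5   # K broadcasts over heads
attn = F.softmax(scores, dim=-1)
out = attn @ v                                       # V broadcasts over heads

# KV cache per token: 2 * head_dim values instead of 2 * n_heads * head_dim.
print(out.shape)   # torch.Size([2, 12, 128, 64])
```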
Graph semantics and AI trends strategist / Change consultant / Veteran researcher, analyst and reporter
A year ago, Denny Vrandečić of the Wikimedia Foundation, one of the key people behind Wikidata, made some telling remarks about when not to use LLMs during a keynote at The Knowledge Graph Conference: “Why would you ever use a 96 layer LLM with 175 billion parameters to generate a multiplication, which is a single operation with a CPU? Just because an LLM can do these kinds of things doesn’t mean they should be. Why should you be generating knowledge again and again when you can just look it up in a confident way?… It’s just not very efficient.” What has happened in the year since Vrandečić made this observation? Platform providers such as Fluree are making it possible to harness the power of LLMs and knowledge graphs together in ways that allow more accuracy, efficiency, and security. https://t.ly/MyPQ8
GenAI Maturity: From Productivity To Effectiveness - DataScienceCentral.com
https://www.datasciencecentral.com
This is pretty cool and shows the dependencies between algorithms, computation, and data: none can succeed without the contribution and support of the other two. Foundation models are computation and data pigs (sorry, need to call it out). We are only where we are because of the existence of the algorithms plus the computation and data, in a viable infrastructure to support them. To improve on where we are now, we will need to continue improving all three elements. https://lnkd.in/eUUDSNye
IBM uses Storage Scale in its AI model training – Blocks and Files
https://blocksandfiles.com
Building the future of AI infra. Apache Software Foundation Member; Former Head of Open Source (Ray) + Head of Field Engineering @ Anyscale.
If you are building #RAG applications, the first step is to create embeddings from your data. Check out how you can do this with #Ray and Anyscale at scale and with unprecedented efficiency (90% cost reduction 🤯 vs. the #OpenAI embedding API). If you are interested in trying this, please fill out https://lnkd.in/gDnvqZvR This is achieved through Ray's *unified* support for all steps of the pipeline, with full *flexibility* in hardware / ML model for each operation (e.g. a mix of CPU, A10, and A100). 🚀 https://lnkd.in/gFeZ5akX (bit.ly/rag-embedding) Great collaboration with the Pinecone team (check out their major launch https://lnkd.in/guYY7KPF!). Nathan Cordeiro Ram Sriharsha Roy Miara ❤️
RAG at Scale: 10x Cheaper Embedding Computations with Anyscale and Pinecone
anyscale.com
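For a flavor of what such a pipeline can look like, here is a minimal sketch using Ray Data (assuming the Ray 2.9+ API). The bucket paths, model, batch size, and actor-pool size are placeholders; this is not Anyscale's exact pipeline.

```python
# Scaled embedding computation with Ray Data: a pool of GPU actors, each
# loading the embedding model once and processing batches of text.
import ray
from sentence_transformers import SentenceTransformer

class Embedder:
    def __init__(self):
        # Each actor in the pool loads the model once and reuses it per batch.
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    def __call__(self, batch):
        batch["embedding"] = self.model.encode(list(batch["text"]))
        return batch

ds = ray.data.read_text("s3://my-bucket/docs/")   # one text record per line
ds = ds.map_batches(
    Embedder,
    batch_size=256,
    num_gpus=1,       # run each embedding actor on a GPU (e.g. an A10)
    concurrency=4,    # actor pool of 4 GPU workers
)
ds.write_parquet("s3://my-bucket/embeddings/")
```

The per-operation resource arguments are what give the hardware flexibility mentioned above: a CPU-only parsing step and a GPU embedding step can live in the same pipeline.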
A very promising and intriguing method for improving RAG systems and reducing hallucinations: Memory Tuning by Lamini. The goal is to store facts in a massive mixture of millions of memory experts that are retrieved dynamically at inference time. The benchmarks look promising; looking forward to the general public release.
Principal MLE @REA-group | Ex-Canva | Gen AI | LLMs | NLP Expert | Tinkering Transformers | Machine Learning Engineer | Data Scientist
It's been a while, everyone. This is my latest blog on LLMs, specifically caching prompts for the GPU poor. Unfortunately, this is not as simple as string caching on your hard drive. What we discuss is the same technique that the vLLM library uses under the hood. Read on below... https://lnkd.in/gmBXNu7b #deeplearning #LLM Hugging Face
Prompt Caching: Poor man’s guide to zero shot vision-LLM classification – deepschool.ai
sachinruk.github.io
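To illustrate the idea, here is a minimal prefix-KV-caching sketch with Hugging Face transformers; the model and prompts are placeholders, and this shows only the general technique, not vLLM's implementation (vLLM does this natively with paged/prefix caching).

```python
# Prompt (prefix) KV caching: run the shared prompt once, keep its key/value
# cache, and reuse it for every query so only query tokens are recomputed.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prefix = "You are a helpful classifier. Answer yes or no.\n"
prefix_ids = tok(prefix, return_tensors="pt").input_ids

with torch.no_grad():
    # One forward pass over the shared prefix; keep its keys/values.
    cached_kv = model(prefix_ids, use_cache=True).past_key_values

def answer(query: str) -> str:
    ids = tok(query, return_tensors="pt").input_ids
    # Copy the cache per request: newer transformers Cache objects are
    # mutated in place as new tokens are appended.
    kv = copy.deepcopy(cached_kv)
    with torch.no_grad():
        out = model(ids, past_key_values=kv, use_cache=True)
    return tok.decode(out.logits[:, -1].argmax(dim=-1))

print(answer("Is the sky blue? Answer:"))
```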
What is blocking LLMs from using long-context inputs? 🚨 Introducing KVQuant, which allows serving LLaMA-7B with 1M tokens on a single A100! 🔥

The current longest-context model is Claude 2.1, which is limited to 200K tokens. What makes increasing this hard? Two key problems:
(i) Memory wall: the need to store Key and Value activations for each token becomes a major issue at long context lengths: we quickly run out of memory even on 8-GPU systems. You can't run LLaMA-70B with >32K context on an A100 because of this.
(ii) Catastrophic forgetting: LLMs do not pay attention to long sequences and tend to focus only on the beginning or the end. So far, training on long documents and better positional encodings have enabled up to 200K context length, but further scaling to longer contexts without an accuracy drop is an open problem.

KVQuant addresses the first challenge and enables long-context inference by quantizing the cached Key/Value activations to ultra-low precision without accuracy degradation, exploiting several consistent patterns observed in cached KV values across different LLMs (a toy quantization sketch follows below).

Why does this matter? The ability to consume such large contexts could increase the accuracy of LLMs by letting you provide diverse in-context examples or include large files in a general LLM, making it competitive with a fine-tuned model trained on that data. For instance, consider a coding co-pilot: a 10M-token context provides enough space for the LLM to consume ~1M lines of code, which could help unlock new insights.

Please see https://lnkd.in/gmtrsK3P for a quick TLDR of our method.
Paper: https://lnkd.in/g6y7Yw3x
Code: https://lnkd.in/gPmnNAET
Joint work with: Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael Mahoney, Sophia Shao, Kurt Keutzer
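Here is the toy quantization sketch mentioned above: uniform low-bit quantization of a KV tensor with per-slice scales. KVQuant's actual method is more sophisticated (non-uniform quantization, outlier handling), so treat this purely as an illustration of why low-bit KV caches shrink memory.

```python
# Toy low-bit KV-cache quantization with per-slice scales along a chosen axis
# (e.g. per-channel for keys, per-token for values).
import torch

def quantize(x, n_bits=4, dim=-2):
    # Uniform asymmetric quantization along `dim`.
    qmax = 2**n_bits - 1
    lo = x.amin(dim=dim, keepdim=True)
    hi = x.amax(dim=dim, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / qmax
    q = ((x - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.float() * scale + lo

# (batch, heads, seq, head_dim); reduce over the token axis for per-channel scales.
keys = torch.randn(1, 8, 1024, 128)
q, s, z = quantize(keys, n_bits=4, dim=-2)
err = (dequantize(q, s, z) - keys).abs().mean()
# This toy stores 4-bit values one per uint8 byte; real kernels pack two per byte.
print(f"mean abs error: {err:.4f}")
```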
Anyone in for an early Christmas present? Snowpark Container Services is now in public preview! It allows you to bring any workload, in any language, with any runtime and any code, to run right next to your data, and to power the compute with CPUs or GPUs. This makes Snowflake the ideal home for a whole host of new workloads: LLMs, model training in data science, data applications, and even more advanced data processing and data engineering. https://lnkd.in/eXSwK__S
Unlock the New Wave of Gen AI With Snowpark Container Services
snowflake.com
Inference is where the majority of your GenAI compute costs skyrocket! Efficiently serving large language models (LLMs) is critical for deploying responsive AI applications within your budget. This blog compares inference serving engines such as Triton Inference Server, vLLM, TGI, and other leading engines/model servers, each offering unique advantages in terms of latency, throughput, and specialized use cases. It's a good read that helps you pick the right platform for specific requirements like model parallelism, edge computing, and CPU optimization (a minimal vLLM example follows below). #GenAI #Inference #Huggingface #AICompute #TCO
7 Frameworks for Serving LLMs
betterprogramming.pub
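If you want to kick the tires on one of the engines covered, here is a minimal vLLM offline-inference example; the model name is a placeholder and vLLM is assumed to be installed (`pip install vllm`).

```python
# Minimal vLLM offline inference: the engine batches prompts with continuous
# batching under the hood, no server setup required.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                 # small placeholder model
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["What is continuous batching?"], params)
print(outputs[0].outputs[0].text)
```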
ML Infrastructure Engineer at Moloco | KAIST PhD
Thank you for the great article! Has your team also tried GPU-friendly attention kernels (e.g., FlashDecoding developed by TogetherAI)?