Super excited to share some of the techniques we are using for efficient LLM inference! Many thanks to Daya Khudia and the team for this hard work. Some quick things to know:
1) Use a good inference server like vLLM or TRT-LLM.
2) Good servers support continuous batching, which groups incoming requests together at a per-token level to balance latency and throughput.
3) At low batch sizes, inference latency is determined primarily by memory bandwidth; at high batch sizes, it is determined primarily by available FLOPS (see the roofline sketch after this post).
4) Mixture-of-Experts models are awesome because only a fraction of the parameters is active for any given token, reducing demands on both memory bandwidth and FLOPS.
5) Quantization helps with little accuracy loss, since it cuts memory traffic in proportion to the reduction in precision. FP8 on newer cards can also double the effective FLOPS!
https://lnkd.in/gszuGAjz
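To make point 3 concrete, here is a back-of-the-envelope roofline sketch. The A100 peak numbers and the two-FLOPs-per-weight decode model are my own approximations, not from the linked post.

```python
# A rough roofline check for LLM decoding: at small batch sizes decoding is
# memory-bandwidth-bound, at large batch sizes it becomes FLOPS-bound.
# Hardware numbers are approximate figures for an A100-80GB.
PEAK_FLOPS = 312e12        # ~A100 BF16 tensor-core peak, FLOP/s
PEAK_BW = 2.0e12           # ~A100-80GB HBM bandwidth, bytes/s

# Machine balance: how many FLOPs the GPU can do per byte it moves.
balance = PEAK_FLOPS / PEAK_BW           # ~156 FLOPs/byte

# Per decode step (assuming weights are read once for the whole batch), a
# matmul does ~2 FLOPs per weight per sequence, so arithmetic intensity
# grows with batch size.
BYTES_PER_WEIGHT = 2                     # fp16/bf16
for batch in (1, 8, 64, 256):
    intensity = 2 * batch / BYTES_PER_WEIGHT
    bound = "memory-bandwidth" if intensity < balance else "FLOPS"
    print(f"batch={batch:4d}: ~{intensity:4.0f} FLOPs/byte -> {bound}-bound")
```

Quantization shifts the crossover: halving bytes per weight doubles the arithmetic intensity at a given batch size, which is exactly why it relieves the bandwidth-bound regime in point 5.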
Building the future of LLMs. Cofounder & CEO, Lamini. CS Faculty at Stanford. MIT Technology Review’s 35 Under 35. (Speaker).
Excited to announce Lamini Memory Tuning, a new research breakthrough! 🎉
◽ 95%+ accuracy, cutting hallucinations by 10x
◽ Turns any open LLM into a 1M-way adapter Mixture of Memory Experts (paper & Lamini-1 model weights on Hugging Face), e.g. one memory expert could recall facts from AMD's last earnings report, another from Nvidia's.
◽ Fortune 500 customer case study on how they memory-tuned their SQL agent to reach 95% accuracy, when they could only get to 50% with advanced RAG & instruction fine-tuning
👉 Drop us a note at [email protected] to Memory Tune on your data. We read every message. 🧱
Come to my Databricks talk at 11am today for a deep dive on the technical paper & customer case study.
All the resources! 📜📜📜
Blogpost: https://lnkd.in/gPTrKdBa
Case study: https://lnkd.in/gNVgBRXc
Paper: https://lnkd.in/grqTsaZE
Weights ✨: https://lnkd.in/g2ibWUm6
Contact us for a demo on your data: https://lnkd.in/gcBr9XnK
Introducing Lamini Memory Tuning: 95% LLM Accuracy, 10x Fewer Hallucinations | Lamini - Enterprise LLM Platform
lamini.ai
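As a loose illustration of the "many tiny memory experts" idea above, here is a toy sketch of routing each input to a few experts out of a large pool of small low-rank adapters. This is NOT Lamini's actual architecture; the routing rule, class name, and all sizes are illustrative assumptions.

```python
# Toy mixture-of-memory-experts: experts live in embedding tables, so only
# the selected rows are ever touched; in principle the pool could be scaled
# toward millions of experts. Purely illustrative, not Lamini's method.
import torch
import torch.nn as nn

class MemoryExperts(nn.Module):
    def __init__(self, d_model=64, n_experts=10_000, rank=4, top_k=2):
        super().__init__()
        self.keys = nn.Embedding(n_experts, d_model)          # routing keys
        self.down = nn.Embedding(n_experts, d_model * rank)   # per-expert A
        self.up = nn.Embedding(n_experts, rank * d_model)     # per-expert B
        self.d_model, self.rank, self.top_k = d_model, rank, top_k

    def forward(self, h):                                # h: (batch, d_model)
        # Select the top-k experts whose keys best match the hidden state.
        scores = h @ self.keys.weight.T                  # (batch, n_experts)
        top = scores.topk(self.top_k, dim=-1).indices    # (batch, top_k)
        out = h
        for k in range(self.top_k):                      # apply low-rank updates
            idx = top[:, k]
            down = self.down(idx).view(-1, self.d_model, self.rank)
            up = self.up(idx).view(-1, self.rank, self.d_model)
            out = out + (h.unsqueeze(1) @ down @ up).squeeze(1)
        return out

layer = MemoryExperts()
print(layer(torch.randn(3, 64)).shape)   # torch.Size([3, 64])
```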
Storing the KV cache for large language models (LLMs) has become a popular optimization during inference. However, as context length increases, the cache's memory footprint can grow substantially, consuming a significant amount of GPU memory. Enter Multi-Query Attention! It shrinks the cache and speeds up inference 12x. I've explained it in a blog.
TL;DR
- All the attention heads in a multi-head attention block share common K and V vectors.
- 12x speedup with minimal performance degradation
https://lnkd.in/gA-tmMkM
One Attention Head To Rule Them All
aimaximus.substack.com
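For intuition, here is a minimal sketch of MQA in PyTorch; the shapes are illustrative assumptions, not taken from the linked blog.

```python
# Multi-query attention (MQA): every query head is kept, but all heads share
# a single K and V head, so the KV cache shrinks by a factor of n_heads.
import torch
import torch.nn.functional as F

batch, seq, n_heads, head_dim = 2, 128, 12, 64

q = torch.randn(batch, n_heads, seq, head_dim)   # one query per head, as usual
k = torch.randn(batch, 1, seq, head_dim)         # single shared key head
v = torch.randn(batch, 1, seq, head_dim)         # single shared value head

scores = (q @ k.transpose(-2, -1)) / head_dim**0.5   # K broadcasts over heads
attn = F.softmax(scores, dim=-1)
out = attn @ v                                       # V broadcasts over heads

# KV cache per token: 2 * head_dim values instead of 2 * n_heads * head_dim.
print(out.shape)   # torch.Size([2, 12, 128, 64])
```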
Graph semantics and AI trends strategist / Change consultant / Veteran researcher, analyst and reporter
A year ago, Denny Vrandečić of the Wikimedia Foundation, one of the key people behind Wikidata, made some telling remarks about when not to use LLMs during a keynote at The Knowledge Graph Conference: “Why would you ever use a 96 layer LLM with 175 billion parameters to generate a multiplication, which is a single operation with a CPU? Just because an LLM can do these kinds of things doesn’t mean they should be. Why should you be generating knowledge again and again when you can just look it up in a confident way?… It’s just not very efficient.” What has happened in the year since Vrandečić made this observation? Platform providers such as Fluree are making it possible to harness the power of LLMs and knowledge graphs together in ways that allow more accuracy, efficiency, and security. https://t.ly/MyPQ8
GenAI Maturity: From Productivity To Effectiveness - DataScienceCentral.com
https://www.datasciencecentral.com
This is pretty cool and shows the dependencies between algorithms, computation, and data: none can succeed without the contribution and support of the other two. Foundation models are computation and data pigs (sorry, need to call it out). We are only where we are because of the existence of the algorithms plus the computation and data, in a viable infrastructure to support them. To improve on where we are now, we will need to continue improving all three elements. https://lnkd.in/eUUDSNye
IBM uses Storage Scale in its AI model training – Blocks and Files
https://blocksandfiles.com
Building the future of AI infra. Apache Software Foundation Member; Former Head of Open Source (Ray) + Head of Field Engineering @ Anyscale.
If you are building #RAG applications, the first step is to create embeddings from your data. Check out how you can do this with #Ray and Anyscale at scale and with unprecedented efficiency (90% cost reduction 🤯 vs. the #OpenAI embedding API). If you are interested in trying this, please fill out https://lnkd.in/gDnvqZvR This is achieved through Ray's *unified* support for all steps of the pipeline, with full *flexibility* in hardware / ML model for each operation (e.g. a mix of CPU, A10, and A100). 🚀 https://lnkd.in/gFeZ5akX (bit.ly/rag-embedding) Great collaboration with the Pinecone team (check out their major launch https://lnkd.in/guYY7KPF!). Nathan Cordeiro Ram Sriharsha Roy Miara ❤️
RAG at Scale: 10x Cheaper Embedding Computations with Anyscale and Pinecone
anyscale.com
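For a flavor of what such a pipeline can look like, here is a minimal sketch using Ray Data (assuming the Ray 2.9+ API). The bucket paths, model, batch size, and actor-pool size are placeholders; this is not Anyscale's exact pipeline.

```python
# Scaled embedding computation with Ray Data: a pool of GPU actors, each
# loading the embedding model once and processing batches of text.
import ray
from sentence_transformers import SentenceTransformer

class Embedder:
    def __init__(self):
        # Each actor in the pool loads the model once and reuses it per batch.
        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    def __call__(self, batch):
        batch["embedding"] = self.model.encode(list(batch["text"]))
        return batch

ds = ray.data.read_text("s3://my-bucket/docs/")   # one text record per line
ds = ds.map_batches(
    Embedder,
    batch_size=256,
    num_gpus=1,       # run each embedding actor on a GPU (e.g. an A10)
    concurrency=4,    # actor pool of 4 GPU workers
)
ds.write_parquet("s3://my-bucket/embeddings/")
```

The per-operation resource arguments are what give the hardware flexibility mentioned above: a CPU-only parsing step and a GPU embedding step can live in the same pipeline.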
A very promising and intriguing method for improving RAG systems and reducing hallucinations: Memory Tuning by Lamini. The goal is to store facts in a massive mixture of millions of memory experts that are retrieved dynamically at inference time. The benchmarks look promising; looking forward to the general public release.
Principal MLE @REA-group | Ex-Canva | Gen AI | LLMs | NLP Expert | Tinkering Transformers | Machine Learning Engineer | Data Scientist
It's been a while, everyone. This is my latest blog on LLMs, specifically caching prompts for the GPU poor. Unfortunately, this is not as simple as string caching on your hard drive. What we discuss is the same technique that the vLLM library uses under the hood. Read on below... https://lnkd.in/gmBXNu7b #deeplearning #LLM Hugging Face
Prompt Caching: Poor man’s guide to zero shot vision-LLM classification – deepschool.ai
sachinruk.github.io
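To illustrate the idea, here is a minimal prefix-KV-caching sketch with Hugging Face transformers; the model and prompts are placeholders, and this shows only the general technique, not vLLM's implementation (vLLM does this natively with paged/prefix caching).

```python
# Prompt (prefix) KV caching: run the shared prompt once, keep its key/value
# cache, and reuse it for every query so only query tokens are recomputed.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prefix = "You are a helpful classifier. Answer yes or no.\n"
prefix_ids = tok(prefix, return_tensors="pt").input_ids

with torch.no_grad():
    # One forward pass over the shared prefix; keep its keys/values.
    cached_kv = model(prefix_ids, use_cache=True).past_key_values

def answer(query: str) -> str:
    ids = tok(query, return_tensors="pt").input_ids
    # Copy the cache per request: newer transformers Cache objects are
    # mutated in place as new tokens are appended.
    kv = copy.deepcopy(cached_kv)
    with torch.no_grad():
        out = model(ids, past_key_values=kv, use_cache=True)
    return tok.decode(out.logits[:, -1].argmax(dim=-1))

print(answer("Is the sky blue? Answer:"))
```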
What is blocking LLMs from using long-context inputs? 🚨 Introducing KVQuant, which allows serving LLaMA-7B with 1M tokens on a single A100! 🔥

The current longest-context model is Claude 2.1, which is limited to 200K tokens. What makes increasing this hard? Two key problems:
(i) Memory wall: the need to store Key and Value activations for each token becomes a major issue at long context lengths: we quickly run out of memory even on 8-GPU systems. You can't run LLaMA-70B with >32K context on an A100 because of this.
(ii) Catastrophic forgetting: LLMs do not pay attention to long sequences and tend to focus only on the beginning or the end. So far, training on long documents and better positional encodings have enabled up to 200K context length, but further scaling to longer contexts without an accuracy drop is an open problem.

KVQuant addresses the first challenge and enables long-context inference by quantizing the cached Key/Value activations to ultra-low precision without accuracy degradation, exploiting several consistent patterns observed in cached KV values across different LLMs (a toy quantization sketch follows below).

Why does this matter? The ability to consume such large contexts could increase the accuracy of LLMs by letting you provide diverse in-context examples or include large files in a general LLM, making it competitive with a fine-tuned model trained on that data. For instance, consider a coding co-pilot: a 10M-token context provides enough space for the LLM to consume ~1M lines of code, which could help unlock new insights.

Please see https://lnkd.in/gmtrsK3P for a quick TLDR of our method.
Paper: https://lnkd.in/g6y7Yw3x
Code: https://lnkd.in/gPmnNAET
Joint work with: Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael Mahoney, Sophia Shao, Kurt Keutzer
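Here is the toy quantization sketch mentioned above: uniform low-bit quantization of a KV tensor with per-slice scales. KVQuant's actual method is more sophisticated (non-uniform quantization, outlier handling), so treat this purely as an illustration of why low-bit KV caches shrink memory.

```python
# Toy low-bit KV-cache quantization with per-slice scales along a chosen axis
# (e.g. per-channel for keys, per-token for values).
import torch

def quantize(x, n_bits=4, dim=-2):
    # Uniform asymmetric quantization along `dim`.
    qmax = 2**n_bits - 1
    lo = x.amin(dim=dim, keepdim=True)
    hi = x.amax(dim=dim, keepdim=True)
    scale = (hi - lo).clamp(min=1e-8) / qmax
    q = ((x - lo) / scale).round().clamp(0, qmax).to(torch.uint8)
    return q, scale, lo

def dequantize(q, scale, lo):
    return q.float() * scale + lo

# (batch, heads, seq, head_dim); reduce over the token axis for per-channel scales.
keys = torch.randn(1, 8, 1024, 128)
q, s, z = quantize(keys, n_bits=4, dim=-2)
err = (dequantize(q, s, z) - keys).abs().mean()
# This toy stores 4-bit values one per uint8 byte; real kernels pack two per byte.
print(f"mean abs error: {err:.4f}")
```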
Anyone in for an early Christmas present? Snowpark Container Services is now in public preview! It allows you to bring any workload, in any language, with any runtime and any code, to run right next to your data, and to power the compute with CPUs or GPUs. This makes Snowflake the ideal home for a whole host of new workloads: LLMs, model training in data science, data applications, and even more advanced data processing and data engineering. https://lnkd.in/eXSwK__S
Unlock the New Wave of Gen AI With Snowpark Container Services
snowflake.com
Inference is where the majority of your GenAI compute costs skyrocket! Efficiently serving large language models (LLMs) is critical for deploying responsive AI applications within your budget. This blog compares inference serving engines such as Triton Inference Server, vLLM, TGI, and other leading engines/model servers, each offering unique advantages in terms of latency, throughput, and specialized use cases. It's a good read that helps you pick the right platform for specific requirements like model parallelism, edge computing, and CPU optimization (a minimal vLLM example follows below). #GenAI #Inference #Huggingface #AICompute #TCO
7 Frameworks for Serving LLMs
betterprogramming.pub
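If you want to kick the tires on one of the engines covered, here is a minimal vLLM offline-inference example; the model name is a placeholder and vLLM is assumed to be installed (`pip install vllm`).

```python
# Minimal vLLM offline inference: the engine batches prompts with continuous
# batching under the hood, no server setup required.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")                 # small placeholder model
params = SamplingParams(temperature=0.8, max_tokens=64)

outputs = llm.generate(["What is continuous batching?"], params)
print(outputs[0].outputs[0].text)
```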
ML Infrastructure Engineer at Moloco | KAIST PhD
Thank you for the great article! Has your team also tried GPU-friendly attention kernels (e.g., FlashDecoding developed by TogetherAI)?