How good are LLMs in a long context, and do we need RAG? 🤔 Summary of a Haystack (SummHay) tries to solve the limitations of "Needle in a Haystack" by focusing on challenging information extraction. Google DeepMind Gemini 1.5 Pro performs the best with and without RAG (37-44%), while OpenAI GPT-4o and Anthropic Claude 3 Opus are below 20%. 👀

SummHay includes 92 subtopics for evaluating long-context LLMs and RAG. It was curated by synthesizing "Haystacks" with specific insights repeated across documents. LLMs need to generate summaries that identify relevant insights and accurately cite source documents. Performance is measured using Coverage (how well the summary captures the important insights) and Citation (how accurately the summary cites the source documents).

Insights
💡 RAG always improves the performance of LLMs if the correct information is retrieved
📊 Evaluated 10 LLMs and 50 RAG systems, including GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro
🏆 Claude 3 Opus achieved the highest Coverage; Gemini 1.5 Pro the highest Citation
🎯 Gemini 1.5 Pro is the best LLM without RAG at 37.8; Claude 3 Sonnet 18.3; GPT-4o 11.4
⚙️ Gemini 1.5 Pro + Oracle RAG achieves 44.6, whereas humans achieved 56.1
🔢 Full input is around 100,000 tokens, while Oracle RAG is reduced to 15,000 tokens
📈 Smaller models like Claude 3 Haiku or Gemini 1.5 Flash outperform bigger LLMs (GPT-4o, Claude 3 Opus) with RAG

Paper: https://lnkd.in/eFfKKxJB
Github: https://lnkd.in/evHyfDmr
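To make the two metrics concrete, here is a minimal sketch of how Coverage and a Citation F1 could be computed once insights and citations have been matched. This is a hypothetical simplification: the actual SummHay benchmark uses an LLM judge to decide whether an insight is covered, not exact set matching, and the function names below are illustrative, not from the paper's code.

```python
def coverage(summary_insights, reference_insights):
    """Fraction of reference insights the summary captures (exact-match toy version)."""
    matched = set(summary_insights) & set(reference_insights)
    return len(matched) / len(reference_insights)

def citation_f1(cited_docs, gold_docs):
    """F1 between the documents the summary cites and the gold source documents."""
    cited, gold = set(cited_docs), set(gold_docs)
    if not cited or not gold:
        return 0.0
    precision = len(cited & gold) / len(cited)
    recall = len(cited & gold) / len(gold)
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy example: 2 of 3 reference insights covered; citations half right.
cov = coverage({"insight_a", "insight_b"}, {"insight_a", "insight_b", "insight_c"})
cit = citation_f1({"doc1", "doc2"}, {"doc1", "doc3"})
```

A summary can score high on Coverage while citing the wrong documents, which is why the benchmark reports both.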
Until the quadratic scaling issue is fully resolved without pruning the input sequence, we will always need something like RAG to ensure the critical semantic message is brought to the 'attention' of the LLM. As I've described in the article below, we have been able to expand the core context limit by selectively pruning information connections, which keeps memory and computational usage within reason, but these techniques are just workarounds for the architectural limitations of attention-based transformers; we aren't actually _expanding_ the foundational context size in any meaningful way. https://medium.com/@greg.broadhead/working-with-ai-in-context-958d7936c42e
"Need in a haystack" only scratch the surface of what enhanced context windows can achieve. Shamelessly share my reflection on this topic: https://medium.com/@machangsha/to-retrieve-or-extend-key-considerations-and-research-insights-on-using-rag-and-long-context-llms-73f4dddb08c0
Absolutely. RAG just adds to the context. If we put all documentation in the context, do you know how much compute you'd need to process an essentially unbounded context? It's more efficient from a compute standpoint to use RAG. If you add documents to your RAG pipeline, you don't have to make any other adjustments. If you add them to the context, you need to go in, modify the already long context, and hope that your LLM has enough compute to give a reasonably timed response. TL;DR: infinite contexts do not seem to scale well.
Multiple layers of (or passes through) LLMs, via MoE and specific RAGs/RAG chains, seem to be emerging as the systems we plug into our central LLM/language model grow more complex. One point that may prove crucial is the relative depth of each encoder/decoder and its linked embedding space. This 'depth of subsystems' angle ties into the main parameters and challenges we want to control: accuracy, versatility/adaptability (modularity of training, knowledge reuse, and distribution/incorporation), control of network depth (collapse, ...), and scaling. It seems we are adjusting the parameters for a new 'complexity layer' in our information systems.
Without RAG, how will you provide real-time or dynamic information? The suggestion that long context will replace RAG is itself absurd. In certain scenarios we might be able to feed the LLM all the information in the prompt, but that will never replace RAG.
I think we obviously need RAG; the other option is not affordable in terms of token costs.
It's interesting to see the impact of RAG on LLMs for long texts. Given the performance differences between models like Gemini 1.5 and others, it seems refining RAG could really help. I wonder if better RAG systems could eventually match human performance. What do you think are the biggest hurdles in improving RAG for more accurate information retrieval?
Long-context LLMs can process a lot of data, but there is still a practical limit to the amount of context they can handle efficiently. RAG can selectively retrieve relevant information, reducing the need to process large amounts of irrelevant data. While long-context LLMs reduce the need for some of the traditional uses of RAG, combining both can leverage the strengths of each approach to create more powerful and efficient AI systems.
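The selective-retrieval idea in the comment above can be sketched in a few lines: score documents against the query and pass only the top-k to the model. Bag-of-words cosine similarity here is a stand-in for a real embedding model; the documents and function names are made up for illustration.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = Counter(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: cosine(q, Counter(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

docs = [
    "gemini scores highest on summhay citation",
    "recipe for sourdough bread",
    "rag retrieval reduces input tokens",
]
top = retrieve("rag citation tokens", docs, k=2)
# Only the two relevant documents would reach the LLM's context window.
```

In a production pipeline the same top-k step runs over a vector index instead of a Python list, but the principle, filter before you pay for attention, is identical.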
Artificial Intelligence B.Sc. at THI
I understand the question about whether RAG is needed with long-context LLMs. From my perspective, even with longer context capabilities, LLMs still can't encompass all necessary information within a single context window. Additionally, relying solely on the training data for updated information is challenging, as it becomes outdated quickly. Thus, RAG seems essential to retrieve the most relevant and up-to-date data for accurate summarization and context extraction. However, I'm open to understanding where my viewpoint might fall short and would love to hear your thoughts!