How good are LLMs in a long context, and do we need RAG? 🤔 Summary of a Haystack (SummHay) tries to solve the limitations of "Needle in a Haystack" by focusing on challenging information extraction. Google DeepMind Gemini 1.5 Pro performs the best with and without RAG (37-44%), while OpenAI GPT-4o and Anthropic Claude 3 Opus are below 20%. 👀

SummHay includes 92 subtopics for evaluating long-context LLMs and RAG. It was curated by synthesizing "Haystacks" with specific insights repeated across documents. LLMs need to generate summaries that identify relevant insights and accurately cite source documents. Performance is measured using Coverage (how well the summary captures the important insights) and Citation (how accurately the summary cites the source documents).

Insights
💡 RAG always improves the performance of LLMs if the correct information is retrieved
📊 Evaluated 10 LLMs and 50 RAG systems, including GPT-4o, Claude 3 Opus, and Gemini 1.5 Pro
🏆 Claude 3 Opus achieved the highest Coverage; Gemini 1.5 Pro the highest Citation
🎯 Gemini 1.5 Pro is the best LLM without RAG at 37.8; Claude 3 Sonnet 18.3; GPT-4o 11.4
⚙️ Gemini 1.5 Pro + Oracle RAG achieves 44.6, whereas humans achieved 56.1
🔢 Full input is around 100,000 tokens, while Oracle RAG is reduced to 15,000 tokens
📈 Smaller models like Claude 3 Haiku or Gemini 1.5 Flash outperform bigger LLMs (GPT-4o, Claude 3 Opus) with RAG

Paper: https://lnkd.in/eFfKKxJB
Github: https://lnkd.in/evHyfDmr
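To make the two metrics concrete, here is a minimal sketch of how Coverage and a Citation F1 could be computed once insights and citations have been matched. This is a hypothetical simplification: the actual SummHay benchmark uses an LLM judge to decide whether an insight is covered, not exact set matching, and the function names below are illustrative, not from the paper's code.

```python
def coverage(summary_insights, reference_insights):
    """Fraction of reference insights the summary captures (exact-match toy version)."""
    matched = set(summary_insights) & set(reference_insights)
    return len(matched) / len(reference_insights)

def citation_f1(cited_docs, gold_docs):
    """F1 between the documents the summary cites and the gold source documents."""
    cited, gold = set(cited_docs), set(gold_docs)
    if not cited or not gold:
        return 0.0
    precision = len(cited & gold) / len(cited)
    recall = len(cited & gold) / len(gold)
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy example: 2 of 3 reference insights covered; citations half right.
cov = coverage({"insight_a", "insight_b"}, {"insight_a", "insight_b", "insight_c"})
cit = citation_f1({"doc1", "doc2"}, {"doc1", "doc3"})
```

A summary can score high on Coverage while citing the wrong documents, which is why the benchmark reports both.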
Until the quadratic scaling issue is fully resolved without pruning the input sequence, we will always need something like RAG to ensure the critical semantic message is brought to the 'attention' of the LLM. As I've described in the article below, we have been able to expand the core context limit by selectively pruning information connections, which keeps memory and computational usage within reason, but these techniques are just workarounds for the architectural limitations of attention-based transformers; we aren't actually _expanding_ the foundational context size in any meaningful way. https://medium.com/@greg.broadhead/working-with-ai-in-context-958d7936c42e
"Need in a haystack" only scratch the surface of what enhanced context windows can achieve. Shamelessly share my reflection on this topic: https://medium.com/@machangsha/to-retrieve-or-extend-key-considerations-and-research-insights-on-using-rag-and-long-context-llms-73f4dddb08c0
Absolutely. RAG just adds to the context. If we put all documentation in the context, do you know how much compute you'd need to process an essentially unbounded context? It's more efficient from a compute standpoint to use RAG. If you add documents to your RAG pipeline, you don't have to make any other adjustments. If you add them to the context, you need to go in, modify the already long context, and hope that your LLM has enough compute to give a reasonably timed response. TL;DR: infinite contexts do not seem to scale well.
Multiple layers of (or passes through) LLMs, via MoE and specific RAGs/RAG chains, seem to be emerging as the systems we plug into our central LLM/language model grow more complex. One point that may prove crucial is the relative depth of each encoder/decoder and its linked embedding space. This 'depth of subsystems' angle ties into the main parameters and challenges we want to control: accuracy, versatility/adaptability (modularity of training, knowledge reuse, and distribution/incorporation), control of network depth (collapse, ...), and scaling. It seems we are adjusting the parameters for a new 'complexity layer' in our information systems.
Without RAG, how will you provide real-time or dynamic information? The suggestion that long context will replace RAG is itself absurd. In certain scenarios we might be able to feed the LLM all the information in the prompt, but that will never replace RAG.
I think we obviously need RAG; the other option is not affordable in terms of token costs.
It's interesting to see the impact of RAG on LLMs for long texts. Given the performance differences between models like Gemini 1.5 and others, it seems refining RAG could really help. I wonder if better RAG systems could eventually match human performance. What do you think are the biggest hurdles in improving RAG for more accurate information retrieval?
Long-context LLMs can process a lot of data, but there is still a practical limit to the amount of context they can handle efficiently. RAG can selectively retrieve relevant information, reducing the need to process large amounts of irrelevant data. While long-context LLMs reduce the need for some of the traditional uses of RAG, combining both can leverage the strengths of each approach to create more powerful and efficient AI systems.
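The selective-retrieval idea in the comment above can be sketched in a few lines: score documents against the query and pass only the top-k to the model. Bag-of-words cosine similarity here is a stand-in for a real embedding model; the documents and function names are made up for illustration.

```python
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = Counter(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: cosine(q, Counter(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

docs = [
    "gemini scores highest on summhay citation",
    "recipe for sourdough bread",
    "rag retrieval reduces input tokens",
]
top = retrieve("rag citation tokens", docs, k=2)
# Only the two relevant documents would reach the LLM's context window.
```

In a production pipeline the same top-k step runs over a vector index instead of a Python list, but the principle, filter before you pay for attention, is identical.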
Artificial Intelligence B.Sc. at THI
I understand the question about whether RAG is needed with long-context LLMs. From my perspective, even with longer context capabilities, LLMs still can't encompass all necessary information within a single context window. Additionally, relying solely on the training data for updated information is challenging, as it becomes outdated quickly. Thus, RAG seems essential to retrieve the most relevant and up-to-date data for accurate summarization and context extraction. However, I'm open to understanding where my viewpoint might fall short and would love to hear your thoughts!