Philipp Schmid’s Post

Philipp Schmid

Technical Lead & LLMs at Hugging Face 🤗 | AWS ML HERO 🦸🏻♂️

How good are LLMs with long context, and do we need RAG? 🤔 Summary of a Haystack (SummHay) addresses the limitations of "Needle in a Haystack" by focusing on challenging information extraction. Google DeepMind Gemini 1.5 Pro performs the best with and without RAG (37-44%), while OpenAI GPT-4o and Anthropic Claude 3 Opus are below 20%. 👀

SummHay includes 92 subtopics for evaluating long-context LLMs and RAG. It was curated by synthesizing "Haystacks" with specific insights repeated across documents. LLMs need to generate summaries that identify the relevant insights and accurately cite the source documents. Performance is measured using Coverage (how well the summary captures the important insights) and Citation (how accurately the summary cites the source documents); see the illustrative scoring sketch below.

Insights:
💡 RAG always improves the performance of LLMs if the correct information is retrieved
📊 Evaluated 10 LLMs and 50 RAG systems, including GPT-4o, Claude 3 Opus, and Gemini-1.5-pro
🏆 Claude 3 Opus achieved the highest Coverage; Gemini-1.5-pro the highest Citation
🎯 Gemini-1.5-pro is the best LLM without RAG with 37.8; Claude 3 Sonnet 18.3; GPT-4o 11.4
⚙️ Gemini-1.5-pro + Oracle RAG achieves 44.6, whereas humans achieved 56.1
🔢 Full input is around 100,000 tokens, while Oracle RAG reduces it to 15,000 tokens
📈 Smaller models like Claude 3 Haiku or Gemini 1.5 Flash outperform bigger LLMs (GPT-4o, Claude 3 Opus) with RAG

Paper: https://lnkd.in/eFfKKxJB
Github: https://lnkd.in/evHyfDmr
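(Illustrative scoring sketch, not the SummHay authors' code; their implementation is in the linked GitHub repo. This is just a minimal Python sketch of how a Coverage score, a Citation score, and one plausible joint score could be computed once a summary's insights and their cited documents have been extracted. All function names and data below are hypothetical.)

```python
# Hypothetical sketch: score a model summary against gold insights.
# Coverage  = fraction of gold insights the summary captures.
# Citation  = how well the cited documents match the gold source documents
#             for the insights that were captured (set-overlap F1 here).

def coverage_and_citation(summary_insights, reference_insights):
    """summary_insights: insight id -> set of cited document ids (from the summary).
    reference_insights: insight id -> set of documents that actually contain it."""
    covered = [i for i in reference_insights if i in summary_insights]
    coverage = len(covered) / len(reference_insights)

    def f1(pred, gold):
        if not pred or not gold:
            return 0.0
        tp = len(pred & gold)
        precision, recall = tp / len(pred), tp / len(gold)
        return 2 * precision * recall / (precision + recall) if tp else 0.0

    citation = (
        sum(f1(summary_insights[i], reference_insights[i]) for i in covered) / len(covered)
        if covered else 0.0
    )
    # One simple joint score: reward summaries that both cover and cite correctly.
    return coverage, citation, coverage * citation


# Toy example: 3 gold insights; the summary covers 2 and cites one correctly.
gold = {"i1": {"doc3", "doc7"}, "i2": {"doc1"}, "i3": {"doc5"}}
pred = {"i1": {"doc3", "doc7"}, "i2": {"doc9"}}
print(coverage_and_citation(pred, gold))  # (0.666..., 0.5, 0.333...)
```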

Pedro José Mora Gallegos

Artificial Intelligence B.Sc. at THI

2w

I understand the question about whether RAG is needed with long-context LLMs. From my perspective, even with longer context capabilities, LLMs still can’t encompass all necessary information within a single context window. Additionally, relying solely on the training data for updated information is challenging, as it becomes outdated quickly. Thus, RAG seems essential to retrieve the most relevant and up-to-date data for accurate summarization and context extraction. However, I’m open to understanding where my viewpoint might fall short and would love to hear your thoughts!

Greg Broadhead

Principal Consultant BROADideas Consulting and Business Architect with QSpark Group

2w

Until the quadratic scaling issue is fully resolved without the need for pruning the input sequence, we will always need something like RAG to ensure the critical semantic message is brought to the 'attention' of the LLM. As I've described in the article below, we have been able to expand the core context limit by selectively pruning information connections, which keeps memory and computational usage within the realm of the reasonable. But these techniques are just work-arounds that address the architectural limitations of attention-based transformers; we aren't actually _expanding_ the foundational context size in any meaningful way. https://medium.com/@greg.broadhead/working-with-ai-in-context-958d7936c42e
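(As a rough back-of-the-envelope on that quadratic cost, using the token counts from the post above; constant factors and attention optimizations are ignored, so this is only meant to show the shape of the scaling.)

```python
# Self-attention compares every token with every other token, so the number of
# pairwise comparisons grows with the square of the sequence length.
full_context = 100_000   # tokens: full haystack, per the post above
rag_context = 15_000     # tokens: Oracle RAG-reduced input, per the post above

attention_ratio = (full_context / rag_context) ** 2
print(f"Full-context attention does roughly {attention_ratio:.0f}x more "
      f"pairwise token comparisons than the RAG-reduced prompt.")
# -> roughly 44x
```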

Changsha Ma

AI Practitioner @AWS

2w

"Need in a haystack" only scratch the surface of what enhanced context windows can achieve. Shamelessly share my reflection on this topic: https://medium.com/@machangsha/to-retrieve-or-extend-key-considerations-and-research-insights-on-using-rag-and-long-context-llms-73f4dddb08c0

Ethan Nelson

AI | Deep Learning | NLP

2w

Absolutely. RAG just adds to the context. If we put all documentation in the context, do you know how much compute you'd need to go through an essentially unbounded context? It's more efficient from a compute standpoint to use RAG. If you add documents to your RAG pipeline, you don't have to make any other adjustments; if you add them to the context, you need to go in, modify the already long context, and hope that your LLM has enough compute to give a reasonably timed response. TL;DR: infinite contexts do not seem to scale well.
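(A minimal, dependency-free sketch of that point: retrieval keeps the prompt short, and adding a document to the corpus does not lengthen the prompt itself. The retriever, documents, and question below are hypothetical toys, not a production RAG pipeline.)

```python
# Toy retriever: score each document against the question with a bag-of-words
# cosine similarity and keep only the top-k, instead of stuffing every
# document into the context window.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, documents: list[str], k: int = 2) -> list[str]:
    q = Counter(question.lower().split())
    ranked = sorted(documents, key=lambda d: cosine(q, Counter(d.lower().split())), reverse=True)
    return ranked[:k]

docs = [
    "The deployment guide covers rolling updates and health checks.",
    "Quarterly revenue grew due to strong cloud demand.",
    "Rolling updates require a readiness probe before traffic shifts.",
]
# Only the retrieved chunks go into the prompt, so growing the corpus never
# grows the prompt; with full-context stuffing, every new document would.
context = "\n".join(retrieve("how do rolling updates shift traffic", docs))
print(context)
```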

Jean-Frédéric Ferté

Game designer, sentient being (in progress)

2w

Multiple layers of (or passes through) LLMs, via MoE and specific RAGs/RAG chains, seem to be happening as the systems we plug into our central LLM/language models/knowledge grow in complexity. One point that may prove crucial is the relative depth of each of the encoders/decoders and their linked embedding spaces. This 'depth of subsystems' angle is tied to the main parameters/challenges we want to control and address: accuracy, versatility/adaptability (modularity of training knowledge, its reuse and distribution/incorporation), control of network depths (collapse...), and scaling. It seems we are adjusting the parameters for a new 'complexity layer' in our information systems.

Koushik Konwar

Making LLM’s Smarter !!

2w

Without RAG, how will you provide real-time / dynamic information? The question of whether long context will replace RAG is itself absurd. In certain scenarios we might be able to feed the LLM all the information in the prompt, but it will never replace RAG.

J.Murat G.

Co-Founder, Board Member at Aimped | AI Solutions Architect | Generative AI | NLP Certification, Master's Degree | Taught 3000+ ML Students Globally | Digital Marketing

2w

I think we of course need RAG; the other option is not affordable in terms of token costs.

Refat Ametov

Pioneering Tech Innovations | Delivering Meaningful Software | Co-founder of Devstark and SpreadSimple | Stoic Mindset

2w

It's interesting to see the impact of RAG on LLMs for long texts. Given the performance differences between models like Gemini 1.5 and others, it seems refining RAG could really help. I wonder if better RAG systems could eventually match human performance. What do you think are the biggest hurdles in improving RAG for more accurate information retrieval?

Nabeel Arain

GenAI & ML Enthusiast | Data Science | Web Developer | ReactJs | JavaScript | Java Expert | Python Expert | C++ | DSA | GitHub |

2w

Long-context LLMs can process a lot of data, but there is still a practical limit to the amount of context they can handle efficiently. RAG can selectively retrieve relevant information, reducing the need to process large amounts of irrelevant data. While long-context LLMs reduce the need for some of the traditional uses of RAG, combining both can leverage the strengths of each approach to create more powerful and efficient AI systems.
