Leveraging Larger Context Windows in RAG: Benefits and Cost Considerations

Introduction:

Large Language Models (LLMs) are constantly evolving, and one clear trend is the growth of their context windows. Retrieval-Augmented Generation (RAG) is a powerful technique that enhances LLM performance by incorporating information from external knowledge bases. Very often this is proprietary (confidential) company data that cannot be part of the corpora used to train publicly available LLMs. RAG is an alternative or complementary method to prompt tuning and model fine-tuning: it gives the model a more comprehensive view of the relevant context and can improve the accuracy and relevance of LLM outputs.
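
To make the mechanics concrete, here is a minimal sketch of a RAG flow in Python. It assumes an OpenAI-compatible chat client and a toy in-memory knowledge base; the model name and the naive keyword retrieval are illustrative placeholders, not a recommendation of a specific stack.

```python
# Minimal RAG sketch: retrieve relevant text, put it in the prompt, generate.
# Assumptions: OPENAI_API_KEY is set; model name and retrieval logic are placeholders.
from openai import OpenAI

client = OpenAI()  # any OpenAI-compatible client would work here

# Toy knowledge base standing in for proprietary company documents.
DOCUMENTS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday to Friday, 9:00-17:00 CET.",
    "Enterprise contracts include a 99.9% uptime SLA.",
]

def retrieve(question: str, top_k: int = 2) -> list[str]:
    # Naive keyword-overlap retrieval; a real system would use embeddings + a vector store.
    words = set(question.lower().split())
    scored = sorted(DOCUMENTS, key=lambda d: len(words & set(d.lower().split())), reverse=True)
    return scored[:top_k]

def answer_with_rag(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    response = client.chat.completions.create(
        model="gpt-4o",  # any chat model with a sufficiently large context window
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

print(answer_with_rag("How long do customers have to return a product?"))
```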

Context Windows and RAG:

Context windows refer to the amount of text an LLM can consider during generation. Larger context windows enable LLMs to process more information from retrieved data in RAG scenarios; in extreme cases, an entire book or movie script, thousands of pages of text, can be loaded at once. This leads to several benefits:

* Improved Factual Accuracy: With a larger context window, the LLM can consider more retrieved information, reducing the risk of factual inaccuracies or irrelevant outputs (aka hallucinations).

* Deeper Understanding of Complexities: Larger context windows allow the LLM to grasp intricate connections within the retrieved information, improving its understanding of complex relationships.

* Reduced Ranking Needs: In theory, a large enough context window could allow the LLM to directly analyze and prioritize relevant information from the retrieved data, reducing the need for a separate ranking step before the information is fed into the prompt. This depends on the LLM's capabilities and on task complexity. I am not suggesting putting all private data into the prompt: that can lead to information overload (causing confusion and worse results), pass irrelevant details to the model, and increase processing time, slowing down the whole RAG pipeline. Some level of pre-processing is still needed; see the sketch after this list.
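
As a rough illustration of that pre-processing, the sketch below packs as many of the top-ranked retrieved chunks as the available context window allows before truncating, using the tiktoken tokenizer for counting. The window sizes, output reserve, and dummy passages are assumptions for demonstration only.

```python
# Illustrative sketch (not a specific library's API): pack top-ranked retrieved
# chunks into the context window instead of aggressively reranking and truncating.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def build_context(chunks: list[str], context_window: int, reserved_for_output: int = 4_000) -> str:
    budget = context_window - reserved_for_output   # leave room for the generated answer
    selected, used = [], 0
    for chunk in chunks:                            # chunks assumed ordered by retrieval score
        cost = len(encoding.encode(chunk))
        if used + cost > budget:                    # stop once the token budget is exhausted
            break
        selected.append(chunk)
        used += cost
    return "\n\n".join(selected)

# Dummy passages of ~1,000+ tokens each, standing in for retrieved documents.
chunks = [f"Retrieved passage {i}. " + "lorem ipsum " * 400 for i in range(200)]

small = build_context(chunks, context_window=8_000)     # small window: few chunks survive
large = build_context(chunks, context_window=128_000)   # large window: most chunks fit
print(len(small.split("\n\n")), "vs", len(large.split("\n\n")), "passages included")
```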

A few examples of models and their context window sizes:

AWS and Microsoft are not listed because they do not promote their own large LLMs. AWS provides access to multiple models through its Bedrock platform. Microsoft relies on its partnership with OpenAI and OpenAI's LLMs, focusing its own development on the very small Phi models (1.3B and 2.7B parameters). I wrote about Large vs. Small language models HERE. Most vendors offer access to multiple foundation models, including open-source models and clients' own custom models.

Cost Considerations:

Many LLM providers charge based on the number of tokens processed during inference (generation). Since RAG involves feeding the LLM additional information from the knowledge base, the overall input token count increases. This translates to higher inference costs for cloud-based LLMs with per-token pricing.

While the cost per million input tokens might seem relatively low (just a few dollars), these costs can accumulate rapidly in a production environment with thousands of users.
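
A quick back-of-the-envelope calculation shows how fast this adds up. All prices and volumes below are assumptions for illustration only; substitute your provider's actual pricing and your own traffic.

```python
# Back-of-the-envelope cost estimate; every figure here is an assumption.
PRICE_PER_MILLION_INPUT_TOKENS = 3.00   # USD, assumed
TOKENS_PER_REQUEST = 20_000             # large RAG prompt (retrieved context + question)
REQUESTS_PER_USER_PER_DAY = 10
USERS = 5_000

daily_tokens = TOKENS_PER_REQUEST * REQUESTS_PER_USER_PER_DAY * USERS
monthly_cost = daily_tokens / 1_000_000 * PRICE_PER_MILLION_INPUT_TOKENS * 30

print(f"Input tokens per day: {daily_tokens:,}")                      # 1,000,000,000
print(f"Estimated monthly input-token cost: ${monthly_cost:,.0f}")    # ~$90,000
```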

On-Premise Deployment as an Alternative:

On-premise deployment, where you run LLMs on your own infrastructure, offers an alternative. While it requires an upfront investment in AI hardware, it eliminates per-token pricing. This opens doors for:

* Larger Context Windows: You're not limited by cost per token, allowing you to leverage a larger knowledge base for RAG, providing the LLM with a richer context for even more accurate generation.

* Private Data Integration: On-premise deployment allows you to incorporate your own private datasets into the RAG knowledge base. This can be invaluable for tasks requiring domain-specific knowledge or handling sensitive data that can't be shared in the cloud. (A minimal sketch of calling a self-hosted model follows below.)
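
As a rough illustration of the integration side, the sketch below points an OpenAI-compatible client at a self-hosted inference server (for example vLLM or llama.cpp in server mode, both of which expose an OpenAI-style endpoint). The host, port, and model name are placeholders for whatever you actually deploy.

```python
# Sketch of calling a self-hosted model through an OpenAI-compatible endpoint.
# Assumes a local inference server (e.g., vLLM or llama.cpp server) is already running.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # local inference server, no per-token billing
    api_key="not-needed-locally",         # placeholder; local servers often ignore it
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-70B-Instruct",  # whatever model your server hosts
    messages=[{"role": "user", "content": "Summarize our internal security policy: ..."}],
)
print(response.choices[0].message.content)
```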

Cloud vs. On-Premise: Balancing Cost and Capability

The best deployment model depends on your specific needs. Consider these factors:

* Value of Context: How crucial is contextual accuracy for your use case?

* Data Privacy and Security: Do you need to incorporate sensitive data?

* Total Cost of Ownership (TCO): Compare cloud-based inference costs with the upfront hardware investment, maintenance, and ongoing operational expenses of on-premise deployment (a simple break-even sketch follows below).
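
For the TCO comparison, a simplified break-even sketch like the one below can be a starting point. Every figure in it is an assumption for illustration, not a vendor quote.

```python
# Simplified break-even sketch: per-token cloud pricing vs. an on-premise GPU server.
# All figures are assumptions for illustration only.
cloud_cost_per_million_tokens = 3.00   # USD, input tokens, assumed
monthly_tokens = 30_000_000_000        # 30B tokens/month across all users, assumed

onprem_hardware_cost = 300_000.0       # GPU server purchase, assumed
onprem_monthly_opex = 5_000.0          # power, cooling, maintenance, staff share, assumed
amortization_months = 36               # hardware written off over three years

cloud_monthly = monthly_tokens / 1_000_000 * cloud_cost_per_million_tokens
onprem_monthly = onprem_hardware_cost / amortization_months + onprem_monthly_opex

print(f"Cloud (per-token) monthly cost: ${cloud_monthly:,.0f}")    # ~$90,000
print(f"On-premise amortized monthly:   ${onprem_monthly:,.0f}")   # ~$13,300
```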

Conclusion:

RAG is a powerful tool for enhancing LLM capabilities. A larger context window is a clear benefit in RAG scenarios but has cost implications (input token count). On-premise implementations can eliminate per-token pricing but require investment in AI hardware infrastructure (or renting IaaS). Carefully evaluate your use case, data privacy needs, and TCO to choose the deployment model (cloud-based or on-premise) that best balances cost and benefits.

Have you explored RAG for your LLMs? Are you leveraging the larger context windows in the latest LLMs? How did you decide between cloud-based and on-premise deployment? Let's discuss in the comments below!

