LlamaIndex’s Post


The future of document RAG is multimodal RAG 🖼️📑. In our brand-new cookbook we highlight a new multimodal RAG architecture for processing a slide deck (heavy in text, diagrams, charts, and tables) using LlamaParse, LlamaIndex, and gpt-4o. At its core is a hybrid text/image chunk. In contrast to standard RAG pipelines, which only index and synthesize over text chunks, our multimodal RAG setup does the following:

1. For each page, (a) parse out the text and (b) save a screenshot of the page as a separate image. You can use standard OCR techniques, but also multimodal models, for extraction.
2. Create a hybrid chunk that contains the parsed text plus a file link to the saved image.
3. Embed the chunk with text embeddings (note: you can use image embeddings as well, but we find gaps there).
4. Retrieve relevant chunks by text-embedding similarity.
5. During synthesis, load in both the text and the image.

This combines the benefits of text parsing (well studied) with the visual-recognition strengths of multimodal models. The full cookbook shows how to use LlamaParse for both the text and image extraction, then set it up with gpt-4o for a full pipeline. Check it out: https://lnkd.in/gjaFvwYE

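The five steps above can be sketched in plain Python. This is a toy illustration, not the LlamaIndex/LlamaParse API: the `HybridChunk` class, a bag-of-words stand-in for a real text-embedding model, and the file paths are all assumptions for demonstration.

```python
# Minimal sketch of the hybrid text/image chunk idea from the post.
# HybridChunk, embed, and retrieve are illustrative names, not the LlamaIndex API;
# the bag-of-words "embedding" stands in for a real text-embedding model.
from dataclasses import dataclass
from collections import Counter
from math import sqrt

@dataclass
class HybridChunk:
    page: int
    text: str          # parsed text for this page (step 1a)
    image_path: str    # file link to the saved page screenshot (steps 1b + 2)

def embed(text: str) -> Counter:
    # Stand-in for a text-embedding model (step 3): bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[HybridChunk], k: int = 1) -> list[HybridChunk]:
    # Step 4: retrieve by text-embedding similarity. Step 5 would then load
    # both chunk.text and the image at chunk.image_path into a multimodal LLM.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c.text)), reverse=True)[:k]

chunks = [
    HybridChunk(1, "quarterly revenue chart by region", "pages/page_1.png"),
    HybridChunk(2, "system architecture diagram", "pages/page_2.png"),
]
best = retrieve("show me the revenue chart", chunks)[0]
print(best.page, best.image_path)  # the revenue page wins on text similarity
```

The key design point is that retrieval stays purely textual (cheap and well studied), while the image only enters at synthesis time, when it is attached to the prompt for the multimodal model.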
Prateek Bansal

Co-Founder, Grey Chain | Ex-BCG GenAI Product Director | AI Speaker | GBV Champion

1mo

We have done this successfully and achieved amazing relevance with a similar technique. We parse complex documents through vision models and store the resulting chunk as well, which provides added context. It works brilliantly for flow charts, graphs, etc. Real-world documents are far more complex than the "RAG is just doc parsing" picture many users have; there are hundreds of nuances to handle if you want good results. This does increase costs, though, so we have built a layer that decides whether a particular page of a document needs vision or not. It does not make sense to run vision over the whole document.
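The routing layer described in this comment could look something like the following. This is a hypothetical heuristic, not the commenter's actual implementation: the threshold, function name, and inputs are all assumptions.

```python
# Hypothetical routing heuristic in the spirit of the comment above: only send
# a page to the (expensive) vision model when the cheap text parse looks too
# sparse to stand on its own. Thresholds and names are illustrative assumptions.
def needs_vision(parsed_text: str, num_images_on_page: int,
                 min_chars: int = 200) -> bool:
    """Route a page to the vision model only when the text parse is likely
    missing content: too little extracted text, or embedded figures present."""
    too_little_text = len(parsed_text.strip()) < min_chars
    has_figures = num_images_on_page > 0
    return too_little_text or has_figures

# A text-dense page with no figures skips the vision pass...
print(needs_vision("A" * 500, num_images_on_page=0))         # False
# ...while a chart-heavy page with a thin text layer gets routed to vision.
print(needs_vision("Fig 3: revenue", num_images_on_page=1))  # True
```

In practice the decision could also weigh layout signals from the parser (table density, bounding-box coverage), but the cost-saving idea is the same: pay for vision only on pages where text extraction alone is likely to lose information.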

Farhad Davaripour, Ph.D.

Lead Data Scientist at Arcurve | GenAI & Simulation Specialist | Sessional Instructor at University of Calgary | Mentor

1mo

Excellent. Thanks. FYI: The Colab link in the repo doesn't work.

This is really cool! I haven't done direct API calls to GPT-4o with images yet (besides whatever LlamaParse does). I'm curious about the token expense and latency of processing a PDF or PPT page image versus sending the parsed content as text tokens. My initial thought is to pay once with LlamaParse + gpt-4o to extract the content, but you're probably right that this is where we're headed in a year or so.

Victory Adugbo

Growth Marketing Leader & Business Developer || Expert in Hacking Business Growth in AI, Web3, and FinTech Companies || Automation Expert

1mo

Multi-modal RAG improves efficiency by leveraging both text and visual data, enhancing information synthesis. By parsing text and capturing page screenshots, it combines text parsing with visual recognition. This approach ensures more accurate and comprehensive information retrieval and generation, particularly for documents with complex visual elements.

The significance of a multi-modal Retrieval-Augmented Generation (RAG) architecture lies in its ability to process complex documents that contain both text and visual elements, like diagrams, charts, and tables. By parsing text and capturing screenshots of each page, it combines text parsing with visual recognition, enhancing the synthesis of information. This approach improves the efficiency and accuracy of information retrieval and generation in applications requiring detailed document analysis.

Syed Misbah

Data Science and Engineering Manager @ DISH Network | GenAI, Classical Machine Learning, Statistics

1mo

Apart from llama parse, are there any good recommendations for a self hosted parser?

Sanjay Balikar

Data Science | ML enthusiastic | Problem Solver | Innovation catalyst | Code & Cognition | GenAI

1mo

The multimodal capabilities of this system unlock innovative solutions by seamlessly integrating diverse data types. With #LlamaIndex, this integration is streamlined into a single, efficient pipeline, greatly simplifying the process.

Gajanan Kulkarni

Generative AI, LLM, Python, Service Automation, Cognitive RPA, AI, MLOps, LLMOps, ITIL 4, COBIT 5, ServiceNow GRC, SecOps

1mo

How can we make this work on IT knowledge articles where the steps to fix an issue are given as screenshots?

Kartik Saha

Technical Lead in AI @HCL Tech R&D-Centre of Excellence || Generative AI | LLMops | MLOps | DL | NLP ||

1mo

If we could also retrieve the exact screenshot image (via OCR) matching a query about structured/unstructured data in the document, and return it in the response alongside the text, that would be very helpful.

Refat Ametov

Driving Business Automation & AI Integration | Co-founder of Devstark and SpreadSimple | Stoic Mindset

1mo

This is an interesting approach to multimodal RAG! Combining text and image parsing in one pipeline seems very useful for document processing. How does the accuracy of this hybrid method compare to traditional text-only RAG pipelines? Do you see any challenges or limitations with this method? 
