LlamaIndex’s Post


The future of document RAG is multimodal RAG 🖼️📑. In our brand-new cookbook we highlight a new multimodal RAG architecture for processing a slide deck (heavy in text, diagrams, charts, and tables) using LlamaParse, LlamaIndex, and gpt-4o. At its core is a hybrid text/image chunk. In contrast to standard RAG pipelines, which only index and synthesize over text chunks, our multimodal RAG setup does the following:

1. For each page, (a) parse out the text and (b) save a screenshot of the page as a separate image. You can use standard OCR techniques, but also multimodal models, for extraction.
2. Create a hybrid chunk that contains the parsed text plus a file link to the saved image.
3. Embed the chunk with text embeddings (note: you can use image embeddings as well, but we find gaps there).
4. Retrieve relevant chunks by text-embedding similarity.
5. During synthesis, load in both the text and the image.

This combines the benefits of text parsing (well studied) with the visual-recognition strengths of multimodal models. The full cookbook shows how to use LlamaParse for both the text and image extraction, then set it up with gpt-4o for a full pipeline. Check it out: https://lnkd.in/gjaFvwYE

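The five steps above can be sketched in plain Python. This is a toy illustration, not the LlamaIndex/LlamaParse API: the `HybridChunk` class, a bag-of-words stand-in for a real text-embedding model, and the file paths are all assumptions for demonstration.

```python
# Minimal sketch of the hybrid text/image chunk idea from the post.
# HybridChunk, embed, and retrieve are illustrative names, not the LlamaIndex API;
# the bag-of-words "embedding" stands in for a real text-embedding model.
from dataclasses import dataclass
from collections import Counter
from math import sqrt

@dataclass
class HybridChunk:
    page: int
    text: str          # parsed text for this page (step 1a)
    image_path: str    # file link to the saved page screenshot (steps 1b + 2)

def embed(text: str) -> Counter:
    # Stand-in for a text-embedding model (step 3): bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[HybridChunk], k: int = 1) -> list[HybridChunk]:
    # Step 4: retrieve by text-embedding similarity. Step 5 would then load
    # both chunk.text and the image at chunk.image_path into a multimodal LLM.
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c.text)), reverse=True)[:k]

chunks = [
    HybridChunk(1, "quarterly revenue chart by region", "pages/page_1.png"),
    HybridChunk(2, "system architecture diagram", "pages/page_2.png"),
]
best = retrieve("show me the revenue chart", chunks)[0]
print(best.page, best.image_path)  # the revenue page wins on text similarity
```

The key design point is that retrieval stays purely textual (cheap and well studied), while the image only enters at synthesis time, when it is attached to the prompt for the multimodal model.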
Prateek Bansal

Co-Founder, Grey Chain | Ex-BCG GenAI Product Director | AI Speaker | GBV Champion

1mo

We have done this successfully and achieved amazing relevance with a similar technique. We parse complex documents through vision models and store the resulting chunk as well, which provides added context. It works brilliantly for flow charts, graphs, etc. Real-world documents are far more complex than the "RAG is just doc parsing" picture many users have; there are hundreds of nuances to handle if you want good results. This does increase costs, though, so we have built a layer that decides whether a particular page of a document needs vision or not. It does not make sense to run vision over the whole document.
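The routing layer described in this comment could look something like the following. This is a hypothetical heuristic, not the commenter's actual implementation: the threshold, function name, and inputs are all assumptions.

```python
# Hypothetical routing heuristic in the spirit of the comment above: only send
# a page to the (expensive) vision model when the cheap text parse looks too
# sparse to stand on its own. Thresholds and names are illustrative assumptions.
def needs_vision(parsed_text: str, num_images_on_page: int,
                 min_chars: int = 200) -> bool:
    """Route a page to the vision model only when the text parse is likely
    missing content: too little extracted text, or embedded figures present."""
    too_little_text = len(parsed_text.strip()) < min_chars
    has_figures = num_images_on_page > 0
    return too_little_text or has_figures

# A text-dense page with no figures skips the vision pass...
print(needs_vision("A" * 500, num_images_on_page=0))         # False
# ...while a chart-heavy page with a thin text layer gets routed to vision.
print(needs_vision("Fig 3: revenue", num_images_on_page=1))  # True
```

In practice the decision could also weigh layout signals from the parser (table density, bounding-box coverage), but the cost-saving idea is the same: pay for vision only on pages where text extraction alone is likely to lose information.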

Farhad Davaripour, Ph.D.

Lead Data Scientist at Arcurve | GenAI & Simulation Specialist | Sessional Instructor at University of Calgary | Mentor

1mo

Excellent. Thanks. FYI: The Colab link in the repo doesn't work.

This is really cool! I haven't done direct API calls to GPT-4o with images yet (besides whatever LlamaParse does). I'm curious about the token expense and latency of processing a PDF or PPT page image versus sending the parsed content as text tokens. My initial thought is to pay once with LlamaParse + gpt-4o to extract the content, but you're probably right that this is where we're headed in a year or so.

Victory Adugbo

Growth Marketing Leader & Business Developer || Expert in Hacking Business Growth in AI, Web3, and FinTech Companies || Automation Expert

1mo

Multi-modal RAG improves efficiency by leveraging both text and visual data, enhancing information synthesis. By parsing text and capturing page screenshots, it combines text parsing with visual recognition. This approach ensures more accurate and comprehensive information retrieval and generation, particularly for documents with complex visual elements.

The significance of a multi-modal Retrieval-Augmented Generation (RAG) architecture lies in its ability to process complex documents that contain both text and visual elements, like diagrams, charts, and tables. By parsing text and capturing screenshots of each page, it combines text parsing with visual recognition, enhancing the synthesis of information. This approach improves the efficiency and accuracy of information retrieval and generation in applications requiring detailed document analysis.

Syed Misbah

Data Science and Engineering Manager @ DISH Network | GenAI, Classical Machine Learning, Statistics

1mo

Apart from llama parse, are there any good recommendations for a self hosted parser?

Sanjay Balikar

Data Science | ML enthusiastic | Problem Solver | Innovation catalyst | Code & Cognition | GenAI

1mo

The multimodal capabilities of this system unlock innovative solutions by seamlessly integrating diverse data types. With #LlamaIndex, this integration is streamlined into a single, efficient pipeline, greatly simplifying the process.

Gajanan Kulkarni

Generative AI, LLM, Python, Service Automation, Cognitive RPA, AI, MLOps, LLMOps, ITIL 4, COBIT 5, ServiceNow GRC, SecOps

1mo

How can we make this work on IT knowledge articles where the steps to fix an issue are given as screenshots?

Kartik Saha

Technical Lead in AI @HCL Tech R&D-Centre of Excellence || Generative AI | LLMops | MLOps | DL | NLP ||

1mo

If we could also retrieve the exact screenshot image (via OCR) matching a query about structured/unstructured data in the document, and return it in the response alongside the text, that would be very helpful.

Refat Ametov

Driving Business Automation & AI Integration | Co-founder of Devstark and SpreadSimple | Stoic Mindset

1mo

This is an interesting approach to multimodal RAG! Combining text and image parsing in one pipeline seems very useful for document processing. How does the accuracy of this hybrid method compare to traditional text-only RAG pipelines? Do you see any challenges or limitations with this method? 
