Stefano Fiorucci’s Post

Contributing to Haystack, the LLM Framework 🏗️ | NLP Engineer, Craftsman and Explorer 🧭

🚧 𝐔𝐩𝐝𝐚𝐭𝐞: Recent evaluations have raised questions about the validity of BM42. Future developments may address these concerns. Please keep this in mind when reading the post.

---

👋 Say hello to BM42 🔎

Qdrant introduced BM42, an algorithm that aims to replace BM25 in hybrid RAG pipelines (dense + sparse retrieval). They found that BM25, while relevant for a long time, has some limitations in common RAG scenarios.

To understand the motivation and inspiration for BM42, let's first examine BM25 and SPLADE. 👇

𝐁𝐌25
BM25 is an evolution of TF-IDF and has two components:
- Inverse Document Frequency (IDF) = term importance within a collection
- a component incorporating Term Frequency (TF) = term importance within a document

❌ The Qdrant folks observed that the TF component relies heavily on document statistics, which only make sense for longer texts. This is not the case in common RAG pipelines, where documents are short.

𝐒𝐏𝐋𝐀𝐃𝐄
SPLADE takes a different approach, using a BERT-based model to create a bag-of-words representation of the text. While it generally performs better than BM25, it has some drawbacks:
⚠️ tokenization issues with out-of-vocabulary words
⚠️ adaptation to new domains requires fine-tuning
⚠️ computationally heavy

𝐁𝐌42
Taking inspiration from SPLADE, the Qdrant team developed BM42 to improve on BM25. IDF works well, so they kept it. ✅
But how do you quantify term importance within a document? ❓
💡 The solution lies in the attention matrix of Transformer models: we can use the attention row for the [CLS] token!
To fix tokenization issues, BM42 merges subwords and sums their attention weights.
In their implementation, Qdrant used the all-MiniLM-L6-v2 model, but this technique can work with any Transformer, with no fine-tuning needed.

We've already integrated BM42 into the #haystack LLM orchestration framework. ⚡

𝐁𝐌42 𝐢𝐧 𝐚𝐜𝐭𝐢𝐨𝐧: Haystack + Qdrant + FastEmbed 📓 https://lnkd.in/dSMigDWf
🗂️ Resources in the comments 💬

#rag #informationretrieval #transformers #nlp
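The scoring idea described above can be sketched in a few lines of plain Python. This is a toy illustration, not Qdrant's actual implementation: the tokens, attention weights, and corpus statistics below are made-up stand-ins for what a real model such as all-MiniLM-L6-v2 and a real collection would provide.

```python
import math

def merge_subwords(tokens, weights):
    """Merge WordPiece subwords (marked with '##') back into whole words,
    summing their [CLS]-attention weights, as the post describes."""
    words, scores = [], []
    for tok, w in zip(tokens, weights):
        if tok.startswith("##") and words:
            words[-1] += tok[2:]   # glue the subword onto the previous word
            scores[-1] += w        # and accumulate its attention weight
        else:
            words.append(tok)
            scores.append(w)
    return dict(zip(words, scores))

def idf(doc_freq, n_docs):
    """Standard BM25 IDF: the component BM42 keeps unchanged."""
    return math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical [CLS]-attention row over one short document's tokens.
tokens = ["retrieval", "aug", "##ment", "##ed", "generation"]
weights = [0.30, 0.10, 0.15, 0.05, 0.40]
importance = merge_subwords(tokens, weights)
# "aug" + "##ment" + "##ed" collapse into "augmented" with weight 0.10+0.15+0.05

# Made-up corpus statistics: 1000 documents, per-term document frequencies.
doc_freq = {"retrieval": 50, "augmented": 120, "generation": 300}

# BM42-style sparse score of this document for the query "retrieval generation":
# per-term attention weight (in-document importance) times IDF (corpus importance).
score = sum(idf(doc_freq[t], 1000) * importance[t]
            for t in ["retrieval", "generation"])
print(importance)
print(round(score, 3))
```

In a real pipeline the weights would come from one forward pass of the Transformer per document at indexing time, while IDF is maintained by the search engine as usual.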

Stefano Fiorucci

2w

Does this require a transformer inference pass on every document indexed?

Junte Zhang

Making new things possible with search engines

2w

Any comparisons of this metric with BM25 on existing test collections? How much difference is there in relative rank?

Francesco Saverio Zuppichini

OPINIONS ARE MY OWN - Machine Learning Engineer 📸🦾 | Computer Vision | Generative AI | NLP | Web Dev | Open Source

2w

Yeah, but it's way slower, right? You need to run a model that has the [CLS] token, so you need to keep one more model online.

John Ryan

Director of Engineering | AI | Search Relevance

1w

Qdrant's implementation of BM42 is certainly a step in the right direction. The importance of a term relative to the corpus is where BM25 gets its true power, and I don't think BM42 is quite there yet as a replacement. In the Expert Network industry, a main concern is the ever-changing nature of the data: hot topics and the constant evolution of expert profiles mean we are constantly updating our search data. Having to perform inference on every document at each update is a big hit, since ingestion and query speed are crucial to achieving high-quality, relevant searches. Still, I love the way this space is going.


Transformer models significantly influenced BM42's development by providing a novel way to quantify term importance within documents. Stefano Fiorucci's post highlights how BM42 utilizes the attention matrix from Transformer models, specifically the attention row for the [CLS] token, to improve upon BM25's limitations. This approach, inspired by SPLADE but avoiding its drawbacks, allows for effective document relevance scoring without the need for fine-tuning, making it a versatile solution for various document lengths.

Alex G.

AI Research YouTube Videos 0day News

2w

https://github.com/vtempest/Wiki-BM25-search calculates term specificity for a single doc with the BM25 formula by using Wikipedia IDF; the problem with BM25 and TF-IDF is that a large set of documents is needed. In https://github.com/vtempest/Wiki-BM25-search/blob/master/test/bm42.js I have benchmarked and written bm42.js, implemented in transformers.js with a BERT model. It does not seem fully accurate yet in setting weights / NER.

Andre Zayarni

Co-founder at Qdrant, Vector Database.

2w

Superfast integration! Qdrant deepset 🚀

Vladimir Blagojevic

Senior Software Engineer @ deepset

2w

Wow, cool Stefano Fiorucci - moving super fast 🚀
