Saurabh Kumar’s Post

Engineering @Adora | Prev. Rapyuta(ML), Yahoo(ML), Nokia | IIT Delhi

There are many LLMs on the market: GPT-4, Claude, Gemini, Mistral, etc. But which LLM suits your needs best, and how do you choose? LLM benchmarks are often confusing. As a user, which benchmark score should you pay attention to when choosing the right LLM? Here is a curated (not exhaustive) list of popular benchmarks and what each aims to measure.

1. SQuAD (Stanford Question Answering Dataset): Tests reading comprehension by answering questions based on given passages.
2. RACE: Evaluates understanding of real-world passages through multiple-choice questions taken from English exams.
3. HellaSwag: Tests common sense by predicting the most plausible continuation of a situation.
4. LAMBADA: Checks if a model can predict the correct final word of a passage by understanding the broader context.
5. BoolQ: Yes/no questions answered from short passages. Tests query understanding.
6. MultiRC: Multiple-choice questions that require reasoning over several sentences of a passage, with possibly more than one correct answer.
7. ARC (AI2 Reasoning Challenge): Grade-school science multiple-choice questions that require reasoning rather than simple retrieval.
8. CommonsenseQA: Tests commonsense reasoning through multiple-choice questions about everyday scenarios.
9. OpenBookQA: Science multiple-choice questions answered with the help of an "open book" of elementary facts.
10. GSM8K: Grade-school math word problems requiring multi-step reasoning.
11. CodeXGLUE: Evaluates code understanding and generation abilities across programming languages.
12. APPS: Measures the ability to generate correct Python programs from natural-language problem descriptions, from introductory to competition level.
13. HumanEval: Tests coding ability by having the model write functionally correct programs from docstrings, checked against unit tests.
14. HumanEval-X: A multilingual extension of HumanEval covering programming languages beyond Python.
15. PIQA (Physical Interaction QA): Tests physical commonsense by choosing the more sensible of two ways to accomplish an everyday goal.
16. VCR (Visual Commonsense Reasoning): Tests understanding of situations depicted in images through question answering with rationales.
17. NLVR2: Determines whether a statement accurately describes a pair of images.
18. MMLU (Massive Multitask Language Understanding): Multiple-choice questions spanning 57 subjects, from STEM to the humanities, testing breadth of knowledge and reasoning.

So choose your LLM by looking at its scores on the benchmarks closest to your use case. If you want to sanity-check a score yourself, there is a quick sketch below.
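Most of these datasets are available on the Hugging Face Hub, so a score can be spot-checked in a few lines. Below is a minimal sketch, assuming the `datasets` library is installed and using a hypothetical `ask_llm` helper (not a real API; plug in whatever model you are evaluating) to measure exact-match accuracy on a handful of BoolQ yes/no questions:

```python
# Minimal sketch: spot-check an LLM on a few BoolQ examples.
# Assumes `pip install datasets`. `ask_llm` is a hypothetical placeholder:
# replace its body with a call to whichever model or API you are evaluating.
from datasets import load_dataset

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model/API call here")

# BoolQ: each example has a passage, a yes/no question, and a boolean answer.
dataset = load_dataset("boolq", split="validation")

n = 20  # small sample for a quick sanity check, not a full benchmark run
correct = 0
for example in dataset.select(range(n)):
    prompt = (
        f"Passage: {example['passage']}\n"
        f"Question: {example['question']}\n"
        "Answer with exactly one word: yes or no."
    )
    predicted_yes = ask_llm(prompt).strip().lower().startswith("yes")
    correct += int(predicted_yes == example["answer"])

print(f"Accuracy on {n} BoolQ examples: {correct / n:.2%}")
```

For serious comparisons, a standardized harness such as EleutherAI's lm-evaluation-harness is a better choice, since prompting and answer-extraction details can noticeably shift scores.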

Hardik Bishnoi

MSCS @ Northeastern University | Prev - ML @Deloitte | Making GenAI systems run fast!

5mo

This is a pretty good list. Unfortunately, the relevance of publicly available benchmarks keeps changing, because a lot of models on Hugging Face contaminate their training data with test samples to boost benchmark scores. They are important nonetheless, but just one metric to consider, not an absolute one. Often the best benchmark is testing the model out yourself (if you can, say, in a playground) and reading the discussion about it on r/LocalLLaMA.
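One way to make that "test it yourself" advice systematic is to keep a small set of your own prompts with simple pass/fail checks and run every candidate model over it. A rough sketch, where `call_model` is a hypothetical wrapper around whichever APIs or local models you are comparing, and the model names are placeholders:

```python
# Rough sketch of a personal eval: your own prompts, your own pass/fail checks.
# `call_model` is a hypothetical wrapper; replace it with your actual client code.
def call_model(model_name: str, prompt: str) -> str:
    raise NotImplementedError("plug in your API or local model here")

# Each case pairs a prompt with a crude keyword check standing in for "looks right".
test_cases = [
    {"prompt": "Summarise the GDPR in two sentences.", "must_contain": "data"},
    {"prompt": "Write a Python one-liner to reverse a string.", "must_contain": "[::-1]"},
]

for model in ["model-a", "model-b"]:  # placeholder model names
    passed = 0
    for case in test_cases:
        output = call_model(model, case["prompt"])
        passed += int(case["must_contain"].lower() in output.lower())
    print(f"{model}: {passed}/{len(test_cases)} checks passed")
```

It is crude, but it tests exactly the behaviour you care about, which published benchmark scores never fully capture.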

Stefano Fiorucci

Contributing to Haystack, the LLM Framework 🏗️ | NLP Engineer, Craftsman and Explorer 🧭

4mo

I recommend this great blog post that provides guidance on using benchmarks to choose base and chat LLMs: https://osanseviero.github.io/hackerllama/blog/posts/llm_evals/

Harsh Raj

SE II @ JPMC | CSE @ IIT Dharwad

5mo

16. VCR sounds interesting. Any links/references for this?
