😎 Who knew you could create an amazing LLM that outperforms all other open-source models with just 20,000 human-labeled training samples?

💡 The coolest thing about Nvidia's Nemotron is that it became the state-of-the-art open-source LLM using 98% synthetic data and only 20k human-labeled samples! Nemotron not only beat all other open-source LLMs on standard benchmarks but also topped the LMSYS leaderboard.

What's special about their synthetic data generation?

⛳ Prompt Generation
They generate a range of prompts that span various tasks, topics, and instructions using Mixtral-8x7B-Instruct-v0.1:
👉 Synthetic Single-Turn Prompts
👉 Synthetic Instruction-Following Prompts
👉 Synthetic Two-Turn Prompts
👉 Real-World LMSYS-Chat-1M Prompts

⛳ Synthetic Dialogue Generation
Synthetic dialogues are generated for supervised fine-tuning, which teaches the model to interact in a dialogue format. Each conversation comprises three turns for a dynamic and interactive flow. The quality of these dialogues is controlled using the Nemotron-4-340B-Reward model, which filters out lower-quality samples.

⛳ Synthetic Preference Data Generation
👉 The synthetic data includes single-turn, instruction-following, and two-turn prompts, as well as real-world prompts from various datasets.
👉 Responses are generated using multiple random intermediate models to ensure diversity.
👉 For judging preference, they use ground-truth labels when available. Otherwise, they employ two methods: LLM-as-Judge, where a language model compares responses, and Reward-Model-as-Judge, where the reward model predicts the reward for each response.

⛳ Iterative Weak-to-Strong Alignment
👉 They develop an iterative approach that combines alignment training and data synthesis, so that each enhances the other and the data is progressively refined.
👉 The approach starts with an initial aligned model as the data generator, aligns a better base model using supervised fine-tuning and preference tuning, and repeats the cycle, achieving significant improvement in model performance through iterations.

#genaish #llms
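To make the Reward-Model-as-Judge idea from the post a bit more concrete, here is a minimal Python sketch. It is not Nvidia's pipeline: `toy_reward` is a hypothetical stand-in for a real reward model such as Nemotron-4-340B-Reward, and the `min_margin` cutoff and the `prompt`/`chosen`/`rejected` record layout are assumptions made for illustration. The sketch only shows the general pattern: score every candidate response, keep the highest-scoring one as "chosen" and the lowest-scoring one as "rejected".

```python
from typing import Callable, Dict, List, Optional


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical scorer: number of distinct prompt words echoed in the response.

    A real setup would call a trained reward model here instead.
    """
    prompt_words = set(prompt.lower().split())
    response_words = set(response.lower().split())
    return float(len(prompt_words & response_words))


def build_preference_pair(
    prompt: str,
    responses: List[str],
    reward_fn: Callable[[str, str], float],
    min_margin: float = 1.0,  # assumed cutoff, not taken from the post
) -> Optional[Dict[str, str]]:
    """Return a chosen/rejected pair, or None if the scores are too close."""
    if len(responses) < 2:
        return None
    # Score every candidate, then sort from lowest to highest reward.
    scored = sorted(
        ((reward_fn(prompt, r), r) for r in responses),
        key=lambda item: item[0],
    )
    low_score, rejected = scored[0]
    high_score, chosen = scored[-1]
    # Skip ambiguous pairs where the judge barely separates the responses.
    if high_score - low_score < min_margin:
        return None
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}


if __name__ == "__main__":
    pair = build_preference_pair(
        "Explain what synthetic data is",
        [
            "Synthetic data is data generated by a model",
            "I do not know",
            "It is data",
        ],
        toy_reward,
    )
    print(pair)
```

Swapping `toy_reward` for a call to an actual reward model would leave the rest of the function unchanged, which is the appeal of treating the judge as a plain scoring function.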
98% synthetic, 2% human: that shocks me. I'll have to see how it performs in real-world scenarios...
Right?! 🤯 It's crazy impressive how efficient synthetic data has become. Also, using the reward model to filter dialogue quality is super smart. Ensures only the best data is used for training. Could lead to even more reliable models in the future.
Aishwarya, synthetic data is becoming more and more promising as the model training and test scenarios evolve.
Insightful! 98% synthetic data needs to be validated in real-world scenarios... If it really does well, that's amazing.
Looks promising, will check it out.
Has the LLM been tested against RWE to ensure its accuracy?
How do I get started using Nemotron?
Insightful!
This is a remarkable breakthrough! Utilizing 98% synthetic data and only 20,000 human-labeled samples to achieve state-of-the-art performance is truly innovative. Nvidia’s method of generating diverse synthetic prompts and dialogues is impressive, showcasing the power of synthetic data in training advanced models. The iterative alignment approach ensures continuous improvement, making Nemotron a standout in the LLM space. This demonstrates the potential of synthetic data in scaling AI models efficiently. Nvidia is pushing the boundaries of what’s possible in AI!