Artificial Analysis

Technology, Information and Internet

About us

The leading independent benchmark for LLMs - compare quality, speed and price to pick the best model for your use case.

Website
https://artificialanalysis.ai/
Industry
Technology, Information and Internet
Company size
11-50 employees
Type
Privately Held

Updates

  • Artificial Analysis

    Thanks for the support Andrew Ng! Completely agree: faster token generation will become increasingly important as a greater proportion of output tokens is consumed by models rather than read by people, such as in multi-step agentic workflows.

    Andrew Ng

    Founder of DeepLearning.AI; Managing General Partner of AI Fund; Founder and CEO of Landing AI

    Shoutout to the team that built https://lnkd.in/g3Y-Zj3W. Really neat site that benchmarks the speed of different LLM API providers to help developers pick which models to use. This nicely complements the LMSYS Chatbot Arena, Hugging Face open LLM leaderboards and Stanford's HELM that focus more on the quality of the outputs. I hope benchmarks like this encourage more providers to work on fast token generation, which is critical for agentic workflows!

    Model & API Providers Analysis | Artificial Analysis
    artificialanalysis.ai

  • Artificial Analysis reposted this

    Micah Hill-Smith

    Co-Founder & CEO at Artificial Analysis | Ex-McKinsey

    Today marks 6 months since the launch of Artificial Analysis, so I thought I'd take the opportunity to share the story of what we've been building.

    Artificial Analysis is an independent benchmarking, evaluation and insights provider for AI. Our benchmarks let engineers and companies make the best decisions on which technologies and providers to use, empowering them to build the next generation of AI applications. We've built a benchmarking stack to measure the quality and performance of AI models that tests hundreds of API endpoints every day. We publish independent real-time analysis of language, image and voice models, and we work with companies across the AI value chain as an independent benchmarking partner.

    A handful of highlights from the last 6 months (links in comments):
    ‣ Being featured on the All-In Podcast, Latent Space Podcast and No Priors Podcast; being covered in VentureBeat, Gizmodo.com, SemiAnalysis and more
    ‣ Working with leading AI companies from chips to infrastructure to labs, including supporting recent launches from Groq and SambaNova Systems
    ‣ Support from industry leaders including Andrew Ng, Shawn swyx W and Chamath Palihapitiya
    ‣ Tens of thousands of users using Artificial Analysis every week, from start-ups to large enterprises
    ‣ Appearing on a panel in San Francisco hosted by BootstrapLabs, sharing thoughts on the role of benchmarking and evaluation in building trust in AI systems
    ‣ Hearing stories every week of how people are using Artificial Analysis to understand the AI market and build incredible applications

    The story of Artificial Analysis began nearly two years ago when I embarked on building an AI legal research tool. While building analysis tools to select optimal models for different parts of my legal research algorithm, I became obsessed with the problem of understanding and comparing AI models. In early 2023, I built a couple of experimental dashboards to share some of my early work on the problem and began to develop a framework for helping engineers make trade-offs between quality, speed and price. Late last year, I began to collaborate with my friend George Cameron (who I met interning at Google together many years ago!) to build Artificial Analysis.

    I'll be sharing more over the coming months about what metrics matter most in AI, how developers should think about comparing AI models for scaling production applications, and how we see our role in the industry evolving as the AI frontier moves forward.

    As we move into the next phase, our vision is to become the definitive source for data and insights in the AI industry. To achieve this, we're growing our team and are hiring now for engineering and analyst roles. If you're excited about what we're building here, we'd love to hear from you.

    Finally, please follow Artificial Analysis on LinkedIn and Twitter to stay in the loop for our upcoming launches and analysis!

  • Artificial Analysis

    Mistral released Codestral Mamba and Mathstral 7B today! Key takeaways to be aware of and why they are exciting below 👇

    ‣ Smaller, task-specific models: These models are focused on code and math respectively and contribute to the trend of smaller models focused on specific capabilities. Smaller, task-specific models have the advantage of improving relative inference speed and reducing compute cost for a given quality on a specific task. It is important to recognize, however, that smaller models (~7B) have generally not matched the benchmark capabilities of larger generalist state-of-the-art models. For example, on the HumanEval coding benchmark, GPT-4o and Claude 3.5 Sonnet score ~90% whilst Codestral Mamba 7B scores 75% and Codestral 22B scores 81%. That being said, they achieve this at likely 1/100th (or <1%) of the size of these models (GPT-4 is rumored to be ~1.8T parameters), and Codestral Mamba offers a much larger context window.

    ‣ Mamba architecture: Codestral Mamba uses the new Mamba architecture, an alternative approach to transformers. Mamba is a state space model architecture and has an advantage over transformers: transformer inference compute (and often inference time) scales quadratically with context/sequence length, while the Mamba architecture scales linearly (see the sketch below for the scaling intuition). This is why Codestral Mamba is able to offer a context window of 256k tokens, >7X Mistral 7B's context window. It also means faster inference when using the models, particularly for long-context use-cases such as RAG.

    ‣ Open-source approach: These models are released under the Apache 2.0 license, highlighting Mistral's commitment to open source without commercial restrictions. Mistral has previously taken a mixed approach: offering proprietary-only models (Mistral Large), releasing open models which require a commercial license for commercial use (Codestral 22B), and releasing open models without restrictions (Mixtral 8x7B, Mixtral 8x22B, and now Codestral Mamba and Mathstral). These releases follow the industry trend of AI labs offering their leading models on a closed-source basis, or open-source with commercial restrictions, while open-sourcing models which are not their leading models in terms of quality. This does not mean there are no use cases for these models, given their faster speeds and lower compute costs due to their smaller size!

    Exciting releases from Mistral AI! We plan to benchmark these models if they are offered on Mistral's API on a commercial basis. It will be interesting to see what people do with these models considering their task-specific performance and the ultra-long context window of Codestral Mamba. Will we see whole code-bases loaded into context with Codestral Mamba?
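
    A minimal sketch of that scaling intuition, assuming toy cost models (quadratic for self-attention, linear for a state-space scan); the constant factors are illustrative, not measured:

    ```python
    # Illustrative comparison of how per-sequence compute grows with
    # context length for self-attention (quadratic) vs. a state-space
    # scan (linear). Constants are made up for illustration; real costs
    # depend on model width, depth and implementation.

    def attention_cost(seq_len: int, c: float = 1.0) -> float:
        """Self-attention compares every token with every other token: O(n^2)."""
        return c * seq_len ** 2

    def ssm_cost(seq_len: int, c: float = 1.0) -> float:
        """A state-space scan processes tokens with a fixed-size state: O(n)."""
        return c * seq_len

    for n in (4_000, 32_000, 256_000):
        ratio = attention_cost(n) / ssm_cost(n)
        print(f"context {n:>7,}: attention/SSM cost ratio = {ratio:,.0f}x")
    ```

    At a 256k-token context the quadratic term dominates, which is the intuition behind Mamba's long-context advantage.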

  • Artificial Analysis

    Announcing the Artificial Analysis Text to Speech leaderboard, including our Speech Arena to crowdsource our quality ranking!

    We're covering leading speech models from ElevenLabs, OpenAI, Cartesia, LMNT, Google, Amazon Web Services (AWS) and Azure, along with open speech models like MetaVoice and StyleTTS hosted on Replicate and fal. We expect voice-enabled AI experiences to grow dramatically in importance over the coming years, so we're excited to launch full benchmarking coverage of Text to Speech to sit alongside our LLM and Speech to Text benchmarking.

    Quality will be compared using an Elo score based on data from our new Speech Arena (a sketch of the update rule follows below). After each vote, you can see which model you preferred, and after 30 results, you can see your own personal ranking of the models. You might learn some interesting facts along the way too!

    Link to Speech Arena: https://lnkd.in/e3z5vagp

    Results will start showing on the main leaderboard as soon as we've collected votes on more than 100 comparisons for each model. For the leaderboard, as usual, we're analyzing speech models across quality, price and speed. See the below tweets for highlights across our price and speed (API performance) benchmarks.
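
    For readers unfamiliar with Elo, here is a minimal sketch of the pairwise update rule used by arena-style leaderboards; the K-factor and starting ratings are illustrative assumptions, not our production parameters:

    ```python
    # Minimal Elo update for pairwise "which output did you prefer?" votes.
    # The K-factor and 1000-point starting rating are illustrative only.

    def expected_score(r_a: float, r_b: float) -> float:
        """Modelled probability that A beats B given current ratings."""
        return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

    def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
        """Return both ratings after one A-vs-B comparison."""
        e_a = expected_score(r_a, r_b)
        s_a = 1.0 if a_won else 0.0
        # B's change is the mirror image of A's, so ratings are conserved.
        return r_a + k * (s_a - e_a), r_b - k * (s_a - e_a)

    # Example: two models start level; the first wins a vote.
    r1, r2 = elo_update(1000.0, 1000.0, a_won=True)
    print(round(r1), round(r2))  # 1016 984
    ```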

  • Artificial Analysis reposted this

    Groq

    Google’s Gemma 2 9B is live on GroqChat and available via GroqCloud, running at 599 T/s! Try it now and follow along as Artificial Analysis builds out the model’s public benchmark (https://hubs.la/Q02GkqnH0). Gemma 2 9B accompanies other leading open-source models running on Groq from providers like Meta, Mistral, and OpenAI.

    Gemma 2 (9B): API Provider Performance Benchmarking & Price Analysis | Artificial Analysis
    artificialanalysis.ai

  • Artificial Analysis

    SambaNova is now offering an API endpoint to access Llama 3 8B on its RDU chips, which we previously benchmarked at 1,084 output tokens/s.

    SambaNova Systems is also differentiating itself from other offerings by allowing users to bring their own fine-tuned versions of the models. They appear to be offering API access on shared-tenant systems; as such, allowing users to bring their own fine-tuned models differentiates them from other providers, who typically require single-tenant dedicated deployments. This likely leverages the memory advantages of their SN40L chip. Access is being offered upon request.

    This is a next step toward an open-access commercial API offering that allows all AI developers to use SambaNova's custom-silicon RDU chips. We look forward to listing any commercial open-access API offerings powered by SambaNova chips on the main Artificial Analysis leaderboards in the future!

    SambaNova Systems

    Are you looking to unlock the power of lightning-fast inference speed at 1000+ tokens/sec on your own custom Llama 3? Introducing SambaNova Fast API, available today with free token-based credits to make it easier to build AI apps like chatbots and more. Bring your own custom checkpoint for both Llama 3 8B and Llama 3 70B and avoid the cost of acquiring hundreds of chips to get started.

    Relevance? The next phase of AI is Agentic AI; you'll need lots of models, big and small, working together as one system. Development teams will require ultra-fast token generation, which we know cannot be achieved with GPUs. That is not all… you'll need to host lots of models concurrently, with instantaneous switching between these models, which we know can't be achieved with other architectures due to their inefficiency. You can't get this speed, with a diversity of models, including your own custom model, behind a simple API anywhere else!

    SambaNova Fast API is available now: https://lnkd.in/g9W_Bnjv

    #FastAI #RDU #API

  • Artificial Analysis

    Smaller models are getting better and faster. Parameter efficiency is increasing, with quality improving from Mistral 7B to Llama 3 8B to the latest Gemma 2 9B with minimal increases in size. The Llama 3 and Gemma 2 papers shared the impact of overtraining in achieving this quality: Gemma 2 9B was trained on 8T tokens and Llama 3 on 15T tokens (a figure which includes the 70B model; the figure specifically for the 8B model was not released). A rough sketch of the arithmetic behind overtraining is below.

    While Google's Gemini 1.5 Flash's parameter count has not been announced, it is much faster than Gemini 1.5 Pro (165 output tokens/s vs. 61 tokens/s), indicating a much smaller model. It stands out as a clear leader in its quality and speed offering and is the fastest model we benchmark on Artificial Analysis when considering the median performance across providers.

    Smaller models are ideal for speed-, cost- and hardware-capacity-sensitive use-cases. For open-source models, they also enable local use on consumer GPUs. It's great to see the continued improvements!
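
    As a rough sketch of what "overtraining" means here, using the common Chinchilla heuristic of ~20 training tokens per parameter (the heuristic and the arithmetic below are rules of thumb, not the papers' exact numbers):

    ```python
    # Rough arithmetic behind "overtraining": the Chinchilla heuristic
    # suggests ~20 training tokens per parameter is compute-optimal for
    # training. Training far beyond that buys a smaller, cheaper-to-serve
    # model at a given quality level.

    CHINCHILLA_TOKENS_PER_PARAM = 20  # common rule of thumb, not exact

    models = {
        "Gemma 2 9B": (9e9, 8e12),   # params, reported training tokens
        "Llama 3 8B": (8e9, 15e12),  # 15T reported for the Llama 3 family
    }

    for name, (params, tokens) in models.items():
        optimal = params * CHINCHILLA_TOKENS_PER_PARAM
        print(f"{name}: ~{tokens / optimal:.0f}x the Chinchilla-optimal "
              f"{optimal / 1e12:.2f}T tokens")
    ```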

  • Artificial Analysis

    Today OpenAI banned access to its API from China. China has already blocked access to ChatGPT using the 'Great Firewall'.

    In the past few months, we have seen models from AI labs headquartered in China start to catch up to the quality of models developed globally. These geo-restrictions will support demand for models developed by AI labs headquartered in China, and we may also see an acceleration in AI development to meet this demand. 01.AI, Deepseek, Alibaba Cloud, SenseTime 商汤科技 and Baidu, Inc. are key companies to watch in this space.

    For those using AI, this adds another consideration when choosing models. APIs may not be accessible from everywhere, and we could potentially see further restrictions (e.g. on use of the output of LLMs). We will look to provide information on this on Artificial Analysis to support users in choosing technologies.

    Artificial Analysis

    Models from AI labs headquartered in China 🇨🇳 are now competitive with the leading models globally 🌎

    Qwen 2 72B from Alibaba Cloud has the highest MMLU score of open-source models, and Yi Large from 01.AI and Deepseek v2 from DeepseekAI are amongst the highest-quality models and are priced very competitively. We have initiated coverage of these on Artificial Analysis.

    Previously, models from AI labs headquartered in China were generally not competitive with those from the leading AI labs globally. They also had issues with multilingual performance, likely due to their Chinese-focused training data, and in some cases output Chinese characters in response to English prompts. This has changed over the past couple of months, with new models released which benchmark amongst the leading models globally. These labs have achieved this using similar techniques to labs elsewhere: training models on many times more tokens than is Chinchilla-optimal, training larger models, using techniques like Mixture of Experts, and improving training data quality (including through extensive use of synthetic and LLM-refined data). The labs are also increasing their marketing to global audiences, as shown by Yi Large being accessible on Fireworks AI.

    While Qwen 2 72B has the highest MMLU score of open-source models, it is important to note that Meta has announced it will shortly release Llama 3 405B, which is likely to far exceed the capabilities of all open-source models available today.

    We have commenced benchmarking of these models on Artificial Analysis. Link to analysis: https://lnkd.in/g4bbqEre
