Vals AI

Software Development

Evaluating Large Language Models.

About us

Website
https://vals.ai
Industry
Software Development
Company size
2-10 employees
Headquarters
San Francisco
Type
Privately Held

Updates

  • Vals AI reposted this

    Leonard Park

    Experienced LegalTech Product Manager and Attorney | Passionate about leveraging AI/LLMs

    Since I can't let Campbell have all of the Day 0 fun regarding OpenAI's o1-preview announcement:

    1. The naming convention is a travesty. We went from gpt-{number}-{suffix} to gpt-{number}{letter}-{suffix} to now {letter}{number}-{suffix}. 😫🤮

    2. o1 works through RL fine-tuning that bakes agentic planning/reasoning into a "Reasoning" generation phase, which helps on complex reasoning tasks. These Reasoning tokens are then discarded, and only the post-Reasoning answer is returned to the user. Anthropic is rumored to use a similar technique for its Claude.ai chatbot model, which plans with <thinking> tokens that are not revealed in the answer but can be exposed through elaborate prompt engineering.

    2a. Obscuring the Reasoning tokens feels like a play to "own" agentic reasoning behind a moat while charging a premium for it. OpenAI can optimize the 4o models for cost-efficient zero-shot performance, with the 🤑premium🤑 o1 reasoning models as the high-end offering.

    2b. All of this Reasoning means increased cost and inference time. To support it, the o1 models now have 32k output limits. Since input, output, and Reasoning tokens share the same pool, this could mean reserving a lot more output tokens to prevent truncated answers (via the new "max_completion_tokens" API parameter added to support the o1 models). This likely won't matter most of the time, and it's hard to know for sure with the Reasoning tokens obscured.

    3. I don't have a Tier 5 account, so I have to wait to set my credit card on fire, but ChatGPT+ has the o1 models selectable today. Just keep in mind they are limited to 30/50 requests per WEEK right now. That will likely be raised soon, but for now, make them count!
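    The shared-budget point in 2b can be sketched with a little arithmetic. This is a purely illustrative helper, not part of any SDK, and the token counts are made up.

```python
# Purely illustrative helper for the shared-budget point above: with
# o1-style models, hidden Reasoning tokens and the visible answer draw
# from the same "max_completion_tokens" pool, so an answer can be
# truncated even when the budget looks generous. All numbers are made up.

def visible_answer_budget(max_completion_tokens: int,
                          reasoning_tokens: int) -> int:
    """Tokens left for the user-visible answer after hidden reasoning."""
    return max(0, max_completion_tokens - reasoning_tokens)

# 6,000 hidden Reasoning tokens out of an 8,000-token budget leave only
# 2,000 tokens for the answer the user actually sees.
print(visible_answer_budget(8000, 6000))  # 2000
print(visible_answer_budget(8000, 9000))  # 0 -> answer fully truncated
```

    Because the Reasoning token count is obscured, the only safe move in practice is to set the budget well above what the visible answer alone would need.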

  • Vals AI

    315 followers

    Super exciting work released by Harvey yesterday! We believe that strong evals are the foundation of great products, and in-house evals are an essential step towards this. It's also important that the benchmarks reflect real-world workflows - in this case, the BigLaw benchmark represents work that real lawyers do daily. There’s still plenty of work to do around neutral, third-party review. As was alluded to in their appendix, we’re looking forward to collaborative efforts to develop industry-standard benchmarks for legal tasks with our vals.ai effort.

    Winston Weinberg

    CEO & Co-Founder @ Harvey

    Excited to announce BigLaw Bench, a new standard to evaluate legal AI systems based on real-world billable work that lawyers actually do. We define performance on this benchmark as "What % of a lawyer-quality work product does the model complete for the user?" Harvey’s AI systems outperform leading foundation models on domain-specific tasks, producing 74% of a final, expert lawyer-quality work product. The outputs are more detailed, capture more nuance, and are much closer to final lawyer quality. More details and performance on different tasks coming soon. https://lnkd.in/gV9BSEsB

    Introducing BigLaw Bench (harvey.ai)
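    The post does not describe how the "% of a lawyer-quality work product" figure is computed; one plausible, purely illustrative reading is a weighted-rubric fraction. The rubric items, weights, and function below are invented for illustration and are not Harvey's actual methodology.

```python
# Hypothetical sketch of a "% of lawyer-quality work product" metric:
# score a model answer against a rubric of weighted criteria and report
# the fraction of total weight earned. Rubric contents are invented.

def work_product_score(earned: dict, rubric: dict) -> float:
    """Fraction of total rubric weight earned (0.0 - 1.0)."""
    total = sum(rubric.values())
    got = sum(rubric[k] for k, hit in earned.items() if hit)
    return got / total if total else 0.0

rubric = {"identifies_governing_law": 2.0,
          "cites_correct_precedent": 3.0,
          "drafting_quality": 3.0,
          "no_hallucinated_citations": 2.0}
earned = {"identifies_governing_law": True,
          "cites_correct_precedent": True,
          "drafting_quality": True,
          "no_hallucinated_citations": False}
print(round(work_product_score(earned, rubric), 2))  # 0.8
```

    A real benchmark of this kind would need expert-written rubrics per task and graders (human or model) to fill in the `earned` judgments.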

  • Vals AI reposted this

    Leonard Park

    Experienced LegalTech Product Manager and Attorney | Passionate about leveraging AI/LLMs

    Meta's Llama 3.1 release is at the top of the LegalBench leaderboard, according to vals.ai's most recent benchmarking. This could indicate a really high level of legal-language parsing capability for an LLM, or maybe they trained on the corpus 😜. Llama 3.1 405B is at the top of the charts, AND 😲 Llama 3.1 70B 😲 takes second place above GPT-4o. I'd previously avoided the Llama family of models due to their limited context windows, but that has also been addressed with the 3.1 family. Still, beating OpenAI's GPT-4o at less than 1/5 the cost (exact cost varies by provider) is very impressive.

    Vals AI

    🚨 Llama 3.1 405B sets a new SOTA on several of our legal tasks 🚨
    - Llama 3.1 405B and 70B took the top two spots on LegalBench, a composite of 150 legal tasks across 5 categories.
    - Llama 3.1 405B also set a new SOTA on our ContractLaw tasks, one of our completely private datasets. On TaxEval and CorpFin, it placed 4th and 6th.
    - It's priced similarly to Sonnet 3.5 and GPT-4o on input tokens, at $5 / MTok. Its output tokens are also $5 / MTok, which makes it cheaper for longer outputs.
    With this powerful new entry from Meta, we will see how Anthropic and OpenAI respond. Anthropic has previously teased 3.5 Opus, and OpenAI recently released GPT-4o Mini, which significantly outperformed many of the other "budget" models on our evaluations. See the full results at https://www.vals.ai.
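    The pricing point can be made concrete with back-of-envelope arithmetic. The helper and prices below are illustrative placeholders in $ per million tokens (MTok), not a quote of any provider's current price sheet.

```python
# Back-of-envelope request-cost sketch for the pricing comparison above.
# Prices are illustrative placeholders in $/MTok; check each provider's
# current price sheet before relying on them.

def request_cost(input_toks: int, output_toks: int,
                 in_price: float, out_price: float) -> float:
    """Dollar cost of one request, with prices given in $/MTok."""
    return (input_toks * in_price + output_toks * out_price) / 1_000_000

# Hypothetical: identical $5/MTok input pricing, but a lower output
# price makes long generations noticeably cheaper.
cost_cheap_output = request_cost(10_000, 4_000, in_price=5.0, out_price=5.0)
cost_pricey_output = request_cost(10_000, 4_000, in_price=5.0, out_price=15.0)
print(cost_cheap_output, cost_pricey_output)  # 0.07 0.11
```

    The gap grows with output length, which is why output-token pricing dominates for long-form legal drafting workloads.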

  • Vals AI

    Join our discussion at the LegalTech Hub’s Innovation conference on September 19th! 🔥 Thanks to Nicola Shaver and the Legal Tech team for organizing this incredible conference with a top-tier lineup.

    Nicola Shaver

    Driving the Future of Law at Legaltech Hub | Innovation, AI, Legaltech Leader, Advisor, Investor | LLB, MBA | Fastcase 50, 2021 & 2024, ABA Women of Legaltech, 2022 | Adjunct Professor

    Good news: Registration for our annual Innovation and Legaltech Conference is now live, with early-bird pricing of only $395 per person available until July 31st! Taking place on Thursday, September 19 at New York Law School, this one-day event is not to be missed, including:
    🔥 Insights on scaling innovation and AI deployment from the former head of innovation at PepsiCo
    🔥 A live roundtable with some of the key names in legal AI, including Harvey CEO Winston Weinberg, Leya CEO Max Junestrand, and vLex VP Damien Riehl
    🔥 An AI regulatory update from an expert on the subject, Michael Charles Borrelli
    🔥 Practical insights on adopting responsible AI policies with Hadassah Drukarch of the Responsible AI Institute and Stephanie Goutos
    🔥 The latest on evaluating AI efficacy from a company in the business of evaluating LLM performance, vals.ai's Rayan K. and Langston Nashold, including an overview of a new legal AI evaluation project in the works
    ... with more to come. At this price it's the perfect conference to send your team to, along with your makers, AI experts, data scientists, and data analysts. This is the conference that gets into the technical stuff, introduces you to the cool new products on the market, and lets you ask questions of key people in the market. Not to be missed! Grab your early-bird tickets below. #legaltech #legal #AI #GenAI #legalinnovation

    The 2024 LTH Innovation and Tech-Enabled Lawyering Conference | Registration (2024innovation.legaltechnologyhub.com)

  • Vals AI

    Excellent article from Laurent Wiesel

    Laurent Wiesel

    Legal Engineer | Innovating Law with Advanced AI & Programming Expertise | Transforming Legal Practice through Technology

    ChatGPT and other large language models have blown our minds with capabilities beyond our wildest expectations. But as we move from "wow factor" to real-world implementation, quality becomes paramount. AI quality isn't only about accuracy. In this article, I begin to share my takeaways from last week's AI Quality Conference in San Francisco by introducing 12 pillars of AI quality that every organization should consider in evaluating AI systems, including foundational model quality, data quality, and more. Whether you’re a BigLaw partner, in-house counsel, or legal tech enthusiast, understanding these pillars is crucial. They’re not just theoretical – they’re the building blocks of trustworthy, effective, and ethically sound AI in law. #AIinLegal #LegalAI #LegalTech #FutureOfLaw #AIQuality Which of these pillars intrigues you most? Drop a comment – I’d love to hear your thoughts!

    AI Quality: A 12-Point Framework for Legal AI (Laurent Wiesel on LinkedIn)

  • Vals AI

    Anthropic just released a new model, Sonnet 3.5. But how does it stack up on legal tasks? We found that it achieved SOTA performance on two of our four benchmarks. On the remaining two, it performed well, but behind Opus and GPT-4. Some other notes:
    - It had much lower latency than Opus across the board, and also comes at a much cheaper price point.
    - The rollout was smooth - we didn't perceive any delays or server issues while running our benchmarks.
    - Across the board, it performed much better than Claude Sonnet 3.0.
    We're excited to see what Claude 3.5 Opus will look like. View the full results at https://www.vals.ai.

    Vals.ai: LegalBench (vals.ai)

  • Vals AI

    Now that Gemini has released its pay-as-you-go pricing, we've added it to our public benchmarks at https://vals.ai. Here are some highlights:
    - Gemini still blocks a significant number of requests. It is also overly verbose in many cases and has more difficulty following in-context examples than other models.
    - On LegalBench, the model placed 5th - a few percentage points higher than its predecessor, Gemini 1.0.
    - Across all four tasks, it performed similarly to Sonnet and Command R+ - however, it's less than half the price for output tokens (input token prices are equivalent).
    - The infrastructure still leaves a lot to be desired - we consistently ran into rate limits much lower than the advertised 10K RPD. Unlike other providers, which display daily usage in easy-to-read graphs, it is also very hard to monitor and track usage. Google also maintains two separate APIs for accessing its Gemini models (Vertex and AI Studio), which further decreases usability.
    We plan to construct a dataset to test its long-context capabilities in the coming weeks, and to test the Gemini Flash model as well.

    Vals.ai: LegalBench (vals.ai)
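    For rate-limit issues like those described above, a jittered exponential-backoff wrapper is the usual workaround when running benchmarks at volume. This sketch is provider-agnostic; the exception type, retry count, and delays are placeholder assumptions, not any specific SDK's API.

```python
# Provider-agnostic sketch of jittered exponential backoff for the kind
# of rate-limit errors described above. The exception type, retry count,
# and delays are placeholder assumptions, not any specific SDK's API.
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0,
                 retryable=(RuntimeError,)):
    """Retry call() on retryable errors, doubling the delay each time."""
    for attempt in range(max_retries):
        try:
            return call()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay * 0.1))  # jitter
```

    In practice you would catch the SDK's own rate-limit exception (a 429-style error) instead of RuntimeError, and honor any Retry-After hint the API returns.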

  • Vals AI reposted this

    As an early entrant in the legal AI space, Harvey had a unique opportunity to shape the future of legal technology. In fact, my first glimpse of GPT-4-caliber systems came through an early demo of Harvey, shortly before the release of ChatGPT in 2022, underscoring just how far ahead of the curve they were.

    While Harvey's stealth strategy has left some in the legal community scratching their heads, comparisons to high-profile failures like Theranos are premature at this stage. Harvey is a young company operating in a nascent field, and they deserve the opportunity to prove themselves. But I do agree that greater community building, transparency, and engagement could've helped build trust and foster collaboration with the broader legal community. Somebody like Alex Su or David Lat would've done them wonders.

    As an aside, I've heard from 'senior technical sources' that the Harvey model touted on OpenAI's website is indeed a significantly changed model, likely involving both pre-training AND fine-tuning with Harvey-provided or -assisted reinforcement learning. This suggests that Harvey has been quietly pushing the boundaries of what's possible in legal AI. But again, without public benchmarks like the Stanford University-led LegalBench, or others like vals.ai, we will never know. The same goes for model outputs from Casetext, Part of Thomson Reuters, etc. https://lnkd.in/gBMT4Fgx

    There is Something about Harvey (medium.com)

Funding

Vals AI: 1 total round

Last Round

Pre-seed
See more info on Crunchbase