Vals AI

Software Development

Evaluating Large Language Models.

About us

Website
https://vals.ai
Industry
Software Development
Company size
2-10 employees
Headquarters
San Francisco
Type
Privately Held

Updates

  • Vals AI reposted this

    Leonard Park

    Experienced LegalTech Product Manager and Attorney | Passionate about leveraging AI/LLMs

    Since I can't let Campbell have all of the Day 0 fun regarding OpenAI's o1-preview announcement:

    1. The naming convention is a travesty. We went from gpt-{number}-{suffix} to gpt-{number}{letter}-{suffix} to now {letter}{number}-{suffix}. 😫🤮

    2. o1 works through RL fine-tuning that bakes agentic planning/reasoning into a "Reasoning" generation phase, which helps on complex reasoning tasks. These Reasoning tokens are then discarded, and only the post-Reasoning answer is returned to the user. Anthropic is rumored to use a similar technique for its Claude.ai chatbot model, which plans with <thinking> tokens that are not revealed in the answer but can be exposed through elaborate prompt engineering.

    2a. Obscuring the Reasoning tokens feels like a play to "own" agentic reasoning behind a moat while charging a premium for it. OpenAI can optimize the 4o models for cost-efficient zero-shot performance, with the 🤑premium🤑 o1 reasoning models as the high-end offering.

    2b. All of this Reasoning means increased cost and inference time. To support it, the o1 models now have 32k output limits. Since input, output, and Reasoning tokens share the same pool, this could mean reserving a lot more output tokens to prevent truncated answers (via the new "max_completion_tokens" API parameter added to support the o1 models). This likely won't matter most of the time, and it's hard to know for sure with the Reasoning tokens obscured.

    3. I don't have a Tier 5 account, so I have to wait to set my credit card on fire, but ChatGPT+ has the o1 models selectable today. Just keep in mind they are limited to 30/50 requests per WEEK right now. That will likely be raised soon, but for now, make them count!
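    The shared-budget point in 2b can be sketched with a little arithmetic. This is a purely illustrative helper, not part of any SDK, and the token counts are made up.

```python
# Purely illustrative helper for the shared-budget point above: with
# o1-style models, hidden Reasoning tokens and the visible answer draw
# from the same "max_completion_tokens" pool, so an answer can be
# truncated even when the budget looks generous. All numbers are made up.

def visible_answer_budget(max_completion_tokens: int,
                          reasoning_tokens: int) -> int:
    """Tokens left for the user-visible answer after hidden reasoning."""
    return max(0, max_completion_tokens - reasoning_tokens)

# 6,000 hidden Reasoning tokens out of an 8,000-token budget leave only
# 2,000 tokens for the answer the user actually sees.
print(visible_answer_budget(8000, 6000))  # 2000
print(visible_answer_budget(8000, 9000))  # 0 -> answer fully truncated
```

    Because the Reasoning token count is obscured, the only safe move in practice is to set the budget well above what the visible answer alone would need.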

  • Vals AI

    315 followers

    Super exciting work released by Harvey yesterday! We believe that strong evals are the foundation of great products, and in-house evals are an essential step towards this. It's also important that the benchmarks reflect real-world workflows - in this case, the BigLaw benchmark represents work that real lawyers do daily. There’s still plenty of work to do around neutral, third-party review. As was alluded to in their appendix, we’re looking forward to collaborative efforts to develop industry-standard benchmarks for legal tasks with our vals.ai effort.

    Winston Weinberg

    CEO & Co-Founder @ Harvey

    Excited to announce BigLaw Bench, a new standard to evaluate legal AI systems based on real-world billable work that lawyers actually do. We define performance on this benchmark as "What % of a lawyer-quality work product does the model complete for the user?" Harvey’s AI systems outperform leading foundation models on domain-specific tasks, producing 74% of a final, expert lawyer-quality work product. The outputs are more detailed, capture more nuance, and are much closer to final lawyer quality. More details and performance on different tasks coming soon. https://lnkd.in/gV9BSEsB

    Introducing BigLaw Bench (harvey.ai)
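    The post does not describe how the "% of a lawyer-quality work product" figure is computed; one plausible, purely illustrative reading is a weighted-rubric fraction. The rubric items, weights, and function below are invented for illustration and are not Harvey's actual methodology.

```python
# Hypothetical sketch of a "% of lawyer-quality work product" metric:
# score a model answer against a rubric of weighted criteria and report
# the fraction of total weight earned. Rubric contents are invented.

def work_product_score(earned: dict, rubric: dict) -> float:
    """Fraction of total rubric weight earned (0.0 - 1.0)."""
    total = sum(rubric.values())
    got = sum(rubric[k] for k, hit in earned.items() if hit)
    return got / total if total else 0.0

rubric = {"identifies_governing_law": 2.0,
          "cites_correct_precedent": 3.0,
          "drafting_quality": 3.0,
          "no_hallucinated_citations": 2.0}
earned = {"identifies_governing_law": True,
          "cites_correct_precedent": True,
          "drafting_quality": True,
          "no_hallucinated_citations": False}
print(round(work_product_score(earned, rubric), 2))  # 0.8
```

    A real benchmark of this kind would need expert-written rubrics per task and graders (human or model) to fill in the `earned` judgments.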

  • Vals AI reposted this

    Leonard Park

    Experienced LegalTech Product Manager and Attorney | Passionate about leveraging AI/LLMs

    Meta's Llama 3.1 release is at the top of the LegalBench leaderboard, according to vals.ai's most recent benchmarking. This could indicate a really high level of legal-language parsing capability for an LLM, or maybe they trained on the corpus 😜. Llama 3.1 405B is at the top of the charts, AND 😲 Llama 3.1 70B 😲 takes second place above GPT-4o. I'd previously avoided the Llama family of models due to their limited context windows, but that has also been addressed with the 3.1 family. Still, beating OpenAI's GPT-4o at less than 1/5 the cost (exact cost varies by provider) is very impressive.

    Vals AI

    🚨 Llama 3.1 405B sets a new SOTA on several of our legal tasks 🚨
    - Llama 3.1 405B and 70B took the top two spots on LegalBench, a composite of 150 legal tasks across 5 categories.
    - Llama 3.1 405B also set a new SOTA on our ContractLaw tasks, one of our completely private datasets. On TaxEval and CorpFin, it placed 4th and 6th.
    - It's priced similarly to Sonnet 3.5 and GPT-4o on input tokens, at $5 / MTok. Its output tokens are also $5 / MTok, which makes it cheaper for longer outputs.
    With this powerful new entry from Meta, we will see how Anthropic and OpenAI respond. Anthropic has previously teased 3.5 Opus, and OpenAI recently released GPT-4o Mini, which significantly outperformed many of the other "budget" models on our evaluations. See the full results at https://www.vals.ai.
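    The pricing point can be made concrete with back-of-envelope arithmetic. The helper and prices below are illustrative placeholders in $ per million tokens (MTok), not a quote of any provider's current price sheet.

```python
# Back-of-envelope request-cost sketch for the pricing comparison above.
# Prices are illustrative placeholders in $/MTok; check each provider's
# current price sheet before relying on them.

def request_cost(input_toks: int, output_toks: int,
                 in_price: float, out_price: float) -> float:
    """Dollar cost of one request, with prices given in $/MTok."""
    return (input_toks * in_price + output_toks * out_price) / 1_000_000

# Hypothetical: identical $5/MTok input pricing, but a lower output
# price makes long generations noticeably cheaper.
cost_cheap_output = request_cost(10_000, 4_000, in_price=5.0, out_price=5.0)
cost_pricey_output = request_cost(10_000, 4_000, in_price=5.0, out_price=15.0)
print(cost_cheap_output, cost_pricey_output)  # 0.07 0.11
```

    The gap grows with output length, which is why output-token pricing dominates for long-form legal drafting workloads.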

  • Vals AI

    Join our discussion at the LegalTech Hub’s Innovation conference on September 19th! 🔥 Thanks to Nicola Shaver and the Legal Tech team for organizing this incredible conference with a top-tier lineup.

    Nicola Shaver

    Driving the Future of Law at Legaltech Hub | Innovation, AI, Legaltech Leader, Advisor, Investor | LLB, MBA | Fastcase 50, 2021 & 2024, ABA Women of Legaltech, 2022 | Adjunct Professor

    Good news: Registration for our annual Innovation and Legaltech Conference is now live, with early-bird pricing of only $395 per person available until July 31st! Taking place on Thursday, September 19 at New York Law School, this one-day event is not to be missed, including:
    🔥 Insights on scaling innovation and AI deployment from the former head of innovation at PepsiCo
    🔥 A live roundtable with some of the key names in legal AI, including Harvey CEO Winston Weinberg, Leya CEO Max Junestrand, and vLex VP Damien Riehl
    🔥 An AI regulatory update from an expert on the subject, Michael Charles Borrelli
    🔥 Practical insights on adopting responsible AI policies with Hadassah Drukarch of the Responsible AI Institute and Stephanie Goutos
    🔥 The latest on evaluating AI efficacy from a company in the business of evaluating LLM performance, vals.ai's Rayan K. and Langston Nashold, including an overview of a new legal AI evaluation project in the works
    ... with more to come. At this price it's the perfect conference to send your team to, along with your makers, AI experts, data scientists, and data analysts. This is the conference that gets into the technical stuff, introduces you to the cool new products on the market, and lets you ask questions of key people in the market. Not to be missed! Grab your early-bird tickets below. #legaltech #legal #AI #GenAI #legalinnovation

    The 2024 LTH Innovation and Tech-Enabled Lawyering Conference | Registration (2024innovation.legaltechnologyhub.com)

  • Vals AI

    Excellent article from Laurent Wiesel

    Laurent Wiesel

    Legal Engineer | Innovating Law with Advanced AI & Programming Expertise | Transforming Legal Practice through Technology

    ChatGPT and other large language models have blown our minds with capabilities beyond our wildest expectations. But as we move from "wow factor" to real-world implementation, quality becomes paramount. AI quality isn't only about accuracy. In this article, I begin to share my takeaways from last week's AI Quality Conference in San Francisco by introducing 12 pillars of AI quality that every organization should consider in evaluating AI systems, including foundational model quality, data quality, and more. Whether you’re a BigLaw partner, in-house counsel, or legal tech enthusiast, understanding these pillars is crucial. They’re not just theoretical – they’re the building blocks of trustworthy, effective, and ethically sound AI in law. #AIinLegal #LegalAI #LegalTech #FutureOfLaw #AIQuality Which of these pillars intrigues you most? Drop a comment – I’d love to hear your thoughts!

    AI Quality: A 12-Point Framework for Legal AI (Laurent Wiesel on LinkedIn)

  • Vals AI

    Anthropic just released a new model, Sonnet 3.5. But how does it stack up on legal tasks? We found that it achieved SOTA performance on two of our four benchmarks. On the remaining two, it performed well, but behind Opus and GPT-4. Some other notes:
    - It had much lower latency than Opus across the board, and also comes at a much cheaper price point.
    - The rollout was smooth - we didn't perceive any delays or server issues while running our benchmarks.
    - Across the board, it performed much better than Claude Sonnet 3.0.
    We're excited to see what Claude 3.5 Opus will look like. View the full results at https://www.vals.ai.

    Vals.ai: LegalBench (vals.ai)

  • Vals AI

    Now that Gemini has released its pay-as-you-go pricing, we've added it to our public benchmarks at https://vals.ai. Here are some highlights:
    - Gemini still blocks a significant number of requests. It is also overly verbose in many cases and has more difficulty following in-context examples than other models.
    - On LegalBench, the model placed 5th - a few percentage points higher than its predecessor, Gemini 1.0.
    - Across all four tasks, it performed similarly to Sonnet and Command R+ - however, it's less than half the price for output tokens (input token prices are equivalent).
    - The infrastructure still leaves a lot to be desired - we consistently ran into rate limits much lower than the advertised 10K RPD. Unlike other providers, which display daily usage in easy-to-read graphs, it is also very hard to monitor and track usage. Google also maintains two separate APIs for accessing its Gemini models (Vertex and AI Studio), which further decreases usability.
    We plan to construct a dataset to test its long-context capabilities in the coming weeks, and to test the Gemini Flash model as well.

    Vals.ai: LegalBench (vals.ai)
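    For rate-limit issues like those described above, a jittered exponential-backoff wrapper is the usual workaround when running benchmarks at volume. This sketch is provider-agnostic; the exception type, retry count, and delays are placeholder assumptions, not any specific SDK's API.

```python
# Provider-agnostic sketch of jittered exponential backoff for the kind
# of rate-limit errors described above. The exception type, retry count,
# and delays are placeholder assumptions, not any specific SDK's API.
import random
import time

def with_backoff(call, max_retries=5, base_delay=1.0,
                 retryable=(RuntimeError,)):
    """Retry call() on retryable errors, doubling the delay each time."""
    for attempt in range(max_retries):
        try:
            return call()
        except retryable:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the last error
            delay = base_delay * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay * 0.1))  # jitter
```

    In practice you would catch the SDK's own rate-limit exception (a 429-style error) instead of RuntimeError, and honor any Retry-After hint the API returns.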

  • Vals AI reposted this

    As an early entrant in the legal AI space, Harvey had a unique opportunity to shape the future of legal technology. In fact, my first glimpse of GPT-4-caliber systems came through an early demo of Harvey, shortly before the release of ChatGPT in 2022, underscoring just how far ahead of the curve they were.

    While Harvey's stealth strategy has left some in the legal community scratching their heads, comparisons to high-profile failures like Theranos are premature at this stage. Harvey is a young company operating in a nascent field, and they deserve the opportunity to prove themselves. But I do agree that greater community building, transparency, and engagement could've helped build trust and foster collaboration with the broader legal community. Somebody like Alex Su or David Lat would've done them wonders.

    As an aside, I've heard from 'senior technical sources' that the Harvey model touted on OpenAI's website is indeed a significantly changed model, likely involving both pre-training AND fine-tuning with Harvey-provided or -assisted reinforcement learning. This suggests that Harvey has been quietly pushing the boundaries of what's possible in legal AI. But again, without public benchmarks like the Stanford University-led LegalBench, or others like vals.ai, we will never know. The same goes for model outputs from Casetext, Part of Thomson Reuters, etc. https://lnkd.in/gBMT4Fgx

    There is Something about Harvey (medium.com)

Funding

Vals AI: 1 total round

Last Round

Pre-seed
See more info on Crunchbase