Peter Gostev’s Post


Head of AI @ Moonpig

Ok, this is highly speculative, but I was looking at OpenAI's rate limits, and it was pretty striking that despite the original GPT-4 having been out for so long (hence OpenAI having had time to build up capacity), the rate limits for older models are far lower than for newer models. I'm sure there are many moving parts that dictate the limits, but I wondered: what if model size is directly driving the rate limits you get?

So what I did was take the rumoured 1.76trn parameters for the original GPT-4 and its rate limit of 1m tokens per minute (at Tier 5), then applied that ratio to the rate limits for GPT-4-Turbo, GPT-4o and GPT-4o-mini to deduce their respective parameter counts. What was interesting is that the answers look fairly reasonable: it seems fairly plausible that GPT-4o-mini could be a 12b model, and 12b models are priced at about the same level by independent cloud vendors (c. $0.15 per million tokens, vs $0.16/$0.60 for GPT-4o-mini).

We know that GPT-4 was a mixture of experts; we don't know whether the others are (I doubt GPT-4o and GPT-4o-mini are), so that might slightly complicate the picture. I think rate limits might be a slightly less noisy metric than price, since pricing is a commercial & competitive decision, while rate limits are probably just set by the infra team.

Anyhow, take it for what it is - just a thought with some basic maths over the top, could easily be wrong!
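Here's the maths as a quick sketch. The GPT-4 figures are the rumoured ones above; the Tier 5 limits for the other models are illustrative placeholders, not official numbers - check OpenAI's rate-limit docs for the real ones:

```python
# Hypothesis from the post: parameter count scales inversely with the
# Tier 5 rate limit, anchored on the rumoured original GPT-4 figures.
GPT4_PARAMS = 1.76e12   # rumoured 1.76trn parameters
GPT4_TPM = 1_000_000    # Tier 5 limit cited above, tokens per minute

# Assumed Tier 5 limits for the other models (illustrative placeholders).
tier5_tpm = {
    "gpt-4-turbo": 2_000_000,
    "gpt-4o": 30_000_000,
    "gpt-4o-mini": 150_000_000,
}

for model, tpm in tier5_tpm.items():
    implied = GPT4_PARAMS * GPT4_TPM / tpm
    print(f"{model}: ~{implied / 1e9:.0f}b parameters implied")
# gpt-4-turbo: ~880b, gpt-4o: ~59b, gpt-4o-mini: ~12b
```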

Jean Ibarz

PhD in Computer Science | Generative AI Enthusiast | Dev Scientist at Aura Aero

1mo

12b active parameters for gpt-4o-mini seems plausible to me. However, what makes you think it is not a Mixture of Experts model?

Duane K.

Principal Software Engineer | FANG+ | Startups | Engineering Leadership | Full-Stack | DevOps | Web Dev | InfoSec | A.I. | Python | C#/.Net | FinTech | Telecom | HealthTech | EdTech | GameDev | Mentor | Daily Learner

1mo

Network rate limits and param size are different things and not at all connected with each other. You're talking about apples and oranges.

Benjamin Anderson

Stanford CS Grad, Chief Scientist, Taylor AI (YC S23)

1mo

It is alleged that 4o-mini is a mixture of MANY experts - I heard many, many (10s to 100s) x 3B. That would explain why it's capable but still relatively cheap and fast to run.
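If that rumour were true, the arithmetic would look something like this. The expert count and routing top-k below are pure assumptions - only the 3b-per-expert figure comes from the rumour:

```python
# Illustrative mixture-of-experts arithmetic for the "many x 3B" rumour.
# Per-token compute tracks the *active* parameters, not the total, which
# is why a many-expert model can be capable yet cheap and fast to run.
expert_size = 3e9   # 3b per expert (the rumoured figure)
num_experts = 32    # assumed: somewhere in the "10s to 100s"
top_k = 4           # assumed: experts activated per token

total_params = num_experts * expert_size   # 96b stored
active_params = top_k * expert_size        # 12b used per token

print(f"total: {total_params / 1e9:.0f}b, "
      f"active per token: {active_params / 1e9:.0f}b")
```

With these made-up numbers the active count happens to land on the 12b figure from the post even though the total is far larger - which is presumably what rate-limit or price ratios would reflect.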

Daniel Mutch

Head of AI | Hands-on engineering leader of large teams that deliver 25% faster

1mo

Interesting... Because I also like speculating... that seems pretty small for 4o. Maybe it's 110b (1/8th) and the other 2x comes from hardware improvements?

Mattia Notari

System Integrator, AI Engineer & LLM Ops

1mo

There's something about the limits that doesn't add up for me. Those who use open-source models know that speed changes based on the context sent to the model, yet OpenAI manages to maintain the same speed regardless of the context, and for all users. That requires a very high level of complexity, and I believe those ratios are too aggressive: serving user requests faster exponentially increases the load on the servers. Does it make sense to take this into consideration?

Bryan Brownlie

Emerald Strategy Group: Strategic Advisory - M&A - Transaction & Project Financing - Due Diligence - Private Equity - Renewable Energy

1mo

This is interesting. Another approach would be to look at the scaling laws (which are brilliantly explained in Meta's 'The Llama 3 Herd of Models' paper) and work backwards: performance on benchmarks can be tied back mathematically to parameter count. (That's not the purpose of scaling laws - they're for optimising the number of parameters for a specific training budget - but they do also predict the benchmark performance of that number of parameters.) It too would be wildly speculative, as 4o-mini is clearly a pruned GPT-4o and so likely more efficient than a model trained from scratch at that size, but it should ballpark it! A rough sketch of the back-solve is below.
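A minimal sketch of that back-solve, assuming a Chinchilla-style loss law with the Hoffmann et al. (2022) fitted constants - purely illustrative, since Meta fit their own coefficients in the Llama 3 paper, benchmark scores would first need mapping to a loss, and the training-token count is a guess:

```python
# Back-solve parameter count N from an observed loss, assuming a
# Chinchilla-style law L(N, D) = E + A / N**alpha + B / D**beta.
# Constants are the Hoffmann et al. (2022) fits, used for illustration.
E, A, ALPHA = 1.69, 406.4, 0.34
B, BETA = 410.7, 0.28
D = 10e12  # assumed 10trn training tokens

def implied_params(observed_loss: float) -> float:
    """Invert the loss law for N, holding the data term fixed."""
    residual = observed_loss - E - B / D**BETA
    if residual <= 0:
        raise ValueError("loss too low for these constants")
    return (A / residual) ** (1 / ALPHA)

print(f"~{implied_params(1.93) / 1e9:.0f}b parameters")  # ~14b with these toy inputs
```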

Gus Bekdash

Top Voice in strategy & AI. Turn Ideas into Results: v CTO, Chief Architect & Strategist focused on growth ✪ $Billion+ solutions ✪ AI Expert ✪ Executive ✪ Author ✪ Consultant

1mo

Peter Gostev, SPECULATION ALERT: GPT is a multi-threaded, multi-node system, and the rate limit per channel or user must be a function of many tweakable parameters that distribute the system's resources (the aggregate rate limit) over a number of concurrent users. The number of parameters could affect that, but there are so many variables here that assuming proportionality with a single variable is a very wild guess on top of an assumption that all other things are equal. Nevertheless, it's an interesting experiment.

Swati Jain Goel

Co-founder & Principal data scientist @InteligenAI | Maximizing your business potential through full-stack AI solution development service

1mo

Rate limits could also have a commercial angle, designed to encourage upgrading to higher tiers or newer models. This could skew the relationship between rate limits and model size.
