Kaizhao Liang’s Post

Kaizhao Liang

ML @ SambaNova Systems

Sometimes, it's NOT just about speed. You can get somewhere very quickly and very wrong. Here is a thorough study on how quantization does non-negligible damage to the quality of LLAMA3 models: https://lnkd.in/gQQVzisb Meanwhile, SambaNova Systems is running LLAMA3 at 400+ tokens/s at its native precision 🚀🚀🚀. Trust me, you will sleep better knowing your enterprise solutions are running without degradation. Good night 😴😴😴 Try it out here before going to bed: https://fast.snova.ai/

AK (@_akhaliq) on X — twitter.com
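
To make the quantization-vs-quality point concrete, here is a minimal, self-contained sketch (not from the linked study, and not SambaNova's stack): it applies round-to-nearest 4-bit quantization to a synthetic weight matrix and measures the error that native-precision inference would not incur. The matrix shape and value scale are made-up for illustration.

```python
import numpy as np

# Synthetic stand-in for one transformer weight matrix (shape/scale are illustrative only).
rng = np.random.default_rng(0)
w = rng.normal(loc=0.0, scale=0.02, size=(4096, 4096)).astype(np.float32)

def quantize_dequantize_int4(w):
    """Round-to-nearest symmetric int4 quantization with one scale per output row,
    immediately dequantized the way an int4 inference kernel effectively uses it."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map each row's max to int4 level 7
    q = np.clip(np.round(w / scale), -8, 7)             # int4 range is [-8, 7]
    return q * scale

w_q = quantize_dequantize_int4(w)
err = w - w_q
print(f"mean |error| per weight : {np.abs(err).mean():.6f}")
print(f"relative Frobenius error: {np.linalg.norm(err) / np.linalg.norm(w):.4f}")
```

Per-weight rounding error like this accumulates across dozens of layers, which is the kind of end-to-end degradation the study linked above reports for quantized LLAMA3 on downstream benchmarks.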

Note: for Llama 2 70B pretraining, Meta achieved 250 tokens/GPU/sec on a 2,000-A100 system a year ago. As for inference, the latest MLCommons benchmark shows the 70B model at 20,000 tokens/sec across 8 GPUs, from which you can quickly verify that 7B Llama inference can be achieved at 3,500 tokens/GPU/sec. With an 800 Gb/s network card and optimization, 800 tokens/sec should be expected for Llama 2 pretraining and 5,000 tokens/sec for Llama 2 inference.
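
For readers following the throughput arithmetic in this comment, a small back-of-envelope sketch; the aggregate figure is the one quoted above (the commenter's number, not an independently verified result):

```python
# Convert the aggregate inference throughput quoted above into a per-GPU figure.
# Input numbers are the commenter's, not verified benchmark results.
llama2_70b_inference_toks_per_sec = 20_000  # 8-GPU system, MLCommons figure cited above
num_gpus = 8

per_gpu = llama2_70b_inference_toks_per_sec / num_gpus
print(f"Llama 2 70B inference: {per_gpu:.0f} tokens/GPU/sec")  # 2500 tokens/GPU/sec
```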

Anand Sampat

We are Hiring! 🤖 ML Executive | 👨🏾💻 Builder | 🎙️ Podcaster & Writer

3mo

Really makes you wonder where else quantization might be reducing the quality of models 🤔

Richard Halkett

Ex-Amazon, ex-Cisco technology executive; Chief Revenue Officer & Chief Customer Officer, SambaNova Systems

3mo

Great points, Kaizhao Liang!
