Sometimes, it's NOT just about speed. You can get somewhere very quickly and very wrong. Here is a thorough study of how quantization does non-negligible damage to the quality of LLAMA3 models: https://lnkd.in/gQQVzisb Meanwhile, SambaNova Systems is running LLAMA3 at 400+ tokens/s at its native precision 🚀🚀🚀 Trust me, you will sleep better knowing your enterprise solutions are running without degradation. Good night 😴😴😴 Try it out here before going to bed: https://fast.snova.ai/
Really makes you think where else quantization might be reducing the quality of models 🤔
Well said, Kaizhao Liang!
Great points, Kaizhao Liang!
Note: for Llama 2 70B pretraining, Meta achieved ~250 tokens/GPU/sec on a 2,000-A100 system a year ago. As for inference, the latest MLCommons benchmark shows the 70B model at ~20,000 tokens/sec per 8-GPU node, from which you can quickly verify that 7B Llama inference can reach ~3,500 tokens/GPU/sec. With an 800 Gb/s network card and some optimization, ~800 tokens/sec should be expected for Llama 2 pretraining and ~5,000 tokens/sec for Llama 2 inference.
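For anyone wanting to sanity-check the arithmetic, here is a quick back-of-the-envelope sketch in Python. All input figures are the claims quoted above (Meta's pretrain rate, the MLCommons inference number), not independently verified measurements:

```python
# Back-of-the-envelope throughput math using the claimed figures above.

# Claimed: 70B inference at ~20,000 tokens/sec per 8-GPU node.
toks_per_8gpu_node = 20_000
per_gpu_inference_70b = toks_per_8gpu_node / 8
print(f"70B inference: {per_gpu_inference_70b:.0f} toks/GPU/sec")  # 2500

# Claimed: Llama 2 70B pretraining at ~250 toks/GPU/sec on a 2,000-A100 system.
pretrain_toks_per_gpu = 250
cluster_gpus = 2_000
cluster_rate = pretrain_toks_per_gpu * cluster_gpus
print(f"70B pretrain cluster rate: {cluster_rate:,} toks/sec")  # 500,000
```

This just divides and multiplies the quoted numbers; the 7B projection in the comment assumes additional scaling from model size and interconnect that these two lines don't capture.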