Looking for a vector database that stands out from the rest? Look no further! Our #vectordatabase offers unparalleled speed and efficiency. Here's how we're different:
1. Search text, relational, and #unstructured data 7-20x faster than OpenSearch
2. Compress time data to 1% the size of other vector databases
3. Unique time search uses fractional memory (19 MB vs. 2.4 GB), no embedding needed
4. Unlimited time windows for retrieval vs. minutes for competitors
5. All benefits obtained using CPUs (no need for specialized – and expensive – GPUs)
Get in touch with us if you want to learn more! #genai #largelanguagemodels #vectorsearch #vectordb https://lnkd.in/ge3qYbUc
-
Anyone in for an early Christmas present? Snowpark Container Services is now in public preview! It allows you to bring any workload - any language, any runtime, any code - to run right next to your data, and even power the compute with CPUs or GPUs. This makes Snowflake the ideal home for a whole host of new workloads: LLMs, model training and data science, data applications, and even more advanced data processing and data engineering. https://lnkd.in/eXSwK__S
Unlock the New Wave of Gen AI With Snowpark Container Services
snowflake.com
-
Building the future of AI infra. Apache Software Foundation Member; Former Head of Open Source (Ray) + Head of Field Engineering @ Anyscale.
If you are building #RAG applications, the first step is to create embeddings from your data. Check out how you can do this with #Ray and Anyscale at scale and with unprecedented efficiency (90% cost reduction 🤯 vs. the #OpenAI embedding API). If you are interested in trying this, please fill out https://lnkd.in/gDnvqZvR This is achieved through Ray's *unified* support for all steps of the pipeline, with full *flexibility* in hardware / ML model for each operation (e.g. a mix of CPU, A10, and A100). 🚀 https://lnkd.in/gFeZ5akX (bit.ly/rag-embedding) Great collaboration with the Pinecone team (check out their major launch https://lnkd.in/guYY7KPF!). Nathan Cordeiro Ram Sriharsha Roy Miara ❤️
RAG at Scale: 10x Cheaper Embedding Computations with Anyscale and Pinecone
anyscale.com
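To make the embedding step concrete, here is a minimal sketch of a scaled-out embedding pipeline using Ray Data with an open-source sentence-transformers model. The model name, bucket paths, and replica counts are illustrative assumptions, not the actual pipeline from the blog post.

```python
import ray
from sentence_transformers import SentenceTransformer

class Embedder:
    def __init__(self):
        # Each replica loads the model once and reuses it across batches.
        self.model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model

    def __call__(self, batch):
        # batch is a dict of NumPy arrays; add an "embedding" column.
        batch["embedding"] = self.model.encode(batch["text"].tolist(), batch_size=64)
        return batch

ds = ray.data.read_text("s3://my-bucket/docs/")  # hypothetical corpus location
ds = ds.map_batches(
    Embedder,
    concurrency=4,      # number of model replicas (actor pool size)
    num_gpus=1,         # give each replica a GPU; drop for CPU-only runs
    batch_format="numpy",
)
ds.write_parquet("s3://my-bucket/embeddings/")  # hypothetical output location
```

Because Ray Data streams batches through the actor pool, the read, embed, and write stages overlap, which is where the hardware flexibility (CPU readers feeding GPU embedders) pays off.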
-
OCI GPU Clusters backed by high-performance file solutions to feed them are a potent combo!
Bing Chat is so GPU-hungry, Microsoft will rent Oracle's
theregister.com
-
You often hear me speak about domain-specific computing and why it is so critical in reconstructing search to support modern real-time data retrieval. I’ll start by emphasizing again that legacy software-based search solutions have long since reached their glass ceiling and cannot support real-time search at billion scale without compromising on price, speed, or relevancy.

The reason domain-specific computing is so powerful is that it skips the standard software semantics, cache hierarchy, and other CPU abstractions, and instead implements the core parts as a custom datapath processor. Together with a new software stack, this runs search and information-retrieval workloads hundreds of times faster.

Furthermore, the Hyperspace Cloud processing unit includes tens of dedicated search cores running proprietary instruction sets to filter, rank, and aggregate search results in a super-efficient way. These custom search instructions, along with advanced data prefetching, enable speeds that can’t be matched by general-purpose CPUs.

Interested in learning more about how you can shatter the limits of your search? https://lnkd.in/dGVtdfC2 #elasticsearch #dataretrieval #vectorsearch #genai #llm #keywordsearch #lexicalsearch #database
-
Read this detailed blog on how GPU optimization works for training, fine-tuning, or inference for LLMs or LMMs.
Have you wondered how you can optimize your compute for building LLMs? Apart from the model architecture and dataset, the other critical component of building LLMs is infrastructure (GPUs, TPUs, or LPUs)! Training, fine-tuning, or running inference on your models comes with extensive compute requirements and costs, so it is critical to optimize the infrastructure to support your workloads: https://lnkd.in/dJFPwwJz

I have authored a detailed blog on how GPU optimization works for training, fine-tuning, or inference for LLMs or LMMs.

PS: At this point, most models being built are multimodal, hence I like using the term LMMs instead of LLMs.

If you found the article insightful, share this with your network ♻️
-------------------------------------------------
If you don't want to miss any of my posts, go to my profile (Aishwarya Srinivasan) and click on the 🔔 to get notified about all my posts ❤️
GPU scaling for AI workload optimization
aishwaryasrinivasan.substack.com
-
Technology Sherpa with opinions on driving innovation (with governance) through the differentiated use of digital - Data, Apps, and Infrastructure.
Investing $10M and 2 months, Databricks has released DBRX (Base / Instruct), an OSS MoE GenAI model (16 experts) that outperforms current OSS alternatives and can be privately deployed using MosaicML (on 4x H100 GPUs with 320 GB of RAM) - no image as of yet, and the same restrictions on use as Llama (700M users).
GitHub - databricks/dbrx: Code examples and resources for DBRX, a large language model developed by Databricks
github.com
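For anyone who wants to try the model itself, here is a minimal, illustrative sketch of loading DBRX Instruct via Hugging Face Transformers. The prompt and generation settings are placeholder assumptions, and as the post notes the full model needs on the order of 4x H100-class GPUs (~320 GB) to run.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "databricks/dbrx-instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # half precision to fit the MoE weights
    device_map="auto",            # shard across the available GPUs
    trust_remote_code=True,
)

prompt = "In one paragraph, explain why mixture-of-experts models are efficient at inference."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```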
-
The engineer in me couldn't resist running the numbers, since I have always been curious about running LLMs at very low latency and high throughput with minimal GPUs. As the paper mentions, BitNet b1.58 matches 16-bit LLM baselines. The existing alternative, 4-bit quantization, still loses precision and LLM performance. If this becomes a reality while maintaining precision, it could truly revolutionize the game.

For a 7-billion-parameter language model (LLM) like Mistral or Llama:
Total memory = number of weights × precision size
- 32-bit precision: 7 billion × 4 bytes = 26+ GB
- 1-bit precision: 7 billion × 1/8 bytes = 0.8+ GB
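For anyone who wants to reproduce the arithmetic, here is a quick, purely illustrative Python sketch (weights only; real deployments also need memory for activations and the KV cache):

```python
# Back-of-the-envelope weight memory for a 7B-parameter model at different precisions.
PARAMS = 7_000_000_000

for label, bytes_per_weight in [
    ("32-bit", 4),
    ("16-bit", 2),
    ("1.58-bit (ternary)", 1.58 / 8),
    ("1-bit", 1 / 8),
]:
    gib = PARAMS * bytes_per_weight / 2**30
    print(f"{label:>20}: {gib:6.2f} GiB")
# 32-bit ≈ 26.1 GiB, 16-bit ≈ 13.0 GiB, 1.58-bit ≈ 1.3 GiB, 1-bit ≈ 0.8 GiB
```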
Microsoft shows extreme 1.58-bit quantization not only improves cost, latency, and throughput... but also matches or improves #LLM performance too 🤯 👇👇

Last year MSFT introduced BitNet, the 1-bit architecture. BitNet represents parameters in binary instead of full precision, simplifying matrix multiplication, which significantly reduces compute. Building on this, the team explored a 1.58-bit architecture using ternary values; this maintains the efficiency benefits, but also matches or slightly outperforms unquantized transformers! If broadly adopted, the implications for compute scaling laws could be huge. 🎉

Binary = {-1, 1}
Ternary = {-1, 0, 1}
Full precision = FP16 / BF16 (e.g. 0.2843)

𝐌𝐞𝐭𝐡𝐨𝐝:
1️⃣ Adopts a similar #llama2 architecture
0️⃣ Replaces nn.Linear with the BitLinear layer from BitNet
1️⃣ Represents weights as 1.58-bit ternary values
0️⃣ Trained models from 700M to 70B on 100B tokens of the RedPajama dataset
1️⃣ Compared against equivalently trained Llama models with the same data
0️⃣ Trained an additional 3B model on 2T tokens to compare with StableLM 3B

𝐑𝐞𝐬𝐮𝐥𝐭𝐬:
1️⃣ 1.58-bit models equal or outperform unquantized models above 3B size
0️⃣ The 3B model uses 3.5x less memory and is 2.7x faster than the equivalent Llama
1️⃣ The 70B BitNet model is 4x faster with 8.9x greater throughput
0️⃣ The 70B model is as efficient as a full-precision 13B Llama model
1️⃣ The 3B model trained on the same 2T tokens beats StableLM 3B

𝐈𝐧𝐬𝐢𝐠𝐡𝐭𝐬:
1️⃣ 1.58-bit quantization on models below 3B limits the ability to capture information
0️⃣ Performance on 2T tokens vs. StableLM suggests this approach can scale
1️⃣ If the approach generalizes and is adopted mainstream, compute scaling laws will radically change, making #LLMs much more accessible

Paper 👉 https://lnkd.in/giXZw8_u
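For intuition on the ternary weights, here is a minimal sketch of the absmean quantization described in the b1.58 paper, written as illustrative PyTorch rather than the authors' released code.

```python
import torch

def quantize_ternary(w: torch.Tensor):
    """Map full-precision weights to {-1, 0, 1} with a per-tensor absmean scale."""
    scale = w.abs().mean().clamp(min=1e-5)   # gamma = mean(|W|)
    w_q = (w / scale).round().clamp(-1, 1)   # RoundClip to ternary values
    return w_q, scale

w = torch.randn(4096, 4096)                  # a full-precision weight matrix
w_q, scale = quantize_ternary(w)
w_approx = w_q * scale                       # weights actually used in the matmul
print(torch.unique(w_q))                     # tensor([-1., 0., 1.])
```

Because every weight is -1, 0, or 1, the matrix multiplication reduces to additions and subtractions, which is where the memory and throughput gains in the results above come from.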
-
Super excited to share some of the techniques we are using for efficient LLM inference! Many thanks to Daya Khudia and the team for this hard work. Some quick things to know:
1) Use a good serving engine like vLLM or TRT-LLM.
2) Good servers support continuous batching, which groups incoming requests together at a per-token level to balance latency and throughput.
3) At lower batch sizes, inference latency is determined primarily by memory bandwidth. At high batch sizes it is mainly a factor of FLOPS.
4) Mixture of Experts models are awesome because fewer parameters are in use at a given time, reducing demands on memory bandwidth and FLOPS.
5) Quantization helps with little accuracy loss, since it cuts memory-bandwidth needs in proportion to the quantization size. Also, FP8 on newer cards can double the effective FLOPS!
https://lnkd.in/gszuGAjz
Fast, Secure and Reliable: Enterprise-grade LLM Inference
databricks.com
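As a concrete illustration of points 1 and 2 above, here is a minimal offline-inference sketch with vLLM; the model name and sampling settings are placeholder assumptions, and vLLM's engine applies continuous batching across the submitted prompts automatically.

```python
from vllm import LLM, SamplingParams

# Assumed model; swap in whatever checkpoint you actually serve.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
sampling = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Summarize what continuous batching does for LLM serving.",
    "Why do Mixture of Experts models reduce memory-bandwidth pressure?",
]

# The engine schedules these requests together token by token (continuous batching).
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text)
```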
-
Graph semantics and AI trends strategist / Change consultant / Veteran researcher, analyst and reporter
A year ago, Denny Vrandečić of the Wikimedia Foundation, one of the key people behind Wikidata, made some telling remarks about when not to use LLMs during a keynote at The Knowledge Graph Conference: “Why would you ever use a 96 layer LLM with 175 billion parameters to generate a multiplication, which is a single operation with a CPU? Just because an LLM can do these kinds of things doesn’t mean they should be. Why should you be generating knowledge again and again when you can just look it up in a confident way?…. It’s just not very efficient.” What's happened in the year since Vrandečić made this observation? Platform providers such as Fluree are making it possible to harness the power of LLMs and knowledge graphs together in ways that allow more accuracy, efficiency and security. https://t.ly/MyPQ8
GenAI Maturity: From Productivity To Effectiveness - DataScienceCentral.com
https://www.datasciencecentral.com
-
Really interesting work by the folks at Voltron Data on Theseus, their new OLAP GPU query engine for running on distributed GPU clusters. It turns out that for truly, insanely large workloads - somewhere between 30 TB and 100 TB per query - making optimal use of GPUs is more cost-effective than Spark, and you avoid the overhead of dealing with 100+ node clusters. What we call Big Data today is very different from what we called it 10 years ago; 95% of enterprises never see loads like this, but it's still... really cool. I really enjoyed the write-up with all the benchmarks (and the conversion from Spark costs to Big Macs 😄). https://lnkd.in/dpt9EmjV
Benchmarking Report: Theseus Engine
voltrondata.com