Zhe Zhang’s Post

Building the future of AI infra. Apache Software Foundation Member; Former Head of Open Source (Ray) + Head of Field Engineering @ Anyscale.

7mo Edited

If you are building #RAG applications, the first step is to create embeddings from your data. Check out how you can do this with #Ray and Anyscale at scale and with unprecedented efficiency (90% cost reduction 🤯 vs. the #OpenAI embedding API). If you are interested in trying this, please fill out https://lnkd.in/gDnvqZvR. This is achieved through Ray's *unified* support for all steps of the pipeline, with full *flexibility* on hardware / ML model for each operation (e.g. a mix of CPU, A10, and A100). 🚀 https://lnkd.in/gFeZ5akX (bit.ly/rag-embedding) Great collaboration with the Pinecone team (check out their major launch https://lnkd.in/guYY7KPF!). Nathan Cordeiro Ram Sriharsha Roy Miara ❤️

RAG at Scale: 10x Cheaper Embedding Computations with Anyscale and Pinecone

anyscale.com

3 Comments

Zhe Zhang

Building the future of AI infra. Apache Software Foundation Member; Former Head of Open Source (Ray) + Head of Field Engineering @ Anyscale.

7mo

If you have an embedding workload and are interested in trying this out, please fill out https://forms.gle/gSQrj6XDAVQGkRvk8. Thanks!

1 Reaction

Rohan Paul

Bridging the gap between AI research and practical applications. → Join my LLM Newsletter. AI Engineer and Entrepreneur (Ex Investment Banking)

7mo

Awesome - a 10x cost reduction: generating 1 billion embeddings costs $60,000 with OpenAI vs. $6,000 with Anyscale.
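A quick check of how the two dollar figures quoted in this comment relate to the "10x" and "90%" claims in the post. This is only arithmetic on the quoted numbers; the prices themselves are taken from the comment and are not independently verified.

```python
# Figures quoted in the comment (USD per 1 billion embeddings).
openai_cost = 60_000
anyscale_cost = 6_000

ratio = openai_cost / anyscale_cost          # how many times cheaper
reduction = 1 - anyscale_cost / openai_cost  # fraction of the cost saved

print(f"{ratio:.0f}x cheaper, {reduction:.0%} cost reduction")
```

A 10x lower price and a 90% cost reduction are the same statement expressed two ways.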

1 Reaction


More Relevant Posts

Robert Kim, MBA

Technology Sherpa with opinions on driving innovation (with governance) through the differentiated use of digital - Data, Apps, and Infrastructure.
5mo
Investing $10M and 2 months, Databricks has released DBRX (Base / Instruct), an OSS MoE (16 experts total) GenAI model that outperforms current OSS alternatives and can be privately deployed using MosaicML (and 4x H100 GPUs w/ 320GB of RAM) - no image as of yet, and the same restrictions on use as Llama (700M users).

GitHub - databricks/dbrx: Code examples and resources for DBRX, a large language model developed by Databricks

github.com
Fabiano Aquilani

Account Executive @Datadog | Helping and Enabling companies monitoring and visibility in the cloud age
11mo
Datadog’s Container Overview dashboard now gives you out-of-the-box visibility into CPU, memory, and network metrics from containers running OTel-instrumented applications. Learn more about this and other features announced at DASH 2023: https://lnkd.in/ehRu5SKg Datadog #monitoring #observability #innovation

DASH 2023: Guide to Datadog's newest announcements

datadoghq.com
Meilin Xu
9mo
Looking for a vector database that stands out from the rest? Look no further! Our #vectordatabase offers unparalleled speed and efficiency. Here's how we're different:
1. Search text, relational, and #unstructured data 7-20x faster than OpenSearch
2. Compress time data to 1% the size of other vector databases
3. Unique time search uses fractional memory (19 MB vs. 2.4 GB), no embedding needed
4. Unlimited time windows for retrieval vs. minutes for competitors
5. All benefits obtained using CPUs (no need for specialized – and expensive – GPUs)
Get in touch with us if you want to learn more! #genai #largelanguagemodels #vectorsearch #vectordb https://lnkd.in/ge3qYbUc

KX Launches KDB.AI Server Edition For Enterprise-Scale Generative AI | KX

https://kx.com
Gabrielle Davelaar

Strategic AI Partnership Builder | Driving Innovation & Collaboration in Artificial Intelligence
7mo
Microsoft just released a paper on Splitwise, an optimization technique for LLM inference that reduces the number of GPUs required. In short: machines are maintained in different pools, dedicated to the two distinct LLM inference phases (prompt and token generation). A mixed pool grows and shrinks according to runtime demand. The KV-cache, which holds the state of the query after the prompt phase, is transferred from the prompt machines to the token machines over InfiniBand with very low latency. Check out the blog post to learn more: https://lnkd.in/e7QX4BVx Alan Weaver Kathryn Jesaitis Papandrew Alex Zeltov Eyas Taifour Răzvan Tănase
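The two-phase hand-off described in this post can be pictured with a toy sketch. Everything here is hypothetical and for illustration only: the `Request` class, the function names, and the list used as a stand-in for the KV-cache are not from the Splitwise paper, and the mixed pool and the InfiniBand transfer are not modeled.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    kv_cache: list = field(default_factory=list)  # stand-in for real KV tensors
    output: list = field(default_factory=list)


def prompt_phase(req):
    """Runs on a prompt-pool machine: process the whole prompt once."""
    req.kv_cache = req.prompt.split()  # toy "state" built from the prompt
    return req


def token_phase(req):
    """Runs on a token-pool machine: generate tokens one at a time."""
    for i in range(req.max_new_tokens):
        req.output.append(f"tok{i}")
        req.kv_cache.append(req.output[-1])  # the cache grows with each token
    return req


def serve(requests):
    prompt_queue = deque(requests)
    token_queue = deque()
    # Phase 1: prompt machines build the KV-cache.
    while prompt_queue:
        # Moving the request to the other queue stands in for the KV-cache transfer.
        token_queue.append(prompt_phase(prompt_queue.popleft()))
    # Phase 2: token machines continue from the transferred state.
    done = []
    while token_queue:
        done.append(token_phase(token_queue.popleft()))
    return done


finished = serve([Request("what is splitwise", 3), Request("hello", 2)])
for r in finished:
    print(r.output, len(r.kv_cache))
```

The point of the split is that each pool only ever runs one kind of work, so each can be sized and provisioned for that phase.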

Splitwise improves GPU usage by splitting LLM inference phases

https://www.microsoft.com/en-us/research

2 Comments
Cameron Bahar

SVP & GM- IaaS @ OCI | Cloud & Distributed Systems Pioneer
9mo
OCI GPU Clusters backed by high performance file solutions to feed them are a potent combo!

Bing Chat is so GPU-hungry, Microsoft will rent Oracle's

theregister.com

5 Comments
Prodigy Education

51,434 followers
8mo Edited
In the latest Engineering Blog post by Prodigy Education, Staff Site Reliability Engineer Erik Krieg takes a look at how to run a GPU-accelerated open-source Large Language Model (LLM) inference workload using Elastic Kubernetes Service (EKS). https://lnkd.in/geSPErxG #insideprodigy #largelanguagemodels #engineeringblog

Running GPU-Accelerated LLM Workloads on EKS

medium.com
AIPressRoom

132 followers
1w
#Topics PyTorch/XLA 2.4 improves Pallas and adds “eager mode”

And now, instead of having to call xm.mark_step(), you can call torch_xla.sync(). These improvements make it easier to convert your code over to PyTorch/XLA and improve the developer workflow. For more changes to API calls, check out the release notes.

Experimental eager mode

If you’ve been working with PyTorch/XLA for a while, you know that we refer to models being “lazily executed.” That means that PyTorch/XLA creates the compute graph of operations before sending models over to be executed on the XLA device target hardware. With the new eager mode, operations are compiled and then immediately executed on the target hardware.

The catch to this feature, though, is that TPUs themselves do not have a true eager mode, since each instruction is not sent to the TPU right away by default. On TPUs, we achieve this by adding a “mark_step” call after each PyTorch operation to force the compilation and execution. This results in the functionality of eager mode, but as an emulation rather than as a native feature.

Our intent with eager mode in this release is not to run it in your production environment, but rather in your own local environments. We hope that eager mode makes it easier to debug your models ...
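The difference between lazy execution and an emulated eager mode can be illustrated with a toy in plain Python. This is not the PyTorch/XLA API: `LazyValue` and its methods are hypothetical names, and the `sync()` method only mimics the idea of forcing recorded operations to run, which the excerpt attributes to a "mark_step" call after each operation.

```python
class LazyValue:
    """Toy stand-in for a lazily executed tensor (not the PyTorch/XLA API)."""

    def __init__(self, value, eager=False):
        self.value = value
        self.pending = []  # recorded operations, i.e. the "graph"
        self.eager = eager

    def add(self, n):
        self.pending.append(lambda v: v + n)
        return self._maybe_sync()

    def mul(self, n):
        self.pending.append(lambda v: v * n)
        return self._maybe_sync()

    def _maybe_sync(self):
        # Emulated eager mode: force execution after every single operation.
        if self.eager:
            self.sync()
        return self

    def sync(self):
        """Run every recorded operation, then clear the record."""
        for op in self.pending:
            self.value = op(self.value)
        self.pending.clear()
        return self.value


lazy = LazyValue(2).add(3).mul(4)
print(lazy.value, len(lazy.pending))  # still 2, with two operations recorded
print(lazy.sync())                    # both operations run now

eager = LazyValue(2, eager=True).add(3).mul(4)
print(eager.value, len(eager.pending))  # already computed, nothing pending
```

In the lazy case an intermediate value cannot be inspected until a sync happens, which is why an eager mode is convenient for debugging.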

PyTorch/XLA 2.4 improves Pallas and adds “eager mode”

https://aipressroom.com
Rick Tolan

Digital Transformation | Enterprise Software | Intelligent Automation | Generative AI | DataOps | IDP Automation | OCR | NLP | RPA | Process Improvement | Process Mining | Content Services
8mo Edited
Eduardo Alvarez does well in this article. I especially like his attention to scalability, fidelity, and latency. Using RAG, we can keep the focus on the data pipeline. Additionally, the pre- and post-processing around retrieval offloads work to the CPU for more responsive actions (serving a narrow audience rather than the billions served). Utility via CPU infrastructure is a unique take. The examples are good, and the easy-to-understand walkthrough clearly covers the vector embedding.

Retrieval Augmented Generation (RAG) Inference Engines with LangChain on CPUs

towardsdatascience.com
Steven Coochin

Lilypad Network Chief Innovation Officer
3mo
Lilypad is currently free to use and lets you run your own containerised workloads using connected GPUs for heavier compute processing. Here's a blog post I wrote on how to get started with your own job modules optimised to run on the Lilypad Network. https://lnkd.in/gg4TdqsK

How to: Build a custom job module on Lilypad

blog.lilypadnetwork.org
Aishwarya Srinivasan
4mo
Have you wondered how you can optimize your compute for building LLMs? Apart from the model architecture and dataset, the other most important component of building LLMs is infrastructure (GPUs, TPUs, or LPUs)! Training, fine-tuning, or running inference on your models comes with extensive compute requirements and costs. Hence it is critical to optimize the infrastructure to support your workloads: https://lnkd.in/dJFPwwJz I have authored a detailed blog on how GPU optimization works for training, fine-tuning, or inference for LLMs or LMMs. PS: At this point, most models being built are multimodal, hence I like using the term LMMs instead of LLMs. If you found the article insightful, share this with your network ♻️ ------------------------------------------------- If you don't want to miss any of my posts, go to my profile (Aishwarya Srinivasan) and click on the 🔔 to get notified about all my posts ❤️

GPU scaling for AI workload optimization

aishwaryasrinivasan.substack.com

7 Comments
