Looking for a vector database that stands out from the rest? Look no further! Our #vectordatabase offers unparalleled speed and efficiency. Here's how we're different:
1. Search text, relational, and #unstructured data 7-20x faster than OpenSearch
2. Compress time data to 1% the size of other vector databases
3. Unique time search uses fractional memory (19 MB vs. 2.4 GB), no embedding needed
4. Unlimited time windows for retrieval vs. minutes for competitors
5. All benefits obtained using CPUs (no need for specialized – and expensive – GPUs)
Get in touch with us if you want to learn more! #genai #largelanguagemodels #vectorsearch #vectordb https://lnkd.in/ge3qYbUc
-
Anyone in for an early Christmas present? Snowpark Container Services is now in public preview! It allows you to bring any workload - any language, any runtime, any code - to run right next to your data, and even power the compute with CPUs or GPUs. This makes Snowflake the ideal home for a whole host of new workloads: LLMs, model training and data science, data applications, and even more advanced data processing and data engineering. https://lnkd.in/eXSwK__S
Unlock the New Wave of Gen AI With Snowpark Container Services
snowflake.com
-
Building the future of AI infra. Apache Software Foundation Member; Former Head of Open Source (Ray) + Head of Field Engineering @ Anyscale.
If you are building #RAG applications, the first step is to create embeddings from your data. Check out how you can do this with #Ray and Anyscale at scale and with unprecedented efficiency (90% cost reduction 🤯 vs. the #OpenAI embedding API). If you are interested in trying this, please fill out https://lnkd.in/gDnvqZvR This is achieved through Ray's *unified* support for all steps of the pipeline, with full *flexibility* in hardware / ML model for each operation (e.g. a mix of CPU, A10, and A100). 🚀 https://lnkd.in/gFeZ5akX (bit.ly/rag-embedding) Great collaboration with the Pinecone team (check out their major launch https://lnkd.in/guYY7KPF!). Nathan Cordeiro Ram Sriharsha Roy Miara ❤️
RAG at Scale: 10x Cheaper Embedding Computations with Anyscale and Pinecone
anyscale.com
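To make the embedding step concrete, here is a minimal sketch of a scaled-out embedding pipeline using Ray Data with an open-source sentence-transformers model. The model name, bucket paths, and replica counts are illustrative assumptions, not the actual pipeline from the blog post.

```python
import ray
from sentence_transformers import SentenceTransformer

class Embedder:
    def __init__(self):
        # Each replica loads the model once and reuses it across batches.
        self.model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model

    def __call__(self, batch):
        # batch is a dict of NumPy arrays; add an "embedding" column.
        batch["embedding"] = self.model.encode(batch["text"].tolist(), batch_size=64)
        return batch

ds = ray.data.read_text("s3://my-bucket/docs/")  # hypothetical corpus location
ds = ds.map_batches(
    Embedder,
    concurrency=4,      # number of model replicas (actor pool size)
    num_gpus=1,         # give each replica a GPU; drop for CPU-only runs
    batch_format="numpy",
)
ds.write_parquet("s3://my-bucket/embeddings/")  # hypothetical output location
```

Because Ray Data streams batches through the actor pool, the read, embed, and write stages overlap, which is where the hardware flexibility (CPU readers feeding GPU embedders) pays off.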
-
OCI GPU Clusters backed by high-performance file solutions to feed them are a potent combo!
Bing Chat is so GPU-hungry, Microsoft will rent Oracle's
theregister.com
-
You often hear me speak about domain-specific computing and why it is so critical in reconstructing search to support modern real-time data retrieval. I’ll start by emphasizing again that legacy software-based search solutions have long since reached their glass ceiling and cannot support real-time search at billion scale without compromising on price, speed, or relevancy.

The reason domain-specific computing is so powerful is that it skips the standard software semantics, cache hierarchy, and other CPU abstractions, and instead implements the core parts as a custom datapath processor. Together with a new software stack, this runs search and information-retrieval workloads hundreds of times faster.

Furthermore, the Hyperspace Cloud processing unit includes tens of dedicated search cores running proprietary instruction sets to filter, rank, and aggregate search results in a super-efficient way. These custom search instructions, along with advanced data prefetching, enable speeds that can’t be matched by general-purpose CPUs.

Interested in learning more about how you can shatter the limits of your search? https://lnkd.in/dGVtdfC2 #elasticsearch #dataretrieval #vectorsearch #genai #llm #keywordsearch #lexicalsearch #database
-
Read this detailed blog on how GPU optimization works for training, fine-tuning, or inference for LLMs or LMMs.
Have you wondered how you can optimize your compute for building LLMs? Apart from the model architecture and dataset, the other critical component of building LLMs is infrastructure (GPUs, TPUs, or LPUs)! Training, fine-tuning, or running inference on your models comes with extensive compute requirements and costs, so it is critical to optimize the infrastructure to support your workloads: https://lnkd.in/dJFPwwJz

I have authored a detailed blog on how GPU optimization works for training, fine-tuning, or inference for LLMs or LMMs.

PS: At this point, most models being built are multimodal, hence I like using the term LMMs instead of LLMs.

If you found the article insightful, share this with your network ♻️
-------------------------------------------------
If you don't want to miss any of my posts, go to my profile (Aishwarya Srinivasan) and click on the 🔔 to get notified about all my posts ❤️
GPU scaling for AI workload optimization
aishwaryasrinivasan.substack.com
-
Technology Sherpa with opinions on driving innovation (with governance) through the differentiated use of digital - Data, Apps, and Infrastructure.
Investing $10M and 2 months, Databricks has released DBRX (Base / Instruct), an OSS MoE GenAI model (16 experts) that outperforms current OSS alternatives and can be privately deployed using MosaicML (on 4x H100 GPUs with 320 GB of RAM) - no image as of yet, and the same restrictions on use as Llama (700M users).
GitHub - databricks/dbrx: Code examples and resources for DBRX, a large language model developed by Databricks
github.com
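For anyone who wants to try the model itself, here is a minimal, illustrative sketch of loading DBRX Instruct via Hugging Face Transformers. The prompt and generation settings are placeholder assumptions, and as the post notes the full model needs on the order of 4x H100-class GPUs (~320 GB) to run.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "databricks/dbrx-instruct"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    torch_dtype=torch.bfloat16,   # half precision to fit the MoE weights
    device_map="auto",            # shard across the available GPUs
    trust_remote_code=True,
)

prompt = "In one paragraph, explain why mixture-of-experts models are efficient at inference."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```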
-
The engineer in me couldn't resist running the numbers, since I have always been curious about running LLMs at very low latency and high throughput with minimal GPUs. As the paper mentions, BitNet b1.58 matches 16-bit LLM baselines. The existing alternative, 4-bit quantization, still loses precision and LLM performance. If this becomes a reality while maintaining precision, it could truly revolutionize the game.

For a 7-billion-parameter language model (LLM) like Mistral or Llama:
Total memory = number of weights × precision size
- 32-bit precision: 7 billion × 4 bytes = 26+ GB
- 1-bit precision: 7 billion × 1/8 bytes = 0.8+ GB
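For anyone who wants to reproduce the arithmetic, here is a quick, purely illustrative Python sketch (weights only; real deployments also need memory for activations and the KV cache):

```python
# Back-of-the-envelope weight memory for a 7B-parameter model at different precisions.
PARAMS = 7_000_000_000

for label, bytes_per_weight in [
    ("32-bit", 4),
    ("16-bit", 2),
    ("1.58-bit (ternary)", 1.58 / 8),
    ("1-bit", 1 / 8),
]:
    gib = PARAMS * bytes_per_weight / 2**30
    print(f"{label:>20}: {gib:6.2f} GiB")
# 32-bit ≈ 26.1 GiB, 16-bit ≈ 13.0 GiB, 1.58-bit ≈ 1.3 GiB, 1-bit ≈ 0.8 GiB
```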
Microsoft shows extreme 1.58-bit quantization not only improves cost, latency, and throughput... but also matches or improves #LLM performance too 🤯 👇👇

Last year MSFT introduced BitNet, the 1-bit architecture. BitNet represents parameters in binary instead of full precision, simplifying matrix multiplication, which significantly reduces compute. Building on this, the team explored a 1.58-bit architecture using ternary values; this maintains the efficiency benefits, but also matches or slightly outperforms unquantized transformers! If broadly adopted, the implications for compute scaling laws could be huge. 🎉

Binary = {-1, 1}
Ternary = {-1, 0, 1}
Full precision = FP16 / BF16 (e.g. 0.2843)

𝐌𝐞𝐭𝐡𝐨𝐝:
1️⃣ Adopts a similar #llama2 architecture
0️⃣ Replaces nn.Linear with the BitLinear layer from BitNet
1️⃣ Represents weights as 1.58-bit ternary values
0️⃣ Trained models from 700M to 70B on 100B tokens of the RedPajama dataset
1️⃣ Compared against equivalently trained Llama models with the same data
0️⃣ Trained an additional 3B model on 2T tokens to compare with StableLM 3B

𝐑𝐞𝐬𝐮𝐥𝐭𝐬:
1️⃣ 1.58-bit models equal or outperform unquantized models above 3B size
0️⃣ The 3B model uses 3.5x less memory and is 2.7x faster than the equivalent Llama
1️⃣ The 70B BitNet model is 4x faster with 8.9x greater throughput
0️⃣ The 70B model is as efficient as a full-precision 13B Llama model
1️⃣ The 3B model trained on the same 2T tokens beats StableLM 3B

𝐈𝐧𝐬𝐢𝐠𝐡𝐭𝐬:
1️⃣ 1.58-bit quantization on models below 3B limits the ability to capture information
0️⃣ Performance on 2T tokens vs. StableLM suggests this approach can scale
1️⃣ If the approach generalizes and is adopted mainstream, compute scaling laws will radically change, making #LLMs much more accessible

Paper 👉 https://lnkd.in/giXZw8_u
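For intuition on the ternary weights, here is a minimal sketch of the absmean quantization described in the b1.58 paper, written as illustrative PyTorch rather than the authors' released code.

```python
import torch

def quantize_ternary(w: torch.Tensor):
    """Map full-precision weights to {-1, 0, 1} with a per-tensor absmean scale."""
    scale = w.abs().mean().clamp(min=1e-5)   # gamma = mean(|W|)
    w_q = (w / scale).round().clamp(-1, 1)   # RoundClip to ternary values
    return w_q, scale

w = torch.randn(4096, 4096)                  # a full-precision weight matrix
w_q, scale = quantize_ternary(w)
w_approx = w_q * scale                       # weights actually used in the matmul
print(torch.unique(w_q))                     # tensor([-1., 0., 1.])
```

Because every weight is -1, 0, or 1, the matrix multiplication reduces to additions and subtractions, which is where the memory and throughput gains in the results above come from.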
-
Super excited to share some of the techniques we are using for efficient LLM inference! Many thanks to Daya Khudia and the team for this hard work. Some quick things to know:
1) Use a good serving engine like vLLM or TRT-LLM.
2) Good servers support continuous batching, which groups incoming requests together at a per-token level to balance latency and throughput.
3) At lower batch sizes, inference latency is determined primarily by memory bandwidth. At high batch sizes it is mainly a factor of FLOPS.
4) Mixture of Experts models are awesome because fewer parameters are in use at a given time, reducing demands on memory bandwidth and FLOPS.
5) Quantization helps with little accuracy loss, since it cuts memory-bandwidth needs in proportion to the quantization size. Also, FP8 on newer cards can double the effective FLOPS!
https://lnkd.in/gszuGAjz
Fast, Secure and Reliable: Enterprise-grade LLM Inference
databricks.com
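As a concrete illustration of points 1 and 2 above, here is a minimal offline-inference sketch with vLLM; the model name and sampling settings are placeholder assumptions, and vLLM's engine applies continuous batching across the submitted prompts automatically.

```python
from vllm import LLM, SamplingParams

# Assumed model; swap in whatever checkpoint you actually serve.
llm = LLM(model="mistralai/Mistral-7B-Instruct-v0.2")
sampling = SamplingParams(temperature=0.7, max_tokens=128)

prompts = [
    "Summarize what continuous batching does for LLM serving.",
    "Why do Mixture of Experts models reduce memory-bandwidth pressure?",
]

# The engine schedules these requests together token by token (continuous batching).
outputs = llm.generate(prompts, sampling)
for out in outputs:
    print(out.outputs[0].text)
```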
-
Graph semantics and AI trends strategist / Change consultant / Veteran researcher, analyst and reporter
A year ago, Denny Vrandečić of the Wikimedia Foundation, one of the key people behind Wikidata, made some telling remarks about when not to use LLMs during a keynote at The Knowledge Graph Conference: “Why would you ever use a 96 layer LLM with 175 billion parameters to generate a multiplication, which is a single operation with a CPU? Just because an LLM can do these kinds of things doesn’t mean they should be. Why should you be generating knowledge again and again when you can just look it up in a confident way?…. It’s just not very efficient.” What's happened in the year since Vrandečić made this observation? Platform providers such as Fluree are making it possible to harness the power of LLMs and knowledge graphs together in ways that allow more accuracy, efficiency and security. https://t.ly/MyPQ8
GenAI Maturity: From Productivity To Effectiveness - DataScienceCentral.com
https://www.datasciencecentral.com
-
Really interesting work by the folks at Voltron Data on Theseus, their new OLAP GPU query engine for running on distributed GPU clusters. It turns out that for truly, insanely large workloads - somewhere between 30 TB and 100 TB per query - making optimal use of GPUs is more cost-effective than Spark, and you avoid the overhead of dealing with 100+ node clusters. What we call Big Data today is very different from what we called it 10 years ago; 95% of enterprises never see loads like this, but it's still... really cool. I really enjoyed the write-up with all the benchmarks (and the conversion from Spark costs to Big Macs 😄). https://lnkd.in/dpt9EmjV
Benchmarking Report: Theseus Engine
voltrondata.com