Progress in Gen AI and Open-Source LLMs, New Product Launches, and Educational Resources
Provectus AI review #3

Generative AI and Large Language Models (LLMs) are swiftly transforming the global landscape. Their range of use cases, spanning both technical and business domains, continues to expand daily, making it a challenge to keep up with the latest developments.

The Provectus team is committed to delivering regular comprehensive overviews of the field. In this edition of the "Provectus AI Review" series, we explore the latest technical and educational advancements in Gen AI and LLMs.

Technical Updates and Resources

Fine-tuning Falcon-40B and other LLMs has never been easier, thanks to Amazon SageMaker Studio notebooks and QLoRA. Hugging Face and AWS released a tutorial that walks through fine-tuning open-source LLMs, using Falcon-40B as the example.

The tutorial describes a process for fine-tuning LLMs using Amazon SageMaker and Hugging Face's Parameter-Efficient Fine-Tuning (PEFT) library, combined with quantization via the bitsandbytes library, to support interactive fine-tuning of extremely large models within a single notebook instance.

While the tutorial focuses on fine-tuning Falcon-40B on a single ml.g5.12xlarge instance, which includes four NVIDIA A10G GPUs, the same strategy can be applied to even larger models using p4d/p4de notebook instances.

However, large models often do not fit into memory on a single GPU, or even on several. To overcome this limitation, the tutorial uses a new technique called Quantized LLMs with Low-Rank Adapters (QLoRA), which significantly reduces the memory footprint of LLMs while maintaining performance.

The method is explained in more detail in a blog post by Hugging Face and the authors of the QLoRA paper, which also covers its integration with the Transformers and PEFT libraries.
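For orientation, here is a minimal sketch of a QLoRA setup with the Transformers, PEFT, and bitsandbytes libraries. It is not the tutorial's exact code: the LoRA hyperparameters and the target_modules choice are illustrative assumptions for Falcon-style models.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "tiiuae/falcon-40b"  # Falcon-40B checkpoint on the Hugging Face Hub

# 4-bit NF4 quantization via bitsandbytes keeps the frozen base weights small.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,
)
model = prepare_model_for_kbit_training(model)

# Only the small low-rank adapter matrices are trained; values are illustrative.
lora_config = LoraConfig(
    r=8,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # attention projection in Falcon-style models
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # trainable params are a tiny fraction of the total
```

From here, the quantized, adapter-equipped model can be passed to a standard Trainer loop inside the SageMaker Studio notebook.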


UC Berkeley has developed vLLM, an open-source library that provides a faster, more cost-effective option for LLM inference and serving.

vLLM leverages PagedAttention, a novel attention algorithm that manages attention keys and values more efficiently. The performance gains from vLLM and PagedAttention outpace the current standards in LLM serving, offering up to 24x higher throughput than Hugging Face Transformers without requiring any changes to the model architecture.

One research organization that adopted vLLM as its backend reported efficient handling of peak traffic, a 5x performance improvement, reduced operational costs, and better resource utilization.

The researchers identified memory as a primary bottleneck for LLM serving performance, since PagedAttention targets the autoregressive decoding process, which is memory-bound. During decoding, each input token generates corresponding attention key and value tensors, known as the KV cache, which are kept in GPU memory to generate future tokens. The new attention algorithm allows these memory blocks to be non-contiguous, enabling more flexible memory management analogous to an operating system's virtual memory: 'blocks' play the role of 'pages', 'tokens' of 'bytes', and 'sequences' of 'processes'.

This approach curbs memory wastage to less than 4% and facilitates efficient memory sharing during parallel sampling. This near-optimal memory usage increases the system's ability to batch sequences, improving GPU utilization and thereby significantly enhancing throughput. The efficient memory sharing of PagedAttention reduces the memory overhead of complex sampling algorithms, such as parallel sampling and beam search, lowering memory usage by up to 55%, which can lead to an improvement in throughput by up to 2.2x.

PagedAttention enables efficient memory sharing, which is crucial in scenarios like parallel sampling where multiple output sequences are generated from the same prompt. In this case, computation and memory for the prompt can be shared between output sequences. This ability to share memory is enabled through PagedAttention's block table. Similar to how physical pages are shared across processes, different sequences in PagedAttention can share blocks by mapping their logical blocks to the same physical block. To ensure safe sharing, PagedAttention maintains the reference counts of the physical blocks and employs a Copy-on-Write mechanism.
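To make the block-table idea concrete, here is a toy Python sketch of reference-counted block sharing with copy-on-write. It is purely illustrative and not vLLM's actual implementation; every name in it is hypothetical.

```python
class BlockManager:
    """Toy model of PagedAttention-style KV-cache block sharing."""

    def __init__(self, block_size: int = 16):
        self.block_size = block_size
        self.ref_counts = {}        # physical block id -> reference count
        self.next_physical = 0

    def allocate(self) -> int:
        # Hand out a fresh physical block with a single owner.
        pid = self.next_physical
        self.next_physical += 1
        self.ref_counts[pid] = 1
        return pid

    def share(self, pid: int) -> int:
        # Map another sequence's logical block to the same physical block.
        self.ref_counts[pid] += 1
        return pid

    def write(self, pid: int) -> int:
        # Copy-on-write: if the block is shared, copy it before modifying.
        if self.ref_counts[pid] > 1:
            self.ref_counts[pid] -= 1
            return self.allocate()
        return pid


manager = BlockManager()
prompt_block = manager.allocate()     # prompt KV-cache block for sequence A
shared = manager.share(prompt_block)  # sequence B reuses the same prompt block
private = manager.write(shared)       # B appends a token: the block is copied first
```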

Built on PagedAttention, vLLM supports a range of models through a user-friendly, high-performance interface. Further technical details about vLLM and PagedAttention can be found in the corresponding GitHub repository, with a detailed paper forthcoming. vLLM is easy to install and supports both offline inference and online serving.
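As a quick illustration of the offline-inference interface, here is a minimal sketch using vLLM's Python API; the model id and sampling settings are examples, not taken from the article.

```python
from vllm import LLM, SamplingParams

# Load a small example model and define sampling settings.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=128)

prompts = ["Explain PagedAttention in one sentence."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
```

For online serving, vLLM also ships a server entry point that can be launched from the command line.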


This article delves into how to scale PyTorch model training with mixed-precision techniques and multi-GPU training strategies, using the open-source Fabric library and a Vision Transformer (ViT) model for image classification as the running example.

The focus is on harnessing the power of mixed-precision techniques and multi-GPU training paradigms, rather than resorting to low-level machine optimizations. 

Initially, the ViT model was trained from scratch on a basic dataset, taking around 60 minutes and reaching 62% test accuracy. Because training a deep learning model from scratch is computationally expensive and time-consuming, the article instead starts from a ViT model pre-trained on a different dataset (ImageNet) and fine-tunes it, reaching 95% test accuracy in just 20 minutes.

Mixed-precision training combines 16-bit and 32-bit precision to maintain accuracy while speeding up computation and reducing memory usage. Weights are converted from FP32 to FP16, reducing memory footprint and speeding computation. Gradients are computed using FP16 weights and then converted back to FP32 to maintain numerical stability during weight updates.

The training strategy uses Brain Floating Point (bfloat16), a format developed by Google for machine learning workloads, originally for its TPUs. Bfloat16 offers a wider dynamic range than the conventional float16 format, allowing it to represent both very large and very small numbers, which is essential for stable deep learning training.
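As a minimal sketch of what bf16 mixed-precision training looks like in plain PyTorch (not the article's exact code; the model and data below are stand-ins), the forward pass runs under an autocast context while the master weights stay in float32:

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

# Stand-in model and data; the article fine-tunes a ViT for image classification.
model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(256, 512), torch.randint(0, 10, (256,)))
dataloader = DataLoader(dataset, batch_size=32)

for inputs, targets in dataloader:
    inputs, targets = inputs.cuda(), targets.cuda()
    optimizer.zero_grad()
    # Forward pass and loss computed in bfloat16; weights remain float32.
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = F.cross_entropy(model(inputs), targets)
    # bfloat16 preserves float32's exponent range, so no GradScaler is needed
    # (float16 mixed precision typically requires one).
    loss.backward()
    optimizer.step()
```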

The article further introduces multi-GPU training with Fully Sharded Data Parallelism (FSDP), which combines data parallelism (splitting mini-batches of data across GPUs) with sharding of the model itself (its parameters, gradients, and optimizer states) across GPUs. This technique reduced training time to about 2 minutes, a significant speedup.

Mixed-precision training drastically reduces training time without sacrificing predictive performance. Coupled with fully sharded data parallelism, it enables faster training across multiple GPUs: training time dropped from 18 minutes to about 2 minutes (given access to 4 GPUs).
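Putting the two together, here is a hedged sketch of how mixed precision and FSDP can be combined with the open-source Fabric library; the model and data are placeholders, and the exact flags may differ from the article's setup.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
import lightning as L  # Lightning Fabric

# bf16 mixed precision plus FSDP across 4 GPUs (placeholder configuration).
fabric = L.Fabric(accelerator="cuda", devices=4, strategy="fsdp", precision="bf16-mixed")
fabric.launch()

model = torch.nn.Sequential(torch.nn.Linear(512, 1024), torch.nn.ReLU(), torch.nn.Linear(1024, 10))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
model, optimizer = fabric.setup(model, optimizer)  # wraps the model for FSDP sharding

dataset = TensorDataset(torch.randn(1024, 512), torch.randint(0, 10, (1024,)))
dataloader = fabric.setup_dataloaders(DataLoader(dataset, batch_size=64))

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = F.cross_entropy(model(inputs), targets)
    fabric.backward(loss)  # Fabric handles precision casting and gradient sharding
    optimizer.step()
```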

However, the question remains whether it is possible to apply these minimal changes to fine-tuning open-source LLMs as well.


The introduction of LongChat-7B-16K and LongChat-13B-16K marks a leap forward in extended context length, pushing the boundaries up to 16K tokens. Evaluation outcomes demonstrate that LongChat-13B's long-range retrieval accuracy is up to twice as high as that of other long-context models, such as MPT-7B-storywriter (65K), MPT-30B-chat (8K), and ChatGLM2-6B (32K).

LongChat shows promise for bridging the gap between open models and proprietary long context models like Claude-100K and GPT-4-32K.

To build these models, LMSYS successfully extended the context length of Meta's LLaMA from 2,048 to 16,384 tokens. The training process can be conceptually divided into two steps:

  1. Condensing Rotary Embeddings: Here, they adapt the rotary position embedding, a type of positional embedding used in transformers, for larger sequence lengths. The position_ids > 2048 are condensed to fall within 0 to 2048, effectively reusing the pre-trained model weights. The condensation ratio, defined as the target context length divided by 2048, is 8 for the current models (see the sketch after this list).
  2. Fine-tuning on Curated Conversation Data: After condensing the embeddings, the models are fine-tuned on a curated conversation dataset, the same dataset used for training Vicuna, cleaned and truncated to no longer than 16,000 tokens. Fine-tuning is carried out using standard next-token prediction loss, with 80k and 18k conversations used for the 7B and 13B models, respectively. To save memory, PyTorch's FSDP and Flash Attention are utilized. The costs for fine-tuning the 7B and 13B models are approximated at $300 and $700, respectively, assuming the use of an A100 in the cloud at $3/hour.
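Here is a minimal, hypothetical sketch of the condensation idea for rotary embeddings; it is a simplification for illustration, not the actual LongChat patch.

```python
import torch


class CondensedRotaryEmbedding(torch.nn.Module):
    """Sketch of 'condensed' rotary position embeddings (RoPE).

    Positions beyond the pre-trained 2048-token window are squeezed back into
    that range by dividing the position index by a condensation ratio
    (8 for a 16K context), so the pre-trained weights can be reused.
    """

    def __init__(self, dim: int, ratio: int = 8, max_positions: int = 2048, base: float = 10000.0):
        super().__init__()
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
        # Extended positions, condensed back into the original 0..2048 range.
        t = torch.arange(max_positions * ratio).float() / ratio
        freqs = torch.outer(t, inv_freq)
        emb = torch.cat((freqs, freqs), dim=-1)
        self.register_buffer("cos_cached", emb.cos())
        self.register_buffer("sin_cached", emb.sin())

    def forward(self, seq_len: int):
        # Return cos/sin tables for the first `seq_len` (condensed) positions.
        return self.cos_cached[:seq_len], self.sin_cached[:seq_len]
```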

However, an accuracy drop is observed when the context length nears 16,000 tokens for LongChat-13B-16K during the fine-grained line retrieval task, presumably due to the near-maximal fine-tuning length. The authors suggest that training on longer documents could alleviate this issue and are planning to address this in the near future.

The team also developed an evaluation toolkit, LongEval, to assess long-context capabilities. Their observations indicated reliable performance up to 12K tokens, with only a slight degradation thereafter.


OpenLLaMA 13B has been released, offering a competitive alternative to MetaAI's original LLaMA. With an average benchmark score of 0.57, OpenLLaMA 13B compares closely with Meta's LLaMA 13B, making it a near drop-in replacement suitable for a wide range of commercial applications.

The team has released pre-trained models in several sizes (3B, 7B, and 13B parameters), trained on 1 trillion tokens, under the Apache 2.0 license, along with evaluation metrics comparing OpenLLaMA to the original LLaMA. The weights are provided in two formats: an EasyLM format for use within the EasyLM framework, and a PyTorch format compatible with the Hugging Face Transformers library.
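A minimal sketch of loading the PyTorch weights with the Transformers library might look like the following; the Hub model id is assumed from the OpenLLaMA release, and the prompt and generation settings are illustrative.

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_path = "openlm-research/open_llama_13b"  # assumed Hugging Face Hub id

tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map="auto"
)

prompt = "Q: What is a large language model?\nA:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
generated = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(generated[0], skip_special_tokens=True))
```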

The models are trained on the RedPajama dataset released by Together, a reproduction of the original LLaMA dataset, encompassing over 1.2 trillion tokens. The same preprocessing steps and training hyperparameters from the original LLaMA paper are employed, with the only difference being the dataset used. Training is executed on cloud TPU-v4s using the JAX-based EasyLM training pipeline developed specifically for training and fine-tuning large language models.

The approach integrates regular data parallelism with fully sharded data parallelism (also referred to as ZeRO stage 3), striking a balance between training throughput and memory usage. This setup achieves a throughput exceeding 2200 tokens/second/TPU-v4 chip for the 7B model.

Alongside this, the team has published evaluation results obtained with the lm-evaluation-harness, giving users additional material for understanding and making the most of the models.


MosaicML has introduced MPT-30B, a new open-source Large Language Model (LLM) with 30B parameters, released under the Apache 2.0 license, which allows for commercial use.

The release features two finely-tuned variants:

  1. MPT-30B-Instruct
  2. MPT-30B-Chat

MPT-30B-Instruct excels in single-turn instruction following, while MPT-30B-Chat performs best in multi-turn conversations. These variants build upon the strong foundation of the MPT-30B model.

The model was trained with an 8K token context window. It supports even longer contexts through ALiBi, and boasts efficient inference and training performance via FlashAttention. The MPT-30B family also displays remarkable coding abilities, thanks to its diverse pre-training data mixture.

Notably, this is the first LLM trained entirely on NVIDIA H100 GPUs. The size of MPT-30B was carefully selected to facilitate easy deployment on a single GPU: either one A100-80GB using 16-bit precision, or one A100-40GB using 8-bit precision. This is a notable contrast to the Falcon-40B model.
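For illustration, here is a hedged sketch of loading the model on a single GPU with the Transformers library; the Hub id and the 8-bit option are assumptions based on the release description rather than MosaicML's official snippet.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mosaicml/mpt-30b-chat"  # assumed Hub id for the chat variant

tokenizer = AutoTokenizer.from_pretrained(model_id)

# 16-bit precision targets a single 80GB GPU; passing load_in_8bit=True instead
# would target a single 40GB GPU via bitsandbytes quantization.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,  # MPT uses custom modeling code hosted on the Hub
)

inputs = tokenizer("What makes MPT-30B efficient to serve?", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```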

Note: Initial comparisons suggest that MPT-30B competes well with other open-source models such as LLaMA-30B and Falcon-40B, though additional testing is needed to substantiate this claim.


MetaAI has unveiled Voicebox, a cutting-edge generative AI model for speech that delivers state-of-the-art performance across a variety of tasks.

Voicebox is a highly capable model that can synthesize speech in six languages, remove noise, edit content, transfer audio style, and more. It stands out for its ability to generalize across tasks, outperforming single-purpose AI models through in-context learning.

Unlike traditional autoregressive models, Voicebox possesses the unique capability to modify any part of a given sample, not just the end of a clip. It sets new standards by improving on word error rates and audio similarity, all while delivering a performance that is 20 times faster than the previous state-of-the-art.


Stability AI has announced the launch of SDXL 0.9, which offers a significant improvement in image and composition detail over its predecessor.

SDXL 0.9 is capable of generating hyper-realistic creations for a diverse range of applications, including films, television, music, and instructional videos. It also provides groundbreaking advancements for design and industrial use.

The model showcases a number of impressive capabilities, including:

  • Image-to-image prompting — inputting one image to generate variations of that image
  • Inpainting — reconstructing missing parts of an image
  • Outpainting — constructing a seamless extension of an existing image

SDXL 0.9 is designed for use on modern consumer GPUs. The system requirements include Windows 10 or 11, or a Linux operating system, 16GB RAM, and an Nvidia GeForce RTX 20-series graphics card (or equivalent) with a minimum of 8GB of VRAM.

You can access the model via ClipDrop, with an API coming soon. The research weights are now available, and an open release is planned for mid-July.
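For those experimenting with the research weights, a minimal sketch with the diffusers library might look like the following; the Hub id is an assumption for the gated SDXL 0.9 release (access requires accepting the research license), and the prompt is illustrative.

```python
import torch
from diffusers import DiffusionPipeline

# Assumed Hub id for the SDXL 0.9 research weights (gated).
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-0.9",
    torch_dtype=torch.float16,
    use_safetensors=True,
    variant="fp16",
).to("cuda")

image = pipe(prompt="a photorealistic studio portrait of a red fox, soft lighting").images[0]
image.save("fox.png")
```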

Educational Resources

Andrew Ng, in collaboration with DeepLearning.AI and AWS, has created the Generative AI with Large Language Models course, hosted on Coursera. You can audit the course for free or pay a $49 fee for a certificate upon completion.

This three-week course covers a range of topics:

  1. Week 1: Explore generative AI use cases, the project lifecycle, and the process of model pre-training
  2. Week 2: Dive into fine-tuning and evaluating large language models
  3. Week 3: Learn about reinforcement learning and applications powered by Large Language Models

New Large Language Models Courses from DeepLearning.AI:

  1. LangChain for LLM Application Development
  2. LangChain Chat with Your Data

Databricks is offering a Professional Certificate program on edX, comprising two courses. Auditing is available, and for those who want the professional certificate, each course can be purchased for $99.

Designed as a three-month certification program (1.5 months per course), the program covers the following areas:

Course 1: "Large Language Models: Application through Production"

  1. Module 1: Applications with Large Language Models (LLMs)
  2. Module 2: Embeddings, Vector Databases, and Search
  3. Module 3: Multi-stage Reasoning
  4. Module 4: Fine-tuning and Evaluating LLMs
  5. Module 5: Society and LLMs: Bias and Safety
  6. Module 6: Large Language Model Operations (LLMOps)

Course 2: "Large Language Models: Foundation Models from the Ground Up"

  1. Module 1: Transformer Architecture: Attention & Transformer Fundamentals
  2. Module 2: Inside the Transformer I: Encoder Models
  3. Module 3: Inside the Transformer II: Decoder Models
  4. Module 4: Transfer Learning & Knowledge Distillation
  5. Module 5: Future Directions of LLMs

This program offers a comprehensive exploration of LLMs, from their foundational principles to their practical applications and future directions.

Conclusion

The AI realm forges ahead at a breakneck pace, with Generative AI and Large Language Models (LLMs) taking center stage. This “Provectus AI Review” covers notable advancements like efficient LLM fine-tuning with Amazon SageMaker Studio notebooks and QLoRA, the performance-boosting vLLM inference library, and the time-saving capabilities of the open-source Fabric library.

We've also seen game-changers like the extended-context models LongChat-7B-16K and LongChat-13B-16K, OpenLLaMA 13B rivaling MetaAI's LLaMA, and the release of MosaicML's new open-source LLM, MPT-30B. Not to forget, MetaAI's speech-generating Voicebox and Stability AI's SDXL 0.9 image model are breaking new ground.

As we keep pace with AI's progress, the impact and applications of LLMs continue to amaze. Stay with us as we bring you more from this field!


Author: Marlon Cajamarca Vega — Machine Learning Engineer & AI Educator || Provectus || ML Ed Solutions


Moving Forward — Learn more about Provectus AI expertise

  1. A Comparison of Large Language Models (LLMs) in Biomedical Domain
  2. An Instruction-following GPT-J Model Based on Instructions Generated by Bloom
  3. Exploring Intelligent Search Solutions: A Comparative Analysis of Amazon Kendra Integration and Large Language Model Crawlers
  4. Provectus AI review 1 — Google I/O 2023 Overview
  5. Provectus AI review 2 — The False Promise of Imitating Proprietary LLMs
