Llama 2 Release, Hugging Face Updates, OpenAI Availability and Deprecation, and “Superalignment” Vision
Provectus AI review #4

Welcome to the latest edition of the “Provectus AI Review” series! 

Over the last couple of weeks, the rapid growth in the AI/ML niche has been nothing short of remarkable. Even more notable is the astounding progress of Generative AI and Large Language Models (LLMs) – a wave of innovation that is fundamentally reshaping our understanding of what's possible.

This issue highlights a few topics that are already shaping the future of AI, including the debut of Llama 2, the latest updates from Hugging Face, an in-depth analysis of OpenAI's deprecation policies, and the concept of Superalignment.

Meta AI’s Llama 2 Is Amazing

Meta has just unveiled Llama 2, the latest open-source Large Language Model (LLM) that raises the bar for state-of-the-art AI. Building upon its predecessor, this second iteration of Llama comes with a commercial-friendly license, offering substantial benefits for enterprise use.

Llama 2 is available in three distinct sizes — 7 billion (7B), 13 billion (13B), and 70 billion (70B) parameters. The 7B and 13B variants retain the architectural design of the original Llama and can serve as drop-in replacements for it in commercial applications.

There are a few enhancements to Llama 2 that set it apart from the original version. Notably, it has been trained on an impressive 2 trillion tokens. This version also permits commercial usage, a pivotal improvement that broadens its applicability. Special attention has been given to enhancing its capabilities for dialogue use cases with dedicated chat models. 

Llama 2 has a default context window of 4,096 tokens (which can be extended per user requirements). The 70B parameter version also incorporates grouped-query attention (GQA) to speed up inference.

In terms of accessibility, the chat models of Llama 2 can effectively utilize various tools and plugins, offering users greater versatility. 

Note: Llama 2-Chat performs on par with OpenAI's ChatGPT, illustrating the competitive potential of this new model. Both Llama 2 and its chat variant are readily accessible via Hugging Face and Meta.
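To show how readily accessible it is, here is a minimal sketch of loading the 7B chat variant with the Hugging Face Transformers library. It assumes access to the gated meta-llama/Llama-2-7b-chat-hf checkpoint has been granted and that the `accelerate` package is installed for `device_map="auto"`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Gated repository: access must first be requested and granted on the Hugging Face Hub
model_id = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "What makes Llama 2 different from the original Llama?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```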

More tokens with RoPE scaling

Large Language Models like Llama and GPT-NeoX limit the number of tokens that can be processed in a single pass. This boundary has been pushed with the introduction of RoPE scaling.

With RoPE scaling, these models can handle much longer inputs, extending well beyond their default context caps. The implementation is simple yet effective: just pass `rope_scaling` at the time of model loading, and your model is immediately able to manage extended inputs. This extended capability comes without the need for additional fine-tuning.
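A minimal sketch of what this looks like with Transformers (the `type` and `factor` keys shown here are the ones documented for the RoPE scaling integration at the time; a factor of 2.0 roughly doubles the usable context window):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # gated checkpoint; access required

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Linear RoPE scaling interpolates positions, stretching the 4,096-token window to roughly 8,192
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    rope_scaling={"type": "linear", "factor": 2.0},
)
```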

However, there is a minor caveat. While RoPE scaling does allow for expanded inputs, perplexity tends to increase as the input grows longer. This can be addressed effectively with targeted fine-tuning. As such, RoPE scaling presents an exciting new development for users who need to manage longer sequences with Llama and GPT-NeoX models.

4-bit quantization

4-bit quantization is a promising approach, offering two distinct ways to use Llama 2 with the Hugging Face platform — dynamic and static quantization.

Dynamic quantization quantizes the original weights on the fly as they are loaded. The process is relatively straightforward: passing `device_map="auto", load_in_4bit=True` into the `from_pretrained()` function activates this feature, setting you up for an immediate start.
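A minimal sketch of dynamic 4-bit loading, assuming the `bitsandbytes` and `accelerate` packages are installed and access to the gated Llama 2 checkpoints has been granted:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-hf"  # any Llama 2 checkpoint works here

tokenizer = AutoTokenizer.from_pretrained(model_id)
# Weights are quantized to 4 bits on the fly as they are loaded onto the GPU
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    load_in_4bit=True,
)
```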

Static quantization should be your choice if you are looking to boost the speed of your local Llama 2 model. TheBloke publishes pre-quantized checkpoints on the Hugging Face Hub, along with instructions for using them. In short, these checkpoints can be used with AutoGPTQ and the Hugging Face ecosystem, simplifying their use.

Check out an example script that can be adapted specifically for Llama 2.

Deployment and fine-tuning of Llama 2

The deployment and fine-tuning of Llama 2 bring a unique blend of challenges and opportunities. Here, we explore three potential scenarios that demonstrate how these processes can be effectively executed.

  1. Llama recipes: Meta's official 'llama-recipes' repository, designed as a companion to the Llama 2 model, offers comprehensive examples for efficient fine-tuning and inference of these fine-tuned models. It provides in-depth guidance and code examples for various scenarios, including single GPU fine-tuning, multi-GPU fine-tuning, LLM fine-tuning, adding custom datasets, and conducting inference.
  2. Hugging Face Transformers for Llama 2: The Hugging Face Transformers library provides an in-depth guide for deploying and using the Llama 2 model via text-generation-inference endpoints. It outlines an efficient methodology for fine-tuning Llama 2 on simple hardware, demonstrating how to adjust the 7 billion parameter version on a single NVIDIA T4 GPU, available on Google Colab. To assist in this process, a specialized script is offered, aiding users in customizing the Llama 2 model for optimized performance.
  3. QLoRA and Hugging Face Transformers on Amazon SageMaker: Using the Hugging Face Transformers library and QLoRA on Amazon SageMaker, you can efficiently fine-tune Llama 2 through Parameter-Efficient Fine-Tuning (PEFT). This starts with setting up your development environment, followed by loading and preparing your dataset to suit your specific use case. The key phase involves fine-tuning the 13 billion parameter Llama model on Amazon SageMaker with the help of QLoRA (a minimal configuration sketch follows this list). The process culminates in deploying the fine-tuned Large Language Model on Amazon SageMaker, ready to deliver powerful AI solutions in real-world applications.
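To give a sense of what the QLoRA setup in scenario 3 involves, here is a minimal sketch using `bitsandbytes` and the PEFT library. The hyperparameter values and target modules are illustrative defaults, not the exact ones used in the referenced SageMaker example.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-13b-hf"  # gated checkpoint; access required

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters are the only trainable parameters; the base weights stay frozen
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```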

Hugging Face’s New PEFT Library Is Released

Hugging Face has released an update of its PEFT library, boasting several significant enhancements. One of the key updates is the official integration of the QLoRA method, enabling the training of adapters on top of a 4-bit base model, which we partially covered above.

The new release introduces an additional PEFT task, feature extraction, and adds PeftModelForQuestionAnswering and PeftModelForFeatureExtraction classes. These inclusions facilitate support for question answering and feature extraction tasks respectively, allowing PEFT to be used for diverse tasks like semantic similarity.

“Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning” paper

A new adapter method, IA3 (Infused Adapter by Inhibiting and Amplifying Inner Activations), introduced in the T-Few paper, is also included. IA3 optimizes fine-tuning efficiency by rescaling inner activations with learned vectors, which are infused into the attention and feedforward modules of a typical transformer-based architecture. These vectors are the only parameters trained during fine-tuning, leaving the original weights untouched and keeping the count of trainable parameters small.
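A minimal sketch of configuring IA3 with PEFT; the target module names below are illustrative for a Llama-style model and vary by architecture:

```python
from transformers import AutoModelForCausalLM
from peft import IA3Config, TaskType, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative base model

# Learned scaling vectors are injected into attention and feed-forward modules;
# the base weights themselves stay frozen.
ia3_config = IA3Config(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["k_proj", "v_proj", "down_proj"],
    feedforward_modules=["down_proj"],  # must be a subset of target_modules
)
model = get_peft_model(model, ia3_config)
model.print_trainable_parameters()
```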

The release also features 'auto mapping,' a new paradigm allowing users to load PEFT models with a single line of code through the AutoPeftModelForxxx classes. Furthermore, users can now apply LoRA to any torch model and upload the adapter weights to the Hub, thanks to PEFT's enhanced flexibility.
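A minimal sketch of the auto mapping workflow; the adapter repository name is a hypothetical placeholder for your own Hub repository:

```python
from peft import AutoPeftModelForCausalLM

# A single call resolves the base model from the adapter config and attaches the adapter.
# "your-username/llama-2-7b-lora-adapter" is a placeholder, not a real repository.
model = AutoPeftModelForCausalLM.from_pretrained(
    "your-username/llama-2-7b-lora-adapter",
    device_map="auto",
)
```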

The PEFT library now also supports Stable Diffusion, and is extensible and user-friendly enough for tasks like DreamBooth fine-tuning. The community has contributed conversion scripts so that PEFT models can be used interchangeably with the Civitai/webui format. With these advancements, the PEFT library further solidifies its standing as a robust tool for fine-tuning AI models.

General Availability of GPT-4 API Is Announced

OpenAI has made its GPT-4 API generally available, while also deprecating older models in the Completions API. They also introduced a new deprecation policy that may impact those who are employing OpenAI's models in production.

Users will now be required to test the new models, retrain any fine-tuned versions, and regenerate any embeddings they are currently using. 

This shift emphasizes a considerable advantage of open-source models: the flexibility and control they offer. With open-source models, users have the freedom to continue using them indefinitely and can decide when an upgrade suits their needs best.

Claude 2 from Anthropic Is Here to Rival ChatGPT

Anthropic has introduced the second version of its LLM-powered assistant, Claude 2. The model was trained on data up to early 2023, keeping it up to date with recent contextual nuances and semantic developments.

Claude 2 expands the context window to as much as 200,000 tokens, although only 100,000 tokens are available at launch. This extended capacity allows for more nuanced, detailed, and contextually aware responses, greatly enhancing its utility across various applications.

Another noteworthy feature of Claude 2 is its ability to upload documents. This feature opens up a myriad of possibilities for users to feed structured and unstructured data directly into the model, enabling it to provide insights, summaries, or other actionable outputs based on the document contents.

In terms of API structure, Claude 2 has been designed to follow better practices. This leads to a more streamlined, user-friendly experience when integrating the model into various platforms or applications, increasing its usability and interoperability.

Claude 2 boasts advanced coding abilities. The model can understand, generate, and manipulate code more efficiently, making it a potent tool for software development tasks, code review, and even tutoring in various programming languages.

To top it all off, Anthropic is offering a free beta version of Claude 2 for public use. This is a fantastic opportunity for developers, researchers, and AI enthusiasts to test the model's capabilities and see firsthand the advancements made in this iteration.

New Keras for TensorFlow, JAX, and PyTorch

François Chollet, a leading artificial intelligence researcher, has introduced a pivotal update: Keras 3.0, released in preview as Keras Core. This update positions Keras as a frontend for TensorFlow, JAX, and PyTorch, marking the first time all three frameworks have been unified under a single frontend.

This unification implies that developers can utilize either the functional API or the sequential approach across all backends, opening up an unprecedented level of flexibility in the design and execution of machine learning models.

One of the highlights of Keras 3.0 is its portability. For example, you can now train a model in PyTorch and subsequently run it in JAX, all without making any modifications to your code. 

Although there are a few considerations to note, this level of interoperability extends even to different hardware configurations. For instance, a model trained on a GPU using PyTorch should, in theory, be runnable on a TPU using JAX.
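A minimal sketch of switching backends with the Keras Core preview (the `KERAS_BACKEND` environment variable must be set before the library is imported; `keras_core` was the package name of the preview release):

```python
import os

# Choose "tensorflow", "jax", or "torch" before importing the library
os.environ["KERAS_BACKEND"] = "jax"

import keras_core as keras

# The same model definition runs unchanged on any of the three backends
model = keras.Sequential([
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```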

Keras 3.0 also excels in minimizing technical debt. Chollet pointed out that the codebase of this latest release is only 30% the size of Keras 2.0, as the project has been entirely rewritten from scratch. This re-engineering effort not only eliminates legacy issues but also ensures the codebase is more maintainable and future-proof, contributing to a more robust and efficient ecosystem for deep learning development.

Extending Context Length with RoPE Scaling

As previously mentioned, the Hugging Face Transformers library now includes support for Rotary Position Embedding (RoPE) scaling, an approach that extends the context length of large language models such as Llama, GPT-NeoX, and Falcon. Users can enable it by adjusting a single parameter in the model's configuration.

The RoPE scaling technique, initially proposed by u/emozilla on Reddit, dynamically interpolates rotary position embeddings to represent longer sequences without severely degrading performance. It enables scaling out the context length of models without the need for fine-tuning.

Although the technique provides satisfactory results out of the box, it has been found that performance can be further enhanced by additional fine-tuning. With the integration of RoPE scaling, businesses now have the opportunity to effortlessly expand open-source Large Language Models to context lengths tailored for their specific use cases, offering a significant stride in enhancing the adaptability of such models.

Lost in the Middle: How Language Models Use Long Contexts

A team of Stanford researchers recently embarked on a study to explore how Large Language Models (LLMs) make use of context, with a special focus on longer contexts exceeding 32,000 tokens. The goal was to gain a better understanding of how these models handle information in large token windows.

In the course of their study, the researchers tested both open-source models, such as MPT-30B-Instruct and LongChat-13B (16K), and closed-source models, including OpenAI's GPT-3.5-Turbo and Anthropic's Claude 1.3. They used multi-document question-answering setups where the context encompassed several retrieved documents along with a single correct answer, the position of which was randomly shuffled.

The team also employed key-value pair retrieval methods to analyze if performance was affected by longer contexts. This approach provided insights into how LLMs operate with varying levels of context length, and the results could potentially inform future development and tuning of such models.

Major takeaways from the study:

  • Performance is best when the relevant information appears at the beginning of the context
  • Performance decreases as context length increases
  • Retrieving too many documents harms performance
  • Improving the retrieval and prompt-creation step with cross-encoders (reranking) could boost performance by up to 20% (see the sketch after this list)
  • Combining retrieval with reranking should yield the best performance in RAG for question answering
  • Extended-context models (GPT-3.5-Turbo vs. GPT-3.5-Turbo (16K)) are no better when the prompt already fits the original context
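To illustrate the cross-encoder reranking idea from the list above, here is a minimal sketch using the sentence-transformers library. The model name is one publicly available MS MARCO cross-encoder; any reranking model would work similarly.

```python
from sentence_transformers import CrossEncoder

# Score each retrieved document against the query, then keep only the best ones
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How do language models use long contexts?"
retrieved_docs = [
    "Document about transformer attention patterns...",
    "Unrelated document about database indexing...",
    "Document about position bias in retrieval-augmented QA...",
]

scores = reranker.predict([(query, doc) for doc in retrieved_docs])
ranked = sorted(zip(scores, retrieved_docs), reverse=True)
top_docs = [doc for _, doc in ranked[:2]]  # pass only the top-ranked documents to the LLM
print(top_docs)
```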

OpenAI Introduces Superalignment

OpenAI has recently announced an ambitious new initiative called "Superalignment." The goal of the project is to drive the scientific and technical breakthroughs needed to steer and control AI systems that are significantly more capable than today's. The initiative is set to run over the next four years and is co-led by Ilya Sutskever and Jan Leike.

OpenAI is dedicating 20% of its current computational resources to this effort. The ultimate objective is the creation of an automated alignment researcher operating at a human level of expertise. By harnessing vast computational power, OpenAI aims to iteratively improve the alignment of superintelligent AI systems, scaling the effort in a way that not only amplifies AI capabilities but also ensures they remain under effective human control.

OpenLLM

OpenLLM is a tool designed to simplify the deployment and use of any open-source Large Language Model (LLM). Whether in the cloud or on-premises, OpenLLM's objective is to enable the development of AI applications with ease.

The platform offers extensive support for various open-source LLMs and model runtimes, such as StableLM, Falcon, Dolly, Flan-T5, ChatGLM, StarCoder, and more. It streamlines the process of serving LLMs over a RESTful API or gRPC, enabling queries via a user-friendly WebUI, CLI, Python/JavaScript clients, or any HTTP client.

OpenLLM's first-class support for LangChain, BentoML, and Hugging Face allows users to effortlessly create Generative AI applications by composing LLMs with other models and services.

With OpenLLM, users can also generate Docker Images for their LLM server, or deploy as a serverless endpoint via BentoCloud, thereby enhancing the portability and accessibility of their AI services.

GPT Prompt Engineer

GPT Prompt Engineer is a tool designed to harness the capabilities of GPT-4 and GPT-3.5-Turbo to automatically identify the most effective prompts for a given task. This is accomplished by feeding a task description and test cases into the system. In response, it generates an extensive selection of prompts, each of which is then evaluated based on the provided test cases. 

To rank the prompts according to their effectiveness, GPT Prompt Engineer uses an Elo rating system. Each prompt starts with a rating of 1200; as the prompts are tested against each other, this rating is adjusted according to their relative performance. This approach offers users a straightforward way to identify the most effective prompts for their specific task.
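For reference, here is a minimal sketch of the Elo update behind such a ranking. The K-factor of 32 is a common default; the exact constants in the gpt-prompt-engineer repository may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that prompt A beats prompt B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both prompts' new ratings after one head-to-head comparison."""
    ea = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - ea)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - ea))
    return new_a, new_b

# Both prompts start at 1200; prompt A wins this test case
print(update_elo(1200, 1200, a_won=True))  # -> (1216.0, 1184.0)
```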

Further enhancing its capabilities, GPT Prompt Engineer provides an optional feature that lets users log configurations such as temperature and max tokens, system and user prompts, test cases, and the resulting Elo ratings. These logs can be accessed on the Weights & Biases platform for future reference and analysis.

GPT Researcher

GPT Researcher is an autonomous agent that conducts extensive online research on a given task. This robust tool is capable of generating detailed, objective, and fact-based research reports, using a design inspired by AutoGPT and the recent Plan-and-Solve paper.

To better accommodate user needs, the GPT Researcher allows customization in directing focus towards pertinent resources, crafting outlines, and synthesizing lessons from the acquired data. 

GPT Researcher addresses key issues of speed and determinism that often affect AI research agents. By employing parallelized operations, as opposed to running each step synchronously, it offers markedly more stable performance along with a substantial increase in speed.
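To illustrate the design choice (not GPT Researcher's actual code), here is a minimal sketch of running independent research queries concurrently with asyncio; `fetch_and_summarize` is a hypothetical coroutine standing in for a single web-research step:

```python
import asyncio

async def fetch_and_summarize(query: str) -> str:
    # Hypothetical stand-in for fetching sources and summarizing them with an LLM
    await asyncio.sleep(1)  # simulate network / model latency
    return f"summary for: {query}"

async def research(queries: list[str]) -> list[str]:
    # All queries run concurrently, so total latency is roughly that of the slowest one
    return await asyncio.gather(*(fetch_and_summarize(q) for q in queries))

summaries = asyncio.run(research([
    "state of open-source LLMs",
    "RoPE scaling for long contexts",
    "4-bit quantization trade-offs",
]))
print(summaries)
```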

The goal of the GPT Researcher is to equip individuals and organizations with precise, impartial, and factual information. This is achieved by effectively leveraging the power of artificial intelligence, ultimately contributing to more informed decision-making processes.


In this edition of the “Provectus AI Review,” we covered the latest releases, updates, instruments, and trends in AI. 

From our analysis of Llama 2 and exploration of Hugging Face, to understanding OpenAI's new policies and the fascinating concept of Superalignment, we hope we have provided a rich perspective of the AI landscape. 

The Provectus team will continue to provide thorough and up-to-date reviews of the latest in AI, keeping you informed about new transformative developments molding the world of AI.


Author: Marlon Cajamarca Vega — Machine Learning Engineer & AI Educator || Provectus || ML Ed Solutions


Moving Forward — Learn more about Provectus AI expertise

  1. A Comparison of Large Language Models (LLMs) in Biomedical Domain
  2. An Instruction-following GPT-J Model Based on Instructions Generated by Bloom
  3. Exploring Intelligent Search Solutions: A Comparative Analysis of Amazon Kendra Integration and Large Language Model Crawlers
  4. Provectus AI review 1 — Google I/O 2023 Overview
  5. Provectus AI review 2 — The False Promise of Imitating Proprietary LLMs
  6. Provectus AI review 3 — Progress in Gen AI and Open-Source LLMs, New Product Launches, and Educational Resources

#artificialintelligence #machinelearning #generativeai #aiadoption #aitransformation #ainews #aiupdates
