Hugging Face on AWS

Train and deploy Hugging Face models in minutes with Amazon SageMaker, AWS Trainium, and AWS Inferentia

Overview

With Hugging Face on AWS, you can access, evaluate, customize, and deploy hundreds of publicly available foundation models (FMs) through Amazon SageMaker on NVIDIA GPUs, as well as on the purpose-built AI chips AWS Trainium and AWS Inferentia, in a matter of clicks. These easy-to-use flows, supported for the most popular FMs on the Hugging Face model hub, let you further optimize the performance of your models for your specific use cases while significantly lowering costs. Code snippets for SageMaker are available on every model page on the model hub under the Train and Deploy dropdown menus.

Behind the scenes, these experiences are built on top of the Hugging Face AWS Deep Learning Containers (DLCs), which provide a fully managed experience for building, training, and deploying state-of-the-art FMs with Amazon SageMaker. The DLCs remove the need to package dependencies and optimize your ML workload for the target hardware. For example, AWS and Hugging Face collaborate on the open-source Optimum Neuron library, which is packaged in the DLCs built for AWS AI chips to deliver price-performance benefits with minimal overhead.
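To make this concrete, here is a minimal sketch of deploying a Hub model to a real-time SageMaker endpoint with the SageMaker Python SDK and the Hugging Face DLC. The model ID, container versions, and instance type are illustrative assumptions; pick a supported version combination for your account and region.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# IAM role with SageMaker permissions (resolved from the running session here).
role = sagemaker.get_execution_role()

# Point the Hugging Face DLC at a model on the Hub via environment variables.
hub = {
    "HF_MODEL_ID": "distilbert-base-uncased-finetuned-sst-2-english",  # illustrative model
    "HF_TASK": "text-classification",
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=role,
    transformers_version="4.37",  # versions are assumptions; use a supported combination
    pytorch_version="2.1",
    py_version="py310",
)

# Deploy to a real-time endpoint on a GPU instance.
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.xlarge",
)

print(predictor.predict({"inputs": "Deploying Hugging Face models on SageMaker is straightforward."}))
```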

Benefits

Hugging Face offers a wide array of pre-trained FMs such as Meta Llama 3, Mistral, Falcon 2, and Starcoder that you can securely access and deploy via Amazon SageMaker JumpStart on AWS Trainium, AWS Inferentia, and NVIDIA GPUs with just a few clicks. SageMaker also provides enhanced security by allowing you to use your virtual private cloud (VPC) and deploy FMs in network isolation.
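As a sketch, a JumpStart-listed model can also be deployed programmatically with the SageMaker Python SDK; the model_id below is an illustrative guess, so look up the exact identifier in the JumpStart catalog.

```python
from sagemaker.jumpstart.model import JumpStartModel

# Illustrative JumpStart identifier; check the JumpStart catalog for the exact ID.
model = JumpStartModel(model_id="meta-textgeneration-llama-3-8b-instruct")

# JumpStart resolves a suitable container and a default instance type for the model.
# Gated models (such as Meta Llama 3) require accepting the end-user license.
predictor = model.deploy(accept_eula=True)

print(predictor.predict({"inputs": "What is retrieval augmented generation?"}))
```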
Get high performance with the broadest set of accelerated EC2 instances and support for popular frameworks such as PyTorch, TensorFlow, and JAX. AWS Trainium can help you lower training costs by up to 50% and AWS Inferentia2 can lower inference costs by up to 40% over comparable EC2 instances.
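For example, the Optimum Neuron library mentioned above lets you compile a Hub model for Inferentia2 with a drop-in model class. A minimal sketch; the model choice, batch size, sequence length, and core count are assumptions to adjust for your instance.

```python
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative model choice

# Export (compile) the model for Inferentia2; Neuron requires static shapes.
model = NeuronModelForCausalLM.from_pretrained(
    model_id,
    export=True,
    batch_size=1,          # assumed serving batch size
    sequence_length=2048,  # assumed maximum sequence length
    num_cores=2,           # NeuronCores to shard across (an inf2.xlarge has 2)
    auto_cast_type="bf16",
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("Purpose-built AI chips can lower inference costs because", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```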
Using Amazon SageMaker, you can customize publicly available models with advanced techniques to improve model quality for specific tasks and run production workloads at scale. You can leverage techniques such as prompt engineering and retrieval augmented generation (RAG), as well as fine-tuning methods including parameter-efficient fine-tuning (PEFT), low-rank adaptation (LoRA), reinforcement learning from human feedback (RLHF), and supervised fine-tuning (SFT).
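To illustrate one of these methods, here is a minimal LoRA sketch using the open-source peft library, which attaches small trainable adapters to a frozen base model. The base model and target modules are illustrative and model-specific.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Illustrative base model; swap in the FM you are customizing.
base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

# LoRA trains small low-rank adapter matrices instead of all model weights.
config = LoraConfig(
    r=16,                                 # rank of the adapter matrices
    lora_alpha=32,                        # scaling factor applied to the adapters
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (model-specific)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically well under 1% of the base model's weights
```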
Take advantage of Amazon SageMaker’s purpose-built tools for every step of the FM development lifecycle. With Amazon SageMaker, you can evaluate, deeply customize, and deploy models with optimized performance, latency, and cost. You can deploy FMs for real-time or asynchronous inference, and use multi-model endpoints and other advanced deployment techniques for full control over cost and performance. Hugging Face Text Generation Inference (TGI), the advanced serving stack for deploying and serving large language models (LLMs), supports NVIDIA GPUs as well as Inferentia2 on SageMaker, so you can optimize for higher throughput and lower latency while reducing costs.
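Here is a sketch of serving an LLM with the TGI container on SageMaker, using the SDK helper that resolves the TGI image for your region. The model, container version, environment settings, and instance type are illustrative assumptions.

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

role = sagemaker.get_execution_role()

# Resolve the Hugging Face TGI (LLM) container image for the current region.
image_uri = get_huggingface_llm_image_uri("huggingface", version="2.0.2")  # version is an assumption

llm_model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "meta-llama/Meta-Llama-3-8B-Instruct",  # gated; requires a Hub token
        "SM_NUM_GPUS": "1",            # tensor-parallel degree = GPUs on the instance
        "MAX_INPUT_LENGTH": "4096",
        "MAX_TOTAL_TOKENS": "8192",
    },
)

llm = llm_model.deploy(initial_instance_count=1, instance_type="ml.g5.2xlarge")

print(llm.predict({
    "inputs": "Explain multi-model endpoints in one sentence.",
    "parameters": {"max_new_tokens": 64},
}))
```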

Use cases

Content summarization

Produce concise summaries of articles, blog posts, and documents to identify the most important information, highlight key takeaways, and more quickly distill information. Hugging Face provides a variety of models for content summarization, including Meta Llama 3.
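As a sketch, you can try a summarization model from the Hub locally with the transformers pipeline; the checkpoint below is an illustrative choice.

```python
from transformers import pipeline

# Illustrative summarization checkpoint from the Hub.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "Hugging Face and AWS collaborate so that foundation models can be trained and "
    "deployed on Amazon SageMaker, AWS Trainium, and AWS Inferentia. Deep Learning "
    "Containers package the required dependencies and optimize workloads for the "
    "target hardware, lowering both engineering overhead and cost."
)

summary = summarizer(article, max_length=60, min_length=15, do_sample=False)
print(summary[0]["summary_text"])
```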

Chat support or virtual assistants

Streamline customer self-service processes and reduce operational costs by automating responses for customer service queries through generative AI-powered chat support and virtual assistants. Hugging Face provides models that can be used for chat support or virtual assistants, including instruction-tuned Meta Llama 3 and Falcon 2 models.
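A minimal sketch of a single chat turn with an instruction-tuned model; the model is illustrative, and Meta Llama 3 weights are gated behind a license acceptance on the Hub. Recent transformers versions let the text-generation pipeline consume chat messages directly and apply the model's chat template.

```python
from transformers import pipeline

# Illustrative instruction-tuned model; gated weights require Hub access approval.
chat = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a concise customer-support assistant."},
    {"role": "user", "content": "How do I reset my password?"},
]

# The pipeline applies the model's chat template and appends the assistant's reply.
reply = chat(messages, max_new_tokens=128)
print(reply[0]["generated_text"][-1]["content"])
```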

Content generation

Create personalized, engaging, and high-quality content, such as short stories, essays, blogs, social media posts, images, and web page copy. Hugging Face provides models for content generation, including Mistral.
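As a sketch, sampling-based generation with an open-weight model such as Mistral; the model ID and sampling parameters are illustrative.

```python
from transformers import pipeline

# Illustrative open-weight model for creative text generation.
generator = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2")

prompt = "Write a two-sentence product blurb for a reusable coffee cup."

# Sampling (rather than greedy decoding) gives more varied, creative output.
out = generator(prompt, max_new_tokens=80, do_sample=True, temperature=0.8, top_p=0.9)
print(out[0]["generated_text"])
```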

Code generation

Accelerate application development with code suggestions. Hugging Face provides models that can be used for code generation, including StarCoder.
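For example, a StarCoder-family model can complete a function from its signature and docstring; the checkpoint below is an illustrative choice.

```python
from transformers import pipeline

# Illustrative code model from the StarCoder family.
coder = pipeline("text-generation", model="bigcode/starcoder2-3b")

prompt = 'def fibonacci(n: int) -> int:\n    """Return the n-th Fibonacci number."""\n'
completion = coder(prompt, max_new_tokens=64)
print(completion[0]["generated_text"])
```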

Document vectorization

By vectorizing documents with embedding models, you unlock powerful capabilities for information retrieval, question answering, semantic search, contextual recommendations, and document clustering. These applications enhance the way users interact with information, making it easier to discover, explore, and leverage relevant knowledge from large document collections.
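A minimal sketch of document vectorization and semantic search with the sentence-transformers library; the embedding model is an illustrative lightweight choice.

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative lightweight embedding model from the Hub.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

docs = [
    "SageMaker JumpStart lets you deploy foundation models in a few clicks.",
    "AWS Inferentia2 is a purpose-built accelerator for deep learning inference.",
]
query = "Which AWS chip is designed for inference?"

# Encode documents and query into normalized embedding vectors.
doc_embeddings = model.encode(docs, normalize_embeddings=True)
query_embedding = model.encode(query, normalize_embeddings=True)

# Cosine similarity ranks documents by semantic relevance to the query.
scores = util.cos_sim(query_embedding, doc_embeddings)
print(scores)
```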

Videos

Deploying Hugging Face models with Amazon SageMaker and AWS Inferentia2
SageMaker JumpStart: deploy Hugging Face models in minutes!
Deep Dive: Hugging Face models on AWS AI Accelerators