Deploying Llama3 with Inference Endpoints and AWS Inferentia2
In this video, I walk you through the simple process of deploying a Meta Llama 3 8B model with Hugging Face Inference Endpoints and the AWS Inferentia2 accelerator.
I use the latest version of the Hugging Face Text Generation Inference container (TGI 2.0), and show you how to run streaming inference with the OpenAI client library. I also discuss Inferentia2 benchmarks.
