Deploying Llama3 with Inference Endpoints and AWS Inferentia2
In this video, I walk you through the simple process of deploying a Meta Llama 3 8B model with Hugging Face Inference Endpoints and the AWS Inferentia2 accelerator.
I use the latest version of the Hugging Face Text Generation Inference container (TGI 2.0), and show you how to run streaming inference with the OpenAI client library. I also discuss Inferentia2 benchmarks.
