This repository shows how to compile Foundation Models (FMs) such as `Meta-Llama-3-8B-Instruct`, available on the Hugging Face model hub, for Neuron cores using Neuron SDK 2.18.1. The compilation process depends on the value of the environment variable `NEURON_RT_NUM_CORES`.
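For example, you can set this variable in your shell before compiling; the value below is illustrative and matches the `num_neuron_cores` value used in the deployment command later in this README.

```bash
# Illustrative value; match it to the Neuron cores available on your
# instance and to the num_neuron_cores parameter used later in this README.
export NEURON_RT_NUM_CORES=8
```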
- The Neuron SDK requires that you compile the model on an Inferentia instance, so this code needs to be run on an `Inf2` EC2 instance. The `Meta-Llama-3-8B-Instruct` model was compiled on an `inf2.24xlarge` instance.
- Create an `Inf2` based EC2 instance.
    - Use the `Hugging Face Neuron Deep Learning AMI (Ubuntu 22.04)` AMI for your instance.
    - Use `inf2.24xlarge` or `trn1.32xlarge` as the instance type.
    - Have the `AmazonSageMakerFullAccess` policy assigned to the IAM role associated with your EC2 instance, and add the following trust relationship to the IAM role (a sketch of applying it via the AWS CLI follows the JSON):

      ```json
      {
        "Effect": "Allow",
        "Principal": {
          "Service": "sagemaker.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }
      ```
- You need a valid Hugging Face token to download gated models from the Hugging Face model hub.
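  For example, you can export the token and log in with the Hugging Face CLI (installed with the `huggingface_hub` package; the token value is a placeholder for your own):

  ```bash
  # Placeholder; create a token at https://huggingface.co/settings/tokens
  export HF_TOKEN=<your-hf-token>
  huggingface-cli login --token $HF_TOKEN
  ```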
- It is best to use `VSCode` to connect to your EC2 instance, as we will be running the code from a `bash` shell.
- Download and install Conda on your EC2 VM, for example as shown below.
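  A minimal sketch using the Miniconda installer (the URL is the standard Linux x86_64 installer published by Anaconda; adjust if your setup differs):

  ```bash
  # Download and run the Miniconda installer in batch mode
  wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
  bash Miniconda3-latest-Linux-x86_64.sh -b
  # Make conda available in the current shell
  source ~/miniconda3/bin/activate
  ```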
- Clone this repo on the EC2 VM.

  ```bash
  git clone https://github.com/aarora79/compile-llm-for-aws-silicon.git
  ```
- Create a new conda environment for `Python 3.10` and install the packages listed in `requirements.txt`.

  ```bash
  conda create --name awschips_py310 -y python=3.10 ipykernel
  source activate awschips_py310; pip install -r requirements.txt
  ```
- Change directory to the code repo directory:
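  ```bash
  # Directory name comes from the clone URL above
  cd compile-llm-for-aws-silicon
  ```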
- Run the `download_compile_deploy.sh` script using the command shown after this list. The script does the following:
    - Download the model from Hugging Face.
    - Compile the model for Neuron.
    - Upload the model files to S3.
    - Create a `settings.properties` file that refers to the model in S3, and package it into a `model.tar.gz`.
    - Deploy the model on a SageMaker endpoint.
  ```bash
  # replace the model id, bucket name and role parameters as appropriate
  hf_token=<your-hf-token>
  model_id=meta-llama/Meta-Llama-3-8B-Instruct
  neuron_version=2.18
  model_store=model_store
  s3_bucket="<your-s3-bucket>"
  s3_prefix=lmi
  region=us-east-1
  batch_size=4
  num_neuron_cores=8
  ml_instance_type=ml.trn1.32xlarge
  role="arn:aws:iam::<your-account-id>:role/<your-role-name>"

  ./scripts/download_compile_deploy.sh $hf_token \
    $model_id \
    $neuron_version \
    $model_store \
    $s3_bucket \
    $s3_prefix \
    $region \
    $role \
    $batch_size \
    $num_neuron_cores \
    $ml_instance_type > script.log 2>&1
  ```
- The model is now deployed. Note the endpoint name from the SageMaker console; you can use it to test inference via the SageMaker `invoke_endpoint` call as shown in `infer.py` (included in this repo), and to benchmark performance via the Bring Your Own Endpoint option in `FMBench`. A quick smoke test from the CLI is sketched below.
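  This is a hedged sketch, not the repo's canonical client: the endpoint name is a placeholder copied from the SageMaker console, and the `inputs`/`parameters` payload schema is an assumption based on common LMI container defaults; `infer.py` shows the exact request format used here.

  ```bash
  # Placeholder endpoint name; copy the real one from the SageMaker console.
  ENDPOINT_NAME=<your-endpoint-name>

  # Confirm the endpoint is InService before invoking it.
  aws sagemaker describe-endpoint --endpoint-name $ENDPOINT_NAME \
    --query EndpointStatus --region us-east-1

  # Invoke the endpoint; the JSON payload schema is an assumption (see
  # infer.py in this repo for the authoritative request format).
  aws sagemaker-runtime invoke-endpoint \
    --endpoint-name $ENDPOINT_NAME \
    --content-type application/json \
    --cli-binary-format raw-in-base64-out \
    --body '{"inputs": "What is Amazon EC2 Inf2?", "parameters": {"max_new_tokens": 64}}' \
    --region us-east-1 \
    output.json && cat output.json
  ```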
See CONTRIBUTING for more information.
This library is licensed under the MIT-0 License. See the LICENSE file.