llama2.openvino

This sample shows how to implement a Llama-based model with the OpenVINO runtime.

  • Please follow the license on Hugging Face and obtain approval from Meta before downloading the Llama checkpoints.

  • Please note that this repository is intended only for functional testing and personal study. You can quantize the model to further optimize its performance; see the sketch after this list.
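One option for that quantization step is to compress the exported IR weights with OpenVINO's NNCF. A minimal sketch, not part of this repository, assuming the IR was saved as ir_model/openvino_model.xml (the default Optimum-Intel file name):

import nncf
from openvino.runtime import Core, serialize

core = Core()
model = core.read_model("ir_model/openvino_model.xml")     # path is an assumption
compressed = nncf.compress_weights(model)                  # 8-bit weight compression by default
serialize(compressed, "ir_model_int8/openvino_model.xml")  # write the compressed IR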

System requirements

  • RAM: 128 GB or more

Install the requirements

python3 -m venv openvino_env
source openvino_env/bin/activate
pip install -r requirements.txt
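To verify that OpenVINO installed correctly, you can ask the runtime which devices it sees (a quick sanity check, not a step from the original instructions):

from openvino.runtime import Core
print(Core().available_devices)  # expect at least ['CPU']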

Deployment Method 1: OpenVINO IR pipeline, exporting the IR model with HF Optimum-Intel

1. Run Optimum-Intel OpenVINO pipeline and export the IR model

python3 export_ir.py -m 'meta-llama/Llama-2-7b-hf' -o './ir_model'
cd ir_pipeline
python3 generate_op.py -m "meta-llama/Llama-2-7b-hf" -p "what is openvino?" -d "CPU"
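For reference, the export and generation steps map roughly onto Optimum-Intel's OVModelForCausalLM API. A minimal sketch under that assumption, not a copy of the repository's scripts:

from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"
model = OVModelForCausalLM.from_pretrained(model_id, export=True)  # convert the checkpoint to OpenVINO IR
model.save_pretrained("./ir_model")

tokenizer = AutoTokenizer.from_pretrained(model_id)
inputs = tokenizer("what is openvino?", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))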

2. (Optional) Run the restructured pipeline:

python3 generate_ir.py -m "meta-llama/Llama-2-7b-hf" -p "what is openvino?" -d "CPU"
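A "restructured" pipeline of this kind typically drives the compiled IR model through the OpenVINO runtime directly. A simplified greedy-decoding sketch, assuming the model exposes only input_ids and attention_mask and returns logits; the real script additionally manages position IDs and KV-cache tensors, and the file path is an assumption:

import numpy as np
from openvino.runtime import Core
from transformers import AutoTokenizer

core = Core()
compiled = core.compile_model("./ir_model/openvino_model.xml", "CPU")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

input_ids = tokenizer("what is openvino?", return_tensors="np").input_ids
for _ in range(32):  # generate up to 32 new tokens, one per inference
    outputs = compiled({"input_ids": input_ids,
                        "attention_mask": np.ones_like(input_ids)})
    next_token = outputs[compiled.output(0)][0, -1].argmax()
    input_ids = np.concatenate([input_ids, [[next_token]]], axis=1)
print(tokenizer.decode(input_ids[0], skip_special_tokens=True))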

Deployment Method 2: ONNX pipeline, exporting the ONNX model with HF Optimum

  • Please note that the step below consumes a large amount of memory; make sure your server has more than 256 GB of RAM before running it.

1. Export the ONNX model with Hugging Face Optimum and convert it to OpenVINO IR:

cd onnx_pipeline
optimum-cli export onnx --model meta-llama/Llama-2-7b-hf ./onnx_model/
mkdir ir_model
mo -m ./onnx_model/decoder_model_merged.onnx -o ./ir_model/ --compress_to_fp16
# clean up (optional)
rm ./onnx_model/ -rf
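Once conversion finishes, the merged decoder can be inspected and compiled with the OpenVINO runtime. A minimal check (generate_onnx.py handles the full input/output plumbing):

from openvino.runtime import Core

core = Core()
model = core.read_model("./ir_model/decoder_model_merged.xml")
print([inp.any_name for inp in model.inputs])  # expected inputs: token IDs, mask, KV-cache tensors
compiled = core.compile_model(model, "CPU")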

2. Run the restructured pipeline:

python3 generate_onnx.py -m "meta-llama/Llama-2-7b-hf" -p "what is openvino?" -d "CPU"

Interactive demo

1. Run the interactive Q&A demo with Gradio:

cd demo
python3 qa_gradio.py -m "meta-llama/Llama-2-7b-hf" 
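In outline, a Gradio Q&A app like qa_gradio.py wraps the generation call in a single-function interface. A minimal standalone sketch with a stubbed model call (the real demo invokes the OpenVINO pipeline instead):

import gradio as gr

def answer(question: str) -> str:
    # Stub: the real demo calls the OpenVINO Llama pipeline here.
    return f"[model reply to] {question}"

gr.Interface(fn=answer, inputs="text", outputs="text", title="Llama-2 Q&A").launch()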

2. Or run the chatbot demo with Streamlit:

python3 export_ir.py -m 'meta-llama/Llama-2-7b-chat-hf' -o './ir_model_chat'
cd demo
streamlit run chat_streamlit.py
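The Streamlit chatbot follows the usual chat-app pattern: keep the conversation history in session state and render each turn. A minimal standalone sketch with a stubbed model call, assuming a recent Streamlit with the chat elements:

import streamlit as st

st.title("Llama-2 chatbot")
if "history" not in st.session_state:
    st.session_state.history = []

for role, text in st.session_state.history:      # replay earlier turns
    st.chat_message(role).write(text)

if prompt := st.chat_input("Ask something"):
    st.session_state.history.append(("user", prompt))
    st.chat_message("user").write(prompt)
    reply = f"[model reply to] {prompt}"         # stub for the OpenVINO pipeline call
    st.session_state.history.append(("assistant", reply))
    st.chat_message("assistant").write(reply)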
