This sample shows how to implement a Llama-based model with the OpenVINO runtime.
- Please follow the license on Hugging Face and obtain approval from Meta before downloading the Llama checkpoints; refer to the Hugging Face model page for more information.
- Please note that this repository is intended only for functional testing and personal study. You can also quantize the model to further optimize its performance (a weight-compression sketch is shown below).
| | Description |
|---|---|
| RAM | 128 GB+ |
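For the optional quantization mentioned above, the weights of an exported IR model can be compressed with NNCF. The sketch below is only an assumption about how this could be done; the input path `./ir_model/openvino_model.xml` and the output directory are placeholders that depend on your export step.

```python
# Hedged weight-compression sketch; paths are assumptions, not part of this repo.
import os
from openvino.runtime import Core, serialize
import nncf

core = Core()
model = core.read_model("./ir_model/openvino_model.xml")  # IR exported earlier

# compress_weights applies INT8 weight compression by default;
# see the NNCF documentation for other modes.
compressed = nncf.compress_weights(model)

os.makedirs("./ir_model_int8", exist_ok=True)
serialize(compressed, "./ir_model_int8/openvino_model.xml")
```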
python3 -m venv openvino_env
source openvino_env/bin/activate
pip install -r requirements.txt
1. Run the Optimum-Intel OpenVINO pipeline and export the IR model:
python3 export_ir.py -m 'meta-llama/Llama-2-7b-hf' -o './ir_model'
cd ir_pipeline
python3 generate_op.py -m "meta-llama/Llama-2-7b-hf" -p "what is openvino?" -d "CPU"
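The two scripts above are assumed to wrap the standard Optimum-Intel API; a rough equivalent of the export and generation steps is sketched below (the scripts' actual internals and arguments may differ).

```python
# Rough sketch of IR export + generation with Optimum-Intel
# (an assumption about what export_ir.py / generate_op.py do, not their actual code).
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

model_id = "meta-llama/Llama-2-7b-hf"

# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly.
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save the IR (openvino_model.xml/.bin) and tokenizer for later reuse.
model.save_pretrained("./ir_model")
tokenizer.save_pretrained("./ir_model")

inputs = tokenizer("what is openvino?", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```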
2. (Optional) Run the restructured pipeline:
python3 generate_ir.py -m "meta-llama/Llama-2-7b-hf" -p "what is openvino?" -d "CPU"
- Please note that the ONNX export step below consumes a large amount of memory; make sure your server has more than 256 GB of RAM before running it.
1. Export the ONNX model with Hugging Face Optimum and convert it to OpenVINO IR:
cd onnx_pipeline
optimum-cli export onnx --model meta-llama/Llama-2-7b-hf ./onnx_model/
mkdir ir_model
mo -m ./onnx_model/decoder_model_merged.onnx -o ./ir_model/ --compress_to_fp16
# cleanup (optional)
rm -rf ./onnx_model/
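Before running the pipeline, you can sanity-check the converted IR by loading and compiling it with the OpenVINO runtime; the small sketch below assumes the `.xml` name produced by the `mo` step above.

```python
# Quick check that the converted IR loads and compiles.
from openvino.runtime import Core

core = Core()
model = core.read_model("./ir_model/decoder_model_merged.xml")
compiled = core.compile_model(model, "CPU")

# Inspect the decoder's expected inputs (input_ids, attention_mask, past key/values, ...).
for inp in compiled.inputs:
    print(inp.get_any_name(), inp.get_partial_shape())
```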
2. Run the restructured pipeline:
python3 generate_onnx.py -m "meta-llama/Llama-2-7b-hf" -p "what is openvino?" -d "CPU"
1. Run the interactive Q&A demo with Gradio:
cd demo
python3 qa_gradio.py -m "meta-llama/Llama-2-7b-hf"
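`qa_gradio.py` is assumed to wrap text generation in a simple Gradio interface; below is a minimal sketch of that idea, reusing the IR exported earlier (the directory and generation options are assumptions, not the demo's actual code).

```python
# Hypothetical Gradio Q&A sketch; not a copy of qa_gradio.py.
import gradio as gr
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

model_dir = "./ir_model"  # assumes the IR and tokenizer were saved here earlier
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = OVModelForCausalLM.from_pretrained(model_dir)

def answer(question: str) -> str:
    inputs = tokenizer(question, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

gr.Interface(fn=answer, inputs="text", outputs="text",
             title="Llama-2 Q&A on OpenVINO").launch()
```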
2. Or run the chatbot demo with Streamlit:
python3 export_ir.py -m 'meta-llama/Llama-2-7b-chat-hf' -o './ir_model_chat'
cd demo
streamlit run chat_streamlit.py
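`chat_streamlit.py` is assumed to implement a simple chat loop on top of the exported chat model; the sketch below is only an illustration of that idea (the model directory, prompt handling, and the assumption that the tokenizer was saved next to the IR are all placeholders).

```python
# Hypothetical Streamlit chat sketch; not a copy of chat_streamlit.py.
import streamlit as st
from transformers import AutoTokenizer
from optimum.intel import OVModelForCausalLM

MODEL_DIR = "./ir_model_chat"  # IR exported from Llama-2-7b-chat-hf above

@st.cache_resource
def load_model():
    tokenizer = AutoTokenizer.from_pretrained(MODEL_DIR)
    model = OVModelForCausalLM.from_pretrained(MODEL_DIR)
    return tokenizer, model

tokenizer, model = load_model()

if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far.
for role, text in st.session_state.messages:
    with st.chat_message(role):
        st.write(text)

if prompt := st.chat_input("Ask the chat model"):
    st.session_state.messages.append(("user", prompt))
    with st.chat_message("user"):
        st.write(prompt)

    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128)
    # Decode only the newly generated tokens.
    reply = tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)

    st.session_state.messages.append(("assistant", reply))
    with st.chat_message("assistant"):
        st.write(reply)
```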