ONNX for Phi-3 mini
Shuaib11-Github opened this issue
How do I use the ONNX model for Phi-3 mini 128K for faster inference on a local machine that only has a CPU? Can you provide the code to do it?
You can follow the tutorial and example code here for running Phi-3 mini 128K on CPU. In the tutorial, you can replace any references to Phi-3-mini-4k-instruct-onnx with Phi-3-mini-128k-instruct-onnx.
The example shows how to get the output tokens from the LLM with a specific system prompt. You can change the system prompt for your scenario, and then use the decoded output to decide which agent to pass it to.
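As a minimal sketch of wrapping a system prompt and a user prompt in the Phi-3 chat format (build_prompt is a hypothetical helper, and you should verify the special tokens against the chat template in the model card of the ONNX model you downloaded):

# Hypothetical helper for building a Phi-3 style prompt with a system message.
# Verify <|system|>, <|user|>, <|assistant|>, <|end|> against the model card.
def build_prompt(system_prompt, user_prompt):
    return (
        f"<|system|>\n{system_prompt}<|end|>\n"
        f"<|user|>\n{user_prompt}<|end|>\n"
        f"<|assistant|>\n"
    )

prompt = build_prompt(
    "You are a routing assistant. Answer concisely.",
    "your user question here",
)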
In your code, it appears that the ChatOpenAI class is providing you a high-level view of the generation loop. If you want to use Phi-3 mini in your code example, you can replace
open_api_key = os.getenv("OPENAI_API_KEY")
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_API_KEY'] = os.getenv("LANGCHAIN_API_KEY")
llm = ChatOpenAI(model_name="gpt-4-0125-preview", temperature=0, openai_api_key=open_api_key)
prompt = "your prompt here"
response = llm.invoke(prompt)
with the equivalent ONNX Runtime GenAI code
import onnxruntime_genai as og
model = og.Model("/path/to/folder/containing/onnx/model/and/genai/config/json/file")
tokenizer = og.Tokenizer(model)
prompt = "your prompt here"
input_tokens = tokenizer.encode(prompt)
params = og.GeneratorParams(model)
params.input_ids = input_tokens
params.set_search_options(temperature=0)
output_tokens = model.generate(params)
response = tokenizer.decode(output_tokens)
The ONNX Runtime GenAI code gives you much more granular control over the different steps that are happening in a generation loop compared to your ChatOpenAI class, which hides them. If you don't need the granular control, you can use the basic tokenizer.encode, model.generate, and tokenizer.decode methods to get your response from the LLM.
If you want granular control, you can change how these methods are used. For example, you can use your own tokenizer to handle converting between text and token ids, which means you can avoid using ONNX Runtime GenAI's tokenizer.encode and tokenizer.decode methods. Another example is customizing the generation loop yourself instead of using the higher-level model.generate method, as shown in the sketch below.
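As a rough sketch of such a custom loop (modeled on the ONNX Runtime GenAI streaming examples; the exact API can differ between package versions, so check the version you installed), you could drive generation token by token like this:

import onnxruntime_genai as og

model = og.Model("/path/to/folder/containing/onnx/model/and/genai/config/json/file")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

prompt = "your prompt here"
params = og.GeneratorParams(model)
params.input_ids = tokenizer.encode(prompt)
params.set_search_options(temperature=0, max_length=512)

# Drive the generation loop manually instead of calling model.generate
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    # Decode and stream each token as it is produced
    print(tokenizer_stream.decode(new_token), end="", flush=True)

This lets you stream partial output, apply your own stopping criteria, or post-process tokens before decoding.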
Only the part that you gave needs to be changed, right? And we need to download the corresponding ONNX model of Phi-3? Also, will it run on a CPU-only machine?
The answer is yes to all of your questions. The Phi-3 mini 128K ONNX models are uploaded here. Here's an example of how you can download just the CPU model with accuracy level = 4 using the Hugging Face CLI.
# Install Hugging Face CLI
$ pip install huggingface_hub[cli]
# Download just the CPU model with accuracy level = 4 to a local directory named 'phi3-mini-128k-instruct-onnx'
$ huggingface-cli download microsoft/Phi-3-mini-128k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir ./phi3-mini-128k-instruct-onnx --local-dir-use-symlinks False
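If you prefer to do the download from Python instead of the shell, a roughly equivalent call using the huggingface_hub library (installed by the command above) is:

from huggingface_hub import snapshot_download

# Download only the CPU int4 (accuracy level 4) variant into a local directory
snapshot_download(
    repo_id="microsoft/Phi-3-mini-128k-instruct-onnx",
    allow_patterns=["cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/*"],
    local_dir="./phi3-mini-128k-instruct-onnx",
)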
Can you provide a basic example of how to use these with an agentic approach?
In your Python code, you can replace
open_api_key = os.getenv("OPENAI_API_KEY")
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_API_KEY'] = os.getenv("LANGCHAIN_API_KEY")
llm = ChatOpenAI(model_name="gpt-4-0125-preview", temperature=0, openai_api_key=open_api_key)
with
import onnxruntime_genai as og
model = og.Model("./phi3-mini-128k-instruct-onnx/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/")
tokenizer = og.Tokenizer(model)
to initialize the Phi-3 mini model for CPU.
Then you can define the following function
def invoke(query):
    input_tokens = tokenizer.encode(query)
    params = og.GeneratorParams(model)
    params.input_ids = input_tokens
    params.set_search_options(temperature=0)
    output_tokens = model.generate(params)
    response = tokenizer.decode(output_tokens)
    return response
and replace any references to response = llm.invoke(query) in your code with response = invoke(query), and any references to response = llm.invoke(prompt) with response = invoke(prompt).
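As a minimal illustration of the agentic part (the agent functions and routing rule below are hypothetical placeholders, not something from your original code), you can use the decoded response from invoke to decide which agent handles a request:

def route(query):
    # Ask the local Phi-3 model which kind of agent should handle the query.
    decision = invoke(f"Classify this request as either 'search' or 'math': {query}")
    if "math" in decision.lower():
        return math_agent(query)    # hypothetical math agent
    return search_agent(query)      # hypothetical search agent

The same pattern extends to however many agents your LangChain workflow defines; the local Phi-3 model simply replaces the hosted LLM wherever a completion is needed.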