ONNX for Phi-3 mini
Shuaib11-Github opened this issue
How do I use the ONNX model for Phi-3 mini 128K for faster inference on a local machine that only has a CPU? Can you provide the code to do it?
You can follow the tutorial and example code here for running Phi-3 mini 128K on CPU. In the tutorial, you can replace any references to Phi-3-mini-4k-instruct-onnx with Phi-3-mini-128k-instruct-onnx.
The example shows how to get the output tokens from the LLM with a specific system prompt. You can change the system prompt for your scenario, and then use the decoded output to decide which agent to pass it to.
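As a minimal sketch of wrapping a system prompt and a user prompt in the Phi-3 chat format (build_prompt is a hypothetical helper, and you should verify the special tokens against the chat template in the model card of the ONNX model you downloaded):

# Hypothetical helper for building a Phi-3 style prompt with a system message.
# Verify <|system|>, <|user|>, <|assistant|>, <|end|> against the model card.
def build_prompt(system_prompt, user_prompt):
    return (
        f"<|system|>\n{system_prompt}<|end|>\n"
        f"<|user|>\n{user_prompt}<|end|>\n"
        f"<|assistant|>\n"
    )

prompt = build_prompt(
    "You are a routing assistant. Answer concisely.",
    "your user question here",
)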
In your code, it appears that the ChatOpenAI class is providing you a high-level view of the generation loop. If you want to use Phi-3 mini in your code example, you can replace
open_api_key = os.getenv("OPENAI_API_KEY")
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_API_KEY'] = os.getenv("LANGCHAIN_API_KEY")
llm = ChatOpenAI(model_name="gpt-4-0125-preview", temperature=0, openai_api_key=open_api_key)
prompt = "your prompt here"
response = llm.invoke(prompt)
with the equivalent ONNX Runtime GenAI code
import onnxruntime_genai as og
model = og.Model("/path/to/folder/containing/onnx/model/and/genai/config/json/file")
tokenizer = og.Tokenizer(model)
prompt = "your prompt here"
input_tokens = tokenizer.encode(prompt)
params = og.GeneratorParams(model)
params.input_ids = input_tokens
params.set_search_options(temperature=0)
output_tokens = model.generate(params)
response = tokenizer.decode(output_tokens)
The ONNX Runtime GenAI code gives you much more granular control over the different steps that are happening in a generation loop compared to your ChatOpenAI class, which hides them. If you don't need the granular control, you can use the basic tokenizer.encode, model.generate, and tokenizer.decode methods to get your response from the LLM.
If you want granular control, you can change how these methods are used. For example, you can use your own tokenizer to handle converting between text and token ids, which means you can avoid using ONNX Runtime GenAI's tokenizer.encode and tokenizer.decode methods. Another example is customizing the generation loop yourself instead of using the higher-level model.generate method, as shown in the sketch below.
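As a rough sketch of such a custom loop (modeled on the ONNX Runtime GenAI streaming examples; the exact API can differ between package versions, so check the version you installed), you could drive generation token by token like this:

import onnxruntime_genai as og

model = og.Model("/path/to/folder/containing/onnx/model/and/genai/config/json/file")
tokenizer = og.Tokenizer(model)
tokenizer_stream = tokenizer.create_stream()

prompt = "your prompt here"
params = og.GeneratorParams(model)
params.input_ids = tokenizer.encode(prompt)
params.set_search_options(temperature=0, max_length=512)

# Drive the generation loop manually instead of calling model.generate
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    new_token = generator.get_next_tokens()[0]
    # Decode and stream each token as it is produced
    print(tokenizer_stream.decode(new_token), end="", flush=True)

This lets you stream partial output, apply your own stopping criteria, or post-process tokens before decoding.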
Only the part that you gave needs to be changed, right? And we need to download the corresponding ONNX model of Phi-3? Also, will it run on a CPU-only machine?
The answer is yes to all of your questions. The Phi-3 mini 128K ONNX models are uploaded here. Here's an example of how you can download just the CPU model with accuracy level = 4 using the Hugging Face CLI.
# Install Hugging Face CLI
$ pip install huggingface_hub[cli]
# Download just the CPU model with accuracy level = 4 to a local directory named 'phi3-mini-128k-instruct-onnx'
$ huggingface-cli download microsoft/Phi-3-mini-128k-instruct-onnx --include cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/* --local-dir ./phi3-mini-128k-instruct-onnx --local-dir-use-symlinks False
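If you prefer to do the download from Python instead of the shell, a roughly equivalent call using the huggingface_hub library (installed by the command above) is:

from huggingface_hub import snapshot_download

# Download only the CPU int4 (accuracy level 4) variant into a local directory
snapshot_download(
    repo_id="microsoft/Phi-3-mini-128k-instruct-onnx",
    allow_patterns=["cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/*"],
    local_dir="./phi3-mini-128k-instruct-onnx",
)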
Can you provide a basic example of how to use these with an agentic approach?
In your Python code, you can replace
open_api_key = os.getenv("OPENAI_API_KEY")
os.environ['LANGCHAIN_TRACING_V2'] = 'true'
os.environ['LANGCHAIN_API_KEY'] = os.getenv("LANGCHAIN_API_KEY")
llm = ChatOpenAI(model_name="gpt-4-0125-preview", temperature=0, openai_api_key=open_api_key)
with
import onnxruntime_genai as og
model = og.Model("./phi3-mini-128k-instruct-onnx/cpu_and_mobile/cpu-int4-rtn-block-32-acc-level-4/")
tokenizer = og.Tokenizer(model)
to initialize the Phi-3 mini model for CPU.
Then you can define the following function
def invoke(query):
    input_tokens = tokenizer.encode(query)
    params = og.GeneratorParams(model)
    params.input_ids = input_tokens
    params.set_search_options(temperature=0)
    output_tokens = model.generate(params)
    response = tokenizer.decode(output_tokens)
    return response
and replace any references to response = llm.invoke(query) in your code with response = invoke(query), and any references to response = llm.invoke(prompt) with response = invoke(prompt).
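As a minimal illustration of the agentic part (the agent functions and routing rule below are hypothetical placeholders, not something from your original code), you can use the decoded response from invoke to decide which agent handles a request:

def route(query):
    # Ask the local Phi-3 model which kind of agent should handle the query.
    decision = invoke(f"Classify this request as either 'search' or 'math': {query}")
    if "math" in decision.lower():
        return math_agent(query)    # hypothetical math agent
    return search_agent(query)      # hypothetical search agent

The same pattern extends to however many agents your LangChain workflow defines; the local Phi-3 model simply replaces the hosted LLM wherever a completion is needed.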