Support scenarios where inputs_embeds is a model input
eres313 opened this issue · comments
Hi, while converting the model with extra_options = {"exclude_embeds": 1} I get the error below while loading the model:
# Inference
import onnxruntime_genai as og
import time
model = og.Model(output_dir)
RuntimeError: Error encountered while parsing './output_dir/genai_config.json' JSON Error: Unknown value: inputs_embeds at line 20 index 49
Here is the genai_config.json file:
{
"model": {
"bos_token_id": 1,
"context_length": 4096,
"decoder": {
"session_options": {
"log_id": "onnxruntime-genai",
"provider_options": [
{
"cuda": {
"enable_cuda_graph": "0"
}
}
]
},
"filename": "model.onnx",
"head_size": 96,
"hidden_size": 3072,
"inputs": {
"inputs_embeds": "inputs_embeds",
"attention_mask": "attention_mask",
"position_ids": "position_ids",
"past_key_names": "past_key_values.%d.key",
"past_value_names": "past_key_values.%d.value"
},
"outputs": {
"logits": "logits",
"present_key_names": "present.%d.key",
"present_value_names": "present.%d.value"
},
"num_attention_heads": 32,
"num_hidden_layers": 32,
"num_key_value_heads": 32
},
"eos_token_id": 32000,
"pad_token_id": 32000,
"type": "phi3",
"vocab_size": 32015
},
"search": {
"diversity_penalty": 0.0,
"do_sample": false,
"early_stopping": true,
"length_penalty": 1.0,
"max_length": 4096,
"min_length": 0,
"no_repeat_ngram_size": 0,
"num_beams": 1,
"num_return_sequences": 1,
"past_present_share_buffer": false,
"repetition_penalty": 1.0,
"temperature": 1.0,
"top_k": 1,
"top_p": 1.0
}
}
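The parse failure can be reproduced independently of the runtime: the "inputs" section names inputs_embeds where older onnxruntime-genai releases only recognized a fixed set of keys. The sketch below is a hypothetical stand-library re-creation of that behavior for illustration, not the actual (C++) config parser, and KNOWN_INPUTS is an assumed key set:

```python
import json

# Trimmed-down "inputs" section from the genai_config.json above.
config = json.loads("""
{
  "inputs": {
    "inputs_embeds": "inputs_embeds",
    "attention_mask": "attention_mask",
    "position_ids": "position_ids"
  }
}
""")

# Hypothetical key set mimicking a 0.2.0-era parser, which did not
# know about "inputs_embeds" and rejected it as an unknown value.
KNOWN_INPUTS = {"input_ids", "attention_mask", "position_ids",
                "past_key_names", "past_value_names"}

unknown = [k for k in config["inputs"] if k not in KNOWN_INPUTS]
print(unknown)  # -> ['inputs_embeds']
```

Newer releases (0.3.0-rc2 and later) accept this key, which is why upgrading resolves the error.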
eres313 currently onnxruntime-genai does not support executing a decoder-only model with input embeddings as the input.
Could you share some details on how you plan to retrieve the input embeddings for your use case before they are passed into onnxruntime-genai as model inputs? Do you intend to do this outside of onnxruntime?
Yes, my input embeddings are fixed and I don't want to recompute them every time, so this will happen outside of onnxruntime.
The model takes inputs_embeds and attention_mask as inputs to generate the text.
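One way to realize "compute the embeddings once, outside of onnxruntime" is to materialize the embedding lookup yourself and cache the result to disk. A minimal NumPy sketch, where the embedding matrix, its dimensions, and the token ids are small placeholder stand-ins for the real model weights (the config above uses vocab_size 32015 and hidden_size 3072):

```python
import numpy as np

# Placeholder dimensions; a real Phi-3 table would be (32015, 3072).
vocab_size, hidden_size = 100, 8
rng = np.random.default_rng(0)
embedding_matrix = rng.standard_normal((vocab_size, hidden_size),
                                       dtype=np.float32)

token_ids = np.array([[1, 45, 7]])           # placeholder prompt ids
inputs_embeds = embedding_matrix[token_ids]  # (batch, seq_len, hidden)

# Cache once; later runs load the file instead of repeating the lookup.
np.save("inputs_embeds.npy", inputs_embeds)
cached = np.load("inputs_embeds.npy")
assert cached.shape == (1, 3, hidden_size)
```

With a model exported via exclude_embeds, an array shaped like this is what would be fed to its inputs_embeds input once the runtime supports it.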
I also have the same issue while running the Phi-3 vision ONNX model.
henrywang0314 running the Phi-3 vision ONNX model should not give you that error. What version of onnxruntime-genai are you using? Could you share the output of pip list | grep onnxruntime-genai if you are on Linux?
Yes, my input embeddings are fixed, I don't want to recompute them every time, so it will be outside of onnxruntime.
OrkhanHI We can add support in onnxruntime-genai for scenarios where the input is inputs_embeds. For now, this is not supported.
Same issue with phi-3-vision when following the instructions here: https://onnxruntime.ai/docs/genai/tutorials/phi3-v.html
pip list | grep onnxruntime-genai
onnxruntime-genai 0.2.0
python phi3v.py -m cpu-int4-rtn-block-32-acc-level-4
Loading model...
Traceback (most recent call last):
File "/home/cbiebers/devel/onnx-phi-3-vision/phi3v.py", line 66, in
run(args)
File "/home/cbiebers/devel/onnx-phi-3-vision/phi3v.py", line 16, in run
model = og.Model(args.model_path)
RuntimeError: Error encountered while parsing 'cpu-int4-rtn-block-32-acc-level-4/genai_config.json' JSON Error: Unknown value: inputs_embeds at line 14 index 49
Please use the release candidate 0.3.0-rc2. Add --pre to your pip install command:
pip install --pre onnxruntime-genai