huggingface / optimum

System Info

Optimum version: d87efb2
Transformers version: d479665
ONNXRuntime version: 1.17.1
ONNX version: 1.15.0

Who can help?

@michaelbenayoun @echarlaix

Reproduction (minimal, reproducible, runnable)

I saw google/gemma-2b (and flavors) support was added in #1714.

Repro:
optimum-cli export onnx --model google/gemma-2b-it --opset 17 --framework pt ./gemma

import numpy as np
import onnxruntime as ort
import torch
from onnxruntime_extensions import get_library_path
from transformers import AutoModelForCausalLM

hf_model = AutoModelForCausalLM.from_pretrained('google/gemma-2b-it')

so = ort.SessionOptions()
so.register_custom_ops_library(get_library_path()) # just in case
ort_model = ort.InferenceSession('/gemma/path/model.onnx', so)

hf_inputs = { 'input_ids': torch.LongTensor([[1,2,3,4,5]]), 'attention_mask': torch.Tensor([[1,1,1,1,1]]), 'position_ids': torch.LongTensor([[0,1,2,3,4]]) }
hf_logits = hf_model(**hf_inputs)['logits']

# ONNX Runtime takes NumPy arrays as inputs; int64 is assumed here for the ids and the mask
ort_inputs = {
  'input_ids': np.array([[1,2,3,4,5]], dtype=np.int64),
  'attention_mask': np.array([[1,1,1,1,1]], dtype=np.int64),
  'position_ids': np.array([[0,1,2,3,4]], dtype=np.int64),
}
# First pass: zero-length past key/value tensors for each of the 18 layers
for i in range(18):
  ort_inputs[f'past_key_values.{i}.key'] = np.zeros((1,1,0,256), dtype=np.float32)
  ort_inputs[f'past_key_values.{i}.value'] = np.zeros((1,1,0,256), dtype=np.float32)

ort_logits = ort_model.run(None, ort_inputs)[0]

# Now compare hf_logits and ort_logits

Below are logits values produced by these methods.
hf_logits.txt
ort_logits.txt

Expected behavior

I expect the logits in both cases to at least be similar values given the same inputs - of course, the main difference in the inputs here is the inclusion of past_key_values.*.(key|value) as input to the ORT model. With other converted models, I've done as in the sample, passing a zero-length, appropriately-dimensioned numpy array. Hopefully this is as simple as changing the way I specify the "first pass" as opposed to a past_key_values pass :)

Covering my bases...

import torch
from optimum.onnxruntime import ORTModelForCausalLM

model = ORTModelForCausalLM.from_pretrained('/path/to/gemma')
logits = model(torch.LongTensor([[1,2,3,4,5]]), torch.Tensor([[1,1,1,1,1]]), torch.LongTensor([[0,1,2,3,4]]))['logits']

The logits here look like the pure ORT version in the original issue description.

Hi @jacob-vincent-mink, I see you haven't enabled eval mode in your comparison.
Running the following script:

import torch
from transformers import AutoModelForCausalLM
from optimum.onnxruntime import ORTModelForCausalLM

torch.manual_seed(0)
torch.cuda.manual_seed(0)

input_ids = torch.randint(10, 110, (1, 100), dtype=torch.long)
position_ids = torch.arange(100, dtype=torch.long).unsqueeze(0)
attention_mask = torch.ones(1, 100, dtype=torch.long)

ort_model = ORTModelForCausalLM.from_pretrained("./gemma")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")
model.eval()

with torch.inference_mode():
    output = model(input_ids, attention_mask=attention_mask, position_ids=position_ids)
    ort_output = ort_model(input_ids, attention_mask=attention_mask, position_ids=position_ids)

print(output.logits)
print(ort_output.logits)

torch.testing.assert_close(ort_output.logits, output.logits, rtol=1e-3, atol=1e-4)

I get almost identical logits:

Mismatched elements: 1615 / 25600000 (0.0%)
Greatest absolute difference: 0.0008251667022705078 at index (0, 8, 169499) (up to 0.0001 allowed)
Greatest relative difference: 18.47104263305664 at index (0, 8, 52871) (up to 0.001 allowed)

The elements that don't match seem to be mostly very near zero (hence the large relative difference):

>>> ort_output.logits[0, 96, 169953]
tensor(4.9114e-05)
>>> output.logits[0, 96, 169953]
tensor(1.6689e-06)
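
To make the near-zero effect concrete, here is a minimal sketch in plain Python that applies the element-wise rule `torch.testing.assert_close` is documented to use, `|actual - expected| <= atol + rtol * |expected|`, to the two values printed above. The helper name `closeness_report` is hypothetical and only serves the illustration.

```python
def closeness_report(actual, expected, rtol=1e-3, atol=1e-4):
    """Return (abs_diff, rel_diff, passes) for one pair of values.

    `passes` follows the rule |actual - expected| <= atol + rtol * |expected|.
    """
    abs_diff = abs(actual - expected)
    rel_diff = abs_diff / abs(expected)
    passes = abs_diff <= atol + rtol * abs(expected)
    return abs_diff, rel_diff, passes


# The two logits printed in the session above (ONNX Runtime first, PyTorch second).
abs_diff, rel_diff, passes = closeness_report(4.9114e-05, 1.6689e-06)
print(f"abs_diff={abs_diff:.3e} rel_diff={rel_diff:.2f} passes={passes}")
```

For this pair the relative difference is roughly 28 while the absolute difference stays below the `atol` of 1e-4, so a large relative error on a value this close to zero does not by itself indicate a broken export.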

In general, ONNX Runtime can't produce bit-identical logits, but they're close enough. For example, after a softmax, the probabilities match for well over 99% of the elements:

torch.testing.assert_close(ort_output.logits.softmax(-1), output.logits.softmax(-1))

Mismatched elements: 99 / 25600000 (0.0%)
Greatest absolute difference: 8.20457935333252e-05 at index (0, 12, 1) (up to 1e-05 allowed)
Greatest relative difference: 0.0003821254940703511 at index (0, 8, 1) (up to 1.3e-06 allowed)

@IlyasMoutawwakil thanks for the information! I will try this out and report back.

It’s worth noting that the ONNX model I’m having trouble with was converted with the optimum-cli - does the CLI also perform the call to eval() before/during conversion?

I’m trying to use the model in C# with ONNXRuntime, which does not have an equivalent eval() as far as I’m aware. Therefore I would expect that to be baked into the ONNX file itself.

The way this problem actually manifests is that the logits values in Python are “small”, while the logits values from my converted model are “large”, which leads to floating-point overflow in a more strongly typed language like C# when I try to run things like Softmax on the output. Moving everything to double is obviously a workaround, but not preferred given the overhead of touching every logit to do so.
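
Independently of why the exported logits are large, one common way to keep a softmax from overflowing is to subtract the maximum logit before exponentiating, so that every exponent is at most zero. The sketch below shows the arithmetic in plain Python; the function name `stable_softmax` and the sample values are illustrative assumptions, and the same steps should carry over to a C# loop over the logits.

```python
import math


def stable_softmax(logits):
    """Softmax that shifts by the max logit so no exponent is positive."""
    peak = max(logits)
    # exp(x - peak) is in (0, 1]; very negative exponents underflow to 0.0.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)  # at least 1.0, because exp(peak - peak) == 1.0
    return [e / total for e in exps]


# Hypothetical large logits; a naive math.exp(1923.0) would raise OverflowError.
probs = stable_softmax([1923.0, 1900.0, -50.0])
print(probs)
```

The shift leaves the result mathematically unchanged, because the common factor `exp(-peak)` cancels between the numerator and the denominator.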

Hello,
I'm encountering an issue similar to what @jacob-vincent-mink described, with notably large logit values following the conversion of the Gemma model to ONNX using the CLI.

System Information:
ONNX: 1.15.0
ONNXRuntime: 1.17.1
Optimum: 1.18.0.dev0
Python: 3.10.13
Tokenizers: 0.15.2
PyTorch: 2.2.1
Transformers: 4.39.0.dev0

During the conversion process using the command:

optimum-cli export onnx --model google/gemma-2b-it validated_gemma/

the validation step indicates that "The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance of 1e-05." The logit values I'm obtaining are closely aligned with the ones @jacob-vincent-mink has reported. I've attached the output from the CLI conversion for reference (see cli_conversion_output.txt).

cli_conversion_output.txt

@IlyasMoutawwakil What conversion method are you using to achieve similar logit output?

Additionally, when I ran the script provided to demonstrate similar logits, my output did not match the expected outcome (see validation_output.txt).

validation_output.txt

@nickrwann @jacob-vincent-mink I cannot reproduce the issue using

optimum-cli export onnx --model google/gemma-2b-it gemma_onnx_with_past

with the environment:

optimum==d87efb25c98741501fbf6da0d270fc181611b795
transformers==d47966536cd5ac1ed7e140edac65f00f471f656f
torch==2.2.1
python==3.10.13
tokenizers==0.15.2
onnx==1.15.0
onnxruntime==1.17.1
accelerate not installed
Platform: Linux-5.4.0-166-generic-x86_64-with-glibc2.31

getting

(py310) felix@hf-dgx-01:~/optimum$ optimum-cli export onnx --model google/gemma-2b-it gemma_onnx_with_past --task text-generation-with-past
Framework not specified. Using pt to export the model.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00,  3.17it/s]
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Using framework PyTorch: 2.2.1+cu121
Overriding 1 configuration item(s)
        - use_cache -> True
/home/felix/transformers/src/transformers/models/gemma/modeling_gemma.py:969: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if seq_length > self.causal_mask.shape[-1]:
Saving external data to one file...
Post-processing the exported models...
Weight deduplication check in the ONNX export requires accelerate. Please install accelerate to run it.
Validating ONNX model gemma_onnx_with_past/model.onnx...
        -[✓] ONNX model output names match reference model (present.7.key, present.3.value, present.9.key, present.15.key, logits, present.10.value, present.2.key, present.1.key, present.11.value, present.6.value, present.9.value, present.5.key, present.5.value, present.13.key, present.2.value, present.3.key, present.11.key, present.4.key, present.17.key, present.6.key, present.4.value, present.12.value, present.8.key, present.16.value, present.0.key, present.16.key, present.17.value, present.12.key, present.7.value, present.8.value, present.14.key, present.10.key, present.15.value, present.14.value, present.1.value, present.0.value, present.13.value)
        - Validating ONNX Model output "logits":
                -[✓] (2, 16, 256000) matches (2, 16, 256000)
                -[x] values not close enough, max diff: 0.0001392364501953125 (atol: 1e-05)
        - Validating ONNX Model output "present.0.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.0.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.1.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.1.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.2.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.2.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.3.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.3.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.4.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.4.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.5.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.5.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.6.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.6.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.7.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.7.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.8.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.8.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.9.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.9.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.10.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.10.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.11.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.11.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.12.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.12.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 1.2069940567016602e-05 (atol: 1e-05)
        - Validating ONNX Model output "present.13.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.13.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.14.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.14.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.15.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.15.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.16.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 1.7523765563964844e-05 (atol: 1e-05)
        - Validating ONNX Model output "present.16.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.17.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.17.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
The ONNX export succeeded with the warning: The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance 1e-05:
- logits: max diff = 0.0001392364501953125
- present.12.value: max diff = 1.2069940567016602e-05
- present.16.key: max diff = 1.7523765563964844e-05.
 The exported model was saved at: gemma_onnx_with_past

Are you using Linux as well?

"It’s worth noting that the ONNX model I’m having trouble with was converted with the optimum-cli - does the CLI also perform the call to eval() before/during conversion?"

Yes:

optimum/optimum/exporters/onnx/convert.py

Line 534 in cf82249

model = model.eval()

@fxmarty Thanks for your comments. All my experiments have been done on Windows.

I did the following:

Create a new virtual environment
pip install git+https://github.com/huggingface/transformers.git git+https://github.com/huggingface/optimum.git onnxruntime onnx torch
optimum-cli export onnx --model google/gemma-2b-it --framework pt --task text-generation-with-past ./gemma-2b-it

I got the following result showing a massive max difference in the logits.

(venv) PS C:\BEA\Work\SQUID\takoyaki\conversion> optimum-cli export onnx --model google/gemma-2b-it --framework pt --task text-generation-with-past ./gemma-2b-it
config.json: 100%|████████████████████████████████████████████████████████████████████████████| 627/627 [00:00<?, ?B/s]
C:\BEA\Work\SQUID\takoyaki\conversion\venv\Lib\site-packages\huggingface_hub\file_download.py:149: UserWarning: `huggingface_hub` cache-system uses symlinks by default to efficiently store duplicated files but your machine does not support them in C:\Users\Jacob_Mink\AppData\Local\direnv\cache\huggingface\hub\models--google--gemma-2b-it. Caching files will still work but in a degraded version that might require more space on your disk. This warning can be disabled by setting the `HF_HUB_DISABLE_SYMLINKS_WARNING` environment variable. For more details, see https://huggingface.co/docs/huggingface_hub/how-to-cache#limitations.
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to see activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
  warnings.warn(message)
model.safetensors.index.json: 100%|███████████████████████████████████████████████| 13.5k/13.5k [00:00<00:00, 13.5MB/s]
model-00001-of-00002.safetensors: 100%|███████████████████████████████████████████| 4.95G/4.95G [01:20<00:00, 61.2MB/s]
model-00002-of-00002.safetensors: 100%|███████████████████████████████████████████| 67.1M/67.1M [00:01<00:00, 66.4MB/s]
Downloading shards: 100%|████████████████████████████████████████████████████████████████| 2/2 [01:22<00:00, 41.18s/it]
Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:02<00:00,  1.44s/it]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████| 137/137 [00:00<?, ?B/s]
tokenizer_config.json: 100%|██████████████████████████████████████████████████████████████| 2.16k/2.16k [00:00<?, ?B/s]
tokenizer.model: 100%|████████████████████████████████████████████████████████████| 4.24M/4.24M [00:00<00:00, 33.4MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████| 17.5M/17.5M [00:00<00:00, 30.7MB/s]
special_tokens_map.json: 100%|████████████████████████████████████████████████████████████████| 888/888 [00:00<?, ?B/s]
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Using framework PyTorch: 2.2.1+cpu
Overriding 1 configuration item(s)
        - use_cache -> True
C:\BEA\Work\SQUID\takoyaki\conversion\venv\Lib\site-packages\transformers\models\gemma\modeling_gemma.py:989: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if seq_length > self.causal_mask.shape[-1]:
C:\BEA\Work\SQUID\takoyaki\conversion\venv\Lib\site-packages\transformers\models\gemma\modeling_gemma.py:912: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  normalizer = torch.tensor(self.config.hidden_size**0.5, dtype=hidden_states.dtype)
Saving external data to one file...
Post-processing the exported models...
Weight deduplication check in the ONNX export requires accelerate. Please install accelerate to run it.
Validating ONNX model gemma-2b-it/model.onnx...
        -[✓] ONNX model output names match reference model (present.4.key, present.17.key, present.2.value, present.1.key, present.13.value, present.16.key, present.3.value, present.17.value, present.9.key, present.6.key, present.1.value, present.8.key, present.14.value, present.10.key, present.7.value, present.0.value, present.2.key, logits, present.14.key, present.16.value, present.12.key, present.9.value, present.12.value, present.6.value, present.10.value, present.5.key, present.0.key, present.5.value, present.4.value, present.7.key, present.8.value, present.15.key, present.11.key, present.3.key, present.15.value, present.11.value, present.13.key)
        - Validating ONNX Model output "logits":
                -[✓] (2, 16, 256000) matches (2, 16, 256000)
                -[x] values not close enough, max diff: 1923.869384765625 (atol: 1e-05)
        - Validating ONNX Model output "present.0.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.0.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.1.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.1.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.2.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.2.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.3.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.3.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.4.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.4.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.5.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.5.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.6.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.6.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.7.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.7.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.8.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 13.872063636779785 (atol: 1e-05)
        - Validating ONNX Model output "present.8.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 2.6775338649749756 (atol: 1e-05)
        - Validating ONNX Model output "present.9.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 21.117443084716797 (atol: 1e-05)
        - Validating ONNX Model output "present.9.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 4.053847789764404 (atol: 1e-05)
        - Validating ONNX Model output "present.10.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 18.636192321777344 (atol: 1e-05)
        - Validating ONNX Model output "present.10.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 3.941821336746216 (atol: 1e-05)
        - Validating ONNX Model output "present.11.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 12.09212589263916 (atol: 1e-05)
        - Validating ONNX Model output "present.11.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 5.360763072967529 (atol: 1e-05)
        - Validating ONNX Model output "present.12.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 12.459362983703613 (atol: 1e-05)
        - Validating ONNX Model output "present.12.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 3.4856996536254883 (atol: 1e-05)
        - Validating ONNX Model output "present.13.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 16.745580673217773 (atol: 1e-05)
        - Validating ONNX Model output "present.13.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 3.8524701595306396 (atol: 1e-05)
        - Validating ONNX Model output "present.14.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 11.162510871887207 (atol: 1e-05)
        - Validating ONNX Model output "present.14.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 8.507028579711914 (atol: 1e-05)
        - Validating ONNX Model output "present.15.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 9.137032508850098 (atol: 1e-05)
        - Validating ONNX Model output "present.15.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 10.084770202636719 (atol: 1e-05)
        - Validating ONNX Model output "present.16.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 9.555559158325195 (atol: 1e-05)
        - Validating ONNX Model output "present.16.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 4.076621055603027 (atol: 1e-05)
        - Validating ONNX Model output "present.17.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 12.241983413696289 (atol: 1e-05)
        - Validating ONNX Model output "present.17.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 10.537141799926758 (atol: 1e-05)
The ONNX export succeeded with the warning: The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance 1e-05:
- logits: max diff = 1923.869384765625
- present.8.key: max diff = 13.872063636779785
- present.8.value: max diff = 2.6775338649749756
- present.9.key: max diff = 21.117443084716797
- present.9.value: max diff = 4.053847789764404
- present.10.key: max diff = 18.636192321777344
- present.10.value: max diff = 3.941821336746216
- present.11.key: max diff = 12.09212589263916
- present.11.value: max diff = 5.360763072967529
- present.12.key: max diff = 12.459362983703613
- present.12.value: max diff = 3.4856996536254883
- present.13.key: max diff = 16.745580673217773
- present.13.value: max diff = 3.8524701595306396
- present.14.key: max diff = 11.162510871887207
- present.14.value: max diff = 8.507028579711914
- present.15.key: max diff = 9.137032508850098
- present.15.value: max diff = 10.084770202636719
- present.16.key: max diff = 9.555559158325195
- present.16.value: max diff = 4.076621055603027
- present.17.key: max diff = 12.241983413696289
- present.17.value: max diff = 10.537141799926758.
 The exported model was saved at: gemma-2b-it
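
In the summary above, the key/value outputs for layers 0 to 7 validate within tolerance and the mismatches start at present.8.key. A small script can pull that out of a saved log automatically. This is only a sketch: the line format is taken from the output above, while the helper names `parse_max_diffs` and `first_diverging_layer` and the 1e-3 threshold are assumptions made for illustration.

```python
import re

# Matches summary lines such as "- present.8.key: max diff = 13.872063636779785"
_DIFF_LINE = re.compile(
    r"^- (?P<name>[\w.]+): max diff = (?P<value>\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)",
    re.MULTILINE,
)


def parse_max_diffs(log_text):
    """Return {output_name: max_diff} for every summary line in the log."""
    return {m["name"]: float(m["value"]) for m in _DIFF_LINE.finditer(log_text)}


def first_diverging_layer(diffs, threshold=1e-3):
    """Return the lowest layer whose key/value diff exceeds threshold, or None."""
    layers = []
    for name, value in diffs.items():
        match = re.fullmatch(r"present\.(\d+)\.(?:key|value)", name)
        if match and value > threshold:
            layers.append(int(match.group(1)))
    return min(layers) if layers else None


# A few lines copied from the export summary above.
sample = """\
- logits: max diff = 1923.869384765625
- present.8.key: max diff = 13.872063636779785
- present.9.value: max diff = 4.053847789764404
- present.17.value: max diff = 10.537141799926758.
"""
diffs = parse_max_diffs(sample)
print(first_diverging_layer(diffs))  # prints 8
```

With the threshold well above the 1e-05 tolerance, the small mismatches from the Linux run (on the order of 1e-05) are ignored and only gross divergence like the one in this log is reported.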

@fxmarty @nickrwann Looks like this is an optimum-cli on Windows issue. I ran it in WSL and got the following:

(venv2) jacob@W11JQ9MZZ2:/mnt/c/BEA/Work/SQUID/takoyaki/conversion$ optimum-cli export onnx --model google/gemma-2b-it --framework pt --task text-generation-with-past ./gemma-2b-it
config.json: 100%|█████████████████████████████████████████████████████████| 627/627 [00:00<00:00, 1.75MB/s]
model.safetensors.index.json: 100%|████████████████████████████████████| 13.5k/13.5k [00:00<00:00, 39.2MB/s]
model-00001-of-00002.safetensors: 100%|████████████████████████████████| 4.95G/4.95G [01:09<00:00, 71.0MB/s]
model-00002-of-00002.safetensors: 100%|████████████████████████████████| 67.1M/67.1M [00:00<00:00, 68.9MB/s]
Downloading shards: 100%|█████████████████████████████████████████████████████| 2/2 [01:11<00:00, 35.75s/it]
Gemma's activation function should be approximate GeLU and not exact GeLU.
Changing the activation function to `gelu_pytorch_tanh`.if you want to use the legacy `gelu`, edit the `model.config` to set `hidden_activation=gelu`   instead of `hidden_act`. See https://github.com/huggingface/transformers/pull/29402 for more details.
Loading checkpoint shards: 100%|██████████████████████████████████████████████| 2/2 [00:02<00:00,  1.04s/it]
generation_config.json: 100%|███████████████████████████████████████████████| 137/137 [00:00<00:00, 396kB/s]
tokenizer_config.json: 100%|███████████████████████████████████████████| 2.16k/2.16k [00:00<00:00, 7.84MB/s]
tokenizer.model: 100%|█████████████████████████████████████████████████| 4.24M/4.24M [00:00<00:00, 32.0MB/s]
tokenizer.json: 100%|██████████████████████████████████████████████████| 17.5M/17.5M [00:00<00:00, 54.6MB/s]
special_tokens_map.json: 100%|█████████████████████████████████████████████| 888/888 [00:00<00:00, 3.78MB/s]
Using the export variant default. Available variants are:
    - default: The default ONNX variant.
Using framework PyTorch: 2.2.1+cu121
Overriding 1 configuration item(s)
        - use_cache -> True
/mnt/c/BEA/Work/SQUID/takoyaki/conversion/venv2/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py:989: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if seq_length > self.causal_mask.shape[-1]:
/mnt/c/BEA/Work/SQUID/takoyaki/conversion/venv2/lib/python3.10/site-packages/transformers/models/gemma/modeling_gemma.py:912: TracerWarning: torch.tensor results are registered as constants in the trace. You can safely ignore this warning if you use this function to create tensors out of constant variables that would be the same every time you call this function. In any other case, this might cause the trace to be incorrect.
  normalizer = torch.tensor(self.config.hidden_size**0.5, dtype=hidden_states.dtype)
Saving external data to one file...
Post-processing the exported models...
Weight deduplication check in the ONNX export requires accelerate. Please install accelerate to run it.
Validating ONNX model gemma-2b-it/model.onnx...
        -[✓] ONNX model output names match reference model (present.13.key, logits, present.15.key, present.1.key, present.0.key, present.0.value, present.10.key, present.12.value, present.14.key, present.17.key, present.6.value, present.4.value, present.3.value, present.5.value, present.14.value, present.4.key, present.12.key, present.10.value, present.5.key, present.9.value, present.7.value, present.6.key, present.1.value, present.16.key, present.8.value, present.11.value, present.3.key, present.17.value, present.9.key, present.16.value, present.11.key, present.2.value, present.2.key, present.15.value, present.7.key, present.8.key, present.13.value)
        - Validating ONNX Model output "logits":
                -[✓] (2, 16, 256000) matches (2, 16, 256000)
                -[x] values not close enough, max diff: 0.000118255615234375 (atol: 1e-05)
        - Validating ONNX Model output "present.0.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.0.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.1.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.1.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.2.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.2.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.3.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.3.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.4.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.4.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.5.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.5.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.6.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.6.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.7.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.7.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.8.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.8.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.9.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.9.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.10.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.10.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.11.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.11.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.12.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.12.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.13.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.13.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.14.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.14.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.15.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.15.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[x] values not close enough, max diff: 1.1026859283447266e-05 (atol: 1e-05)
        - Validating ONNX Model output "present.16.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.16.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.17.key":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
        - Validating ONNX Model output "present.17.value":
                -[✓] (2, 1, 32, 256) matches (2, 1, 32, 256)
                -[✓] all values close (atol: 1e-05)
The ONNX export succeeded with the warning: The maximum absolute difference between the output of the reference model and the ONNX exported model is not within the set tolerance 1e-05:
- logits: max diff = 0.000118255615234375
- present.15.value: max diff = 1.1026859283447266e-05.
 The exported model was saved at: gemma-2b-it

This clearly looks better (the largest diff on the logits is about 1.2e-4), so maybe the issue title should be "optimum-cli conversion of Gemma fails on Windows"?
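For anyone who wants to reproduce the kind of check reported in the validation log above on their own `hf_logits` and `ort_logits`, here is a minimal sketch. The helper names `max_abs_diff` and `values_close` are hypothetical, and this is not necessarily how Optimum computes its check internally; it only mirrors the "max diff" versus "atol" comparison shown in the log.

```python
import numpy as np


def max_abs_diff(reference, candidate):
    """Return the largest element-wise absolute difference between two arrays."""
    reference = np.asarray(reference, dtype=np.float32)
    candidate = np.asarray(candidate, dtype=np.float32)
    if reference.shape != candidate.shape:
        raise ValueError(f"shape mismatch: {reference.shape} vs {candidate.shape}")
    return float(np.max(np.abs(reference - candidate)))


def values_close(reference, candidate, atol=1e-5):
    """True when every element differs by at most atol (absolute tolerance only)."""
    return max_abs_diff(reference, candidate) <= atol


# Stand-in data for illustration only; real logits would come from the two models.
reference = np.zeros((2, 3), dtype=np.float32)
candidate = reference.copy()
candidate[0, 1] += 1.2e-4  # a diff of the same order as the one in the log

print(max_abs_diff(reference, candidate))
print(values_close(reference, candidate))             # fails at atol=1e-5
print(values_close(reference, candidate, atol=1e-3))  # passes at a looser atol
```

With the repro script, the PyTorch output would first need converting, e.g. `hf_logits.detach().numpy()`, before being passed in alongside `ort_logits`.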

@jacob-vincent-mink thank you. I'll have a Windows laptop at hand next week, so I can look into whether the bug is in PyTorch, ORT, or Optimum.

Related: #1310