mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks

Home Page: https://mlcommons.org/en/groups/inference

Llama2-70B - Accuracy scores lower than expected using the reference implementation

rgandikota opened this issue

We are using the Llama reference code.
The accuracy scores from our evaluation are lower than the ones listed in the README.md file.

Any insights are greatly appreciated!

At FP16, the accuracies are expected to be lower, but by what percentage?

We are observing a difference of about 25%, @arjunsuresh

@rgandikota I have only run the llama2-7b model, so I can't really comment on this. But this is the accuracy we got with bfloat16 and llama2-7b:
(42.0595, 19.853, 26.7729, 1194.4)
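
Those four numbers are ROUGE-1, ROUGE-2, ROUGE-L, and tokens per sample. As a minimal sketch, scores in that format can be recomputed from saved generations using the Hugging Face `evaluate` ROUGE metric; the reference eval script works along similar lines, though its exact aggregation may differ:

```python
# Hedged sketch: recompute (rouge1, rouge2, rougeL, tokens_per_sample)
# from lists of generated texts, reference texts, and per-sample token counts.
# Assumes the Hugging Face `evaluate` library with the "rouge" metric installed.
import evaluate
import numpy as np

def summarize_accuracy(predictions, references, token_counts):
    rouge = evaluate.load("rouge")
    scores = rouge.compute(predictions=predictions, references=references,
                           use_stemmer=True, use_aggregator=True)
    return (round(scores["rouge1"] * 100, 4),
            round(scores["rouge2"] * 100, 4),
            round(scores["rougeL"] * 100, 4),
            round(float(np.mean(token_counts)), 1))
```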

@arjunsuresh Thank you for the quick response! Do the submission rules mandate that we must meet the accuracy scores published in the reference implementation?

@rgandikota Yes, llama2 has two variants: one requires 99% of the reference accuracy to be met and the other requires 99.9%. If neither is met, the option is to submit to the "open" division, where there is no accuracy constraint. For example, the accuracy I shared would go to the open division.

Of the 4 accuracy metrics for llama2, the final one is tokens per sample, and for this the accuracy threshold is 90%, not 99/99.9%, as can be seen here

Also, to make a closed-division llama2 submission, both the Offline and Server scenarios must be run, and both must meet the accuracy constraints. In the open division, we can submit individual scenario results.
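
To make that concrete, here is a minimal sketch of the closed-division check described above, with the result and reference expressed as (rouge1, rouge2, rougeL, tokens_per_sample) tuples; the actual reference targets are the ones published in the README/rules, not anything shown here.

```python
# Hedged sketch of the closed-division accuracy check described above.
# `result` and `reference` are (rouge1, rouge2, rougeL, tokens_per_sample)
# tuples; the official reference targets come from the README/rules.
def meets_closed_division(result, reference,
                          rouge_factor=0.99, tokens_factor=0.90):
    rouge_ok = all(r >= rouge_factor * ref
                   for r, ref in zip(result[:3], reference[:3]))
    tokens_ok = result[3] >= tokens_factor * reference[3]
    return rouge_ok and tokens_ok
```

Passing rouge_factor=0.999 corresponds to the 99.9% variant, and a closed submission needs the check to hold for both the Offline and Server runs.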

@arjunsuresh Thank you so much for the insights. Really appreciate them.
Any pointers, from your experience, on what might cause the low scores? We are trying to determine whether it has something to do with our code.

@rgandikota You're welcome. Unfortunately, we haven't tried running LLAMA2 with TensorRT-LLM like you are, so I can't really help here. @nv-alicheng can you please help?

Hi @rgandikota, you need to become an NVIDIA partner to get TRT-LLM-related support for MLPerf. Please reach out to your NVIDIA customer representative or SAs to get help with this.

@nvzhihanj Thank you, we are looking into that option for the next rounds. This issue was meant as a more generic question about how to debug accuracy issues when using the reference implementation; I didn't mean to make it TRT-LLM specific.

To debug accuracy issues, you would want to first make sure the reference implementation works, and then compare the per-sequence, or even per-step/per-layer, outputs of the model.
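
One way to act on that advice, as a rough sketch: dump per-sample generations from both the reference implementation and your own stack, then rank samples by how far their ROUGE-L scores diverge and inspect the worst ones. The file names and JSON layout below are assumptions for illustration, not part of the reference code.

```python
# Hedged sketch: compare per-sample outputs of the reference implementation
# against another implementation and surface the largest ROUGE-L gaps.
# File names and the {"sample_id": "generated text"} layout are assumptions.
import json
import evaluate

rouge = evaluate.load("rouge")

def load_outputs(path):
    with open(path) as f:
        return json.load(f)

ref_out = load_outputs("reference_outputs.json")    # reference implementation
cand_out = load_outputs("candidate_outputs.json")   # implementation under test
targets = load_outputs("ground_truth.json")         # dataset targets

gaps = []
for sid in sorted(set(ref_out) & set(cand_out) & set(targets)):
    per_impl = {}
    for name, outputs in (("ref", ref_out), ("cand", cand_out)):
        scores = rouge.compute(predictions=[outputs[sid]],
                               references=[targets[sid]],
                               use_stemmer=True)
        per_impl[name] = scores["rougeL"]
    gaps.append((per_impl["ref"] - per_impl["cand"], sid))

# Inspect the ten samples where the implementation under test loses the most.
for gap, sid in sorted(gaps, reverse=True)[:10]:
    print(f"sample {sid}: rougeL gap {gap:.4f}")
```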