mlcommons / inference

Reference implementations of MLPerf™ inference benchmarks

Home Page: https://mlcommons.org/en/groups/inference

Llama2-70B - Accuracy scores lower than expected using the reference implementation

rgandikota opened this issue

We are using the Llama reference code.
The accuracy scores from our evaluation are lower than the ones listed in the README.md file.

Any insights are greatly appreciated!

At FP16, the accuracies are expected to be lower, but by what percentage?

We are observing a difference of about 25%, @arjunsuresh

@rgandikota I have only run the llama2-7b model, so I can't really comment on this. But this is the accuracy we got with bfloat16 and llama2-7b:
(42.0595, 19.853, 26.7729, 1194.4)
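
Those four numbers are ROUGE-1, ROUGE-2, ROUGE-L, and tokens per sample. As a minimal sketch, scores in that format can be recomputed from saved generations using the Hugging Face `evaluate` ROUGE metric; the reference eval script works along similar lines, though its exact aggregation may differ:

```python
# Hedged sketch: recompute (rouge1, rouge2, rougeL, tokens_per_sample)
# from lists of generated texts, reference texts, and per-sample token counts.
# Assumes the Hugging Face `evaluate` library with the "rouge" metric installed.
import evaluate
import numpy as np

def summarize_accuracy(predictions, references, token_counts):
    rouge = evaluate.load("rouge")
    scores = rouge.compute(predictions=predictions, references=references,
                           use_stemmer=True, use_aggregator=True)
    return (round(scores["rouge1"] * 100, 4),
            round(scores["rouge2"] * 100, 4),
            round(scores["rougeL"] * 100, 4),
            round(float(np.mean(token_counts)), 1))
```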

@arjunsuresh Thank you for the quick response! Do the submission rules mandate that we must meet the accuracy scores published in the reference implementation?

@rgandikota Yes, llama2 has two variants: one requires 99% of the reference accuracy to be met and the other requires 99.9%. If neither is met, the option is to submit to the "open" division, where there is no accuracy constraint. For example, the accuracy I shared would go to the open division.

Of the 4 accuracy metrics for llama2, the final one is tokens per sample, and for this the accuracy threshold is 90%, not 99/99.9%, as can be seen here

Also, to make a closed-division llama2 submission, both the Offline and Server scenarios must be run, and both must meet the accuracy constraints. In the open division, we can submit individual scenario results.
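
To make that concrete, here is a minimal sketch of the closed-division check described above, with the result and reference expressed as (rouge1, rouge2, rougeL, tokens_per_sample) tuples; the actual reference targets are the ones published in the README/rules, not anything shown here.

```python
# Hedged sketch of the closed-division accuracy check described above.
# `result` and `reference` are (rouge1, rouge2, rougeL, tokens_per_sample)
# tuples; the official reference targets come from the README/rules.
def meets_closed_division(result, reference,
                          rouge_factor=0.99, tokens_factor=0.90):
    rouge_ok = all(r >= rouge_factor * ref
                   for r, ref in zip(result[:3], reference[:3]))
    tokens_ok = result[3] >= tokens_factor * reference[3]
    return rouge_ok and tokens_ok
```

Passing rouge_factor=0.999 corresponds to the 99.9% variant, and a closed submission needs the check to hold for both the Offline and Server runs.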

@arjunsuresh Thank you so much for the insights. Really appreciate them.
Any pointers, from your experience, on what might cause the low scores? We are trying to determine whether it has something to do with our code.

@rgandikota You're welcome. Unfortunately, we haven't tried running LLAMA2 with TensorRT-LLM like you are, so I can't really help here. @nv-alicheng can you please help?

Hi @rgandikota, you need to become an NVIDIA partner to get TRT-LLM-related support for MLPerf. Please reach out to your NVIDIA customer representative or SAs to get help with this.

@nvzhihanj Thank you, we are looking into that option for the next rounds. This issue was meant as a more generic question about how to debug accuracy issues when using the reference implementation; I didn't mean to make it TRT-LLM specific.

To debug accuracy issues, you would want to first make sure the reference implementation works, and then compare the per-sequence, or even per-step/per-layer, outputs of the model.
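
One way to act on that advice, as a rough sketch: dump per-sample generations from both the reference implementation and your own stack, then rank samples by how far their ROUGE-L scores diverge and inspect the worst ones. The file names and JSON layout below are assumptions for illustration, not part of the reference code.

```python
# Hedged sketch: compare per-sample outputs of the reference implementation
# against another implementation and surface the largest ROUGE-L gaps.
# File names and the {"sample_id": "generated text"} layout are assumptions.
import json
import evaluate

rouge = evaluate.load("rouge")

def load_outputs(path):
    with open(path) as f:
        return json.load(f)

ref_out = load_outputs("reference_outputs.json")    # reference implementation
cand_out = load_outputs("candidate_outputs.json")   # implementation under test
targets = load_outputs("ground_truth.json")         # dataset targets

gaps = []
for sid in sorted(set(ref_out) & set(cand_out) & set(targets)):
    per_impl = {}
    for name, outputs in (("ref", ref_out), ("cand", cand_out)):
        scores = rouge.compute(predictions=[outputs[sid]],
                               references=[targets[sid]],
                               use_stemmer=True)
        per_impl[name] = scores["rougeL"]
    gaps.append((per_impl["ref"] - per_impl["cand"], sid))

# Inspect the ten samples where the implementation under test loses the most.
for gap, sid in sorted(gaps, reverse=True)[:10]:
    print(f"sample {sid}: rougeL gap {gap:.4f}")
```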