kingoflolz / mesh-transformer-jax

Model parallel transformers in JAX and Haiku

Discrepancy between results reported in this repo and in the NeoX paper

william-cerebras opened this issue

Hello. I recently noticed that the downstream numbers reported in this repo (and on the Hugging Face page) don't quite match what I get when I run the eval myself with the lm-evaluation-harness. The numbers I get are consistent with those reported in the GPT-NeoX paper. For example, this repo reports a zero-shot HellaSwag score of 66.1, while I (and the NeoX authors) get 51.8. I was hoping you could help me get to the bottom of the differences in evaluation methodology, since it seems like you also use the same eval harness. I have already ruled out dataset contamination as a source of the difference, as neither my evaluation nor the NeoX evaluation uses test-time decontamination. Thanks in advance for helping clarify.

For multiple-choice evals, the eval harness ranks the choices either by the sum of log-probabilities (reported as acc) or by the average log-probability per token (reported as acc_norm). This matches the evaluation procedure in the GPT-3 paper: "For most tasks we compare the per-token likelihood (to normalize for length)". I chose, for each model and benchmark combination, the evaluation method that maximized the score. For most benchmarks the difference is very small, but it makes a large difference on HellaSwag.
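To make the distinction concrete, here is a minimal sketch of the two ranking rules described above. It is not the harness's actual code, the log-probabilities are made up, and the function names are hypothetical; the real acc_norm normalization in the harness may also differ in detail (e.g. normalizing by answer length rather than a simple per-token average).

```python
# Hypothetical sketch: scoring one multiple-choice example two ways.
# In practice, each list holds the model's log-probability for every
# token of that answer continuation, conditioned on the prompt.
choice_logprobs = {
    "choice A": [-1.0, -1.2],              # short answer, sum -2.2, mean -1.10
    "choice B": [-0.8, -0.7, -0.6, -0.5],  # long answer,  sum -2.6, mean -0.65
}

def rank_by_sum(logprobs):
    # "acc"-style: pick the choice with the highest total log-probability.
    return max(logprobs, key=lambda c: sum(logprobs[c]))

def rank_by_mean(logprobs):
    # "acc_norm"-style: pick the choice with the highest average
    # log-probability per token, which normalizes for answer length.
    return max(logprobs, key=lambda c: sum(logprobs[c]) / len(logprobs[c]))

print(rank_by_sum(choice_logprobs))   # "choice A" (-2.2 > -2.6)
print(rank_by_mean(choice_logprobs))  # "choice B" (-0.65 > -1.10)
```

As the example shows, the two rules can pick different answers when the candidate continuations have different lengths, which is why the choice of metric matters so much on a benchmark like HellaSwag.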

@kingoflolz Thanks for the quick response - that clarifies things a lot