How were the normalized scores aggregated?

abdulfatir opened this issue 7 months ago

Thank you for releasing the code! This is a very interesting piece of work. Congrats on the NeurIPS acceptance! 🎉

As I understand it, the final scaled score is obtained by aggregating the normalized scores, and it looks like the aggregation uses the arithmetic mean. Please correct me if I am wrong.

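The arithmetic mean may not be the best way to summarize a normalized metric and can lead to misleading conclusions, because the ranking it produces can depend on which baseline the scores were normalized against. The geometric mean is a better way to aggregate normalized scores, since the ranking it produces is the same whichever baseline is used. Please check this paper out for details: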

Fleming, Philip J., and John J. Wallace. "How not to lie with statistics: the correct way to summarize benchmark results." Communications of the ACM 29.3 (1986): 218-221.
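To make the concern concrete, here is a minimal sketch in Python with made-up numbers (they are not taken from monash.csv, and the names `raw`, `normalize`, and `aggregate` are hypothetical, not from the llmtime code). It normalizes the errors of two models against each possible baseline and aggregates them both ways:

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # exp of the mean log; only valid for strictly positive values
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical raw errors (lower is better) on two datasets.
raw = {
    "model_a": [1.0, 4.0],
    "model_b": [2.0, 2.0],
}

def normalize(scores, baseline):
    # Divide each score by the baseline's score on the same dataset.
    return [s / b for s, b in zip(scores, baseline)]

def aggregate(baseline_name, mean_fn):
    # Aggregate every model's normalized scores with the given mean.
    baseline = raw[baseline_name]
    return {name: mean_fn(normalize(scores, baseline))
            for name, scores in raw.items()}

for baseline_name in raw:
    print("baseline:", baseline_name)
    print("  arithmetic:", aggregate(baseline_name, arithmetic_mean))
    print("  geometric: ", aggregate(baseline_name, geometric_mean))
```

With these toy numbers, the arithmetic mean favors whichever model is chosen as the baseline, while the geometric mean rates the two models as equal under both choices.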

Based on the numbers in https://github.com/ngruver/llmtime/blob/main/precomputed_outputs/deterministic_csvs/monash.csv, here are the plots that I get using the arithmetic and geometric mean.

Nate Gruver · Answer 1 · Mon Dec 04 2023 04:17:40 GMT+0800 (China Standard Time)

Thanks for the note Abdul!

The reported values are an arithmetic mean and you're correct that this is probably suboptimal. Genuine apologies for the error on my part.

I am planning to update the arXiv version with extended experiments from our NeurIPS camera-ready, and I'll include this correction as well.

Please let me know if you have any other comments.

Nate

Abdul Fatir · Answer 2 · Mon Dec 04 2023 16:24:36 GMT+0800 (China Standard Time)

@ngruver Thanks for your reply. It's an easy mistake to make. In fact, I only found out about the geometric mean idea very recently. Looking forward to the updated results.

Cheers!