ngruver / llmtime

Home Page:https://arxiv.org/abs/2310.07820

How were the normalized scores aggregated?

abdulfatir opened this issue

Thank you for releasing the code! This is a very interesting piece of work. Congrats on the NeurIPS acceptance! 🎉

As per my understanding, you're aggregating normalized scores to report the final scaled score. It looks like you're using the arithmetic mean to aggregate the normalized scores. Please correct me if I am wrong.

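The arithmetic mean is not a reliable way to summarize normalized scores: the resulting ranking can depend on which baseline the scores were normalized against, which may lead to misleading conclusions. A better way to aggregate normalized scores is the geometric mean, whose ranking does not change with the choice of baseline. Please check this paper out for details: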

Fleming, Philip J., and John J. Wallace. "How not to lie with statistics: the correct way to summarize benchmark results." Communications of the ACM 29.3 (1986): 218-221.
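
To make the concern concrete, here is a minimal sketch of the effect Fleming and Wallace describe. The error values and the names `method_a` and `method_b` are made up for illustration; they are not taken from `monash.csv` or from the paper. The sketch normalizes the same raw errors against two different baselines and compares which method looks best under each mean.

```python
import math

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def geometric_mean(xs):
    # exp of the mean log; requires strictly positive values
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

def normalize(errors, reference):
    # Divide every method's per-dataset error by the reference method's error.
    ref = errors[reference]
    return {name: [e / r for e, r in zip(vals, ref)]
            for name, vals in errors.items()}

def best_method(errors, reference, mean):
    # Lower aggregated normalized error is better.
    scores = {name: mean(vals)
              for name, vals in normalize(errors, reference).items()}
    return min(scores, key=scores.get)

# Hypothetical raw errors (lower is better) on two datasets.
errors = {
    "method_a": [1.0, 4.0],
    "method_b": [2.0, 2.5],
}

for mean in (arithmetic_mean, geometric_mean):
    for reference in errors:
        winner = best_method(errors, reference, mean)
        print(f"{mean.__name__}, normalized by {reference}: best = {winner}")
```

With these numbers, the arithmetic mean picks whichever method was used as the baseline, while the geometric mean picks `method_a` under both normalizations.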

Based on the numbers in https://github.com/ngruver/llmtime/blob/main/precomputed_outputs/deterministic_csvs/monash.csv, here are the plots that I get using the arithmetic and geometric mean.

[Two plots: aggregated normalized scores using the arithmetic mean and using the geometric mean]

Thanks for the note Abdul!

The reported values are an arithmetic mean and you're correct that this is probably suboptimal. Genuine apologies for the error on my part.

I am planning to update the arXiv version with extended experiments from our NeurIPS camera-ready, and I'll include this correction as well.

Please let me know if you have any other comments.

Nate

@ngruver Thanks for your reply. It's an easy mistake to make. In fact, I only found out about the geometric mean idea very recently. Looking forward to the updated results.

Cheers!