Urgent question about data aggregates

Question

slerman12 opened this issue 3 years ago

Hi, we compiled the Atari 100k results for DrQ, CURL, and DER, and the mean/median human-normalized scores are well below those reported in prior work, including work by co-authors of the rliable paper.

Our median human-normalized scores are all around 0.10 to 0.12.

Is this accurate? Of all of these, DER (the oldest of the algorithms) has the highest mean human-normalized score.

Rishabh Agarwal · Answer 1 · Fri Nov 19 2021 12:37:36 GMT+0800 (China Standard Time)

That doesn't seem right. The aggregate scores should match those in the figure below (which uses 10 runs), and they can be reproduced using the colab at bit.ly/statistical_precipice_colab:

[Figure: aggregate scores, 10 runs]
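For anyone checking their own numbers by hand, here is a minimal sketch of how mean and median human-normalized scores are usually computed. It is not the colab's code: the function names, game names, baselines, and scores below are all made up for illustration. It assumes the common convention of averaging over runs within each game first, then aggregating across games.

```python
from statistics import mean, median


def human_normalized(score, random_score, human_score):
    """Map a raw score so that random play is 0.0 and human play is 1.0."""
    return (score - random_score) / (human_score - random_score)


def aggregate(raw_scores, baselines):
    """Return mean and median human-normalized score across games.

    raw_scores: {game: [raw score for each run]}
    baselines:  {game: (random_score, human_score)}
    """
    per_game = []
    for game, runs in raw_scores.items():
        random_score, human_score = baselines[game]
        normalized = [human_normalized(s, random_score, human_score) for s in runs]
        # Average over runs within a game before aggregating across games.
        per_game.append(mean(normalized))
    return {"mean": mean(per_game), "median": median(per_game)}


# Hypothetical data, NOT real Atari baselines or results.
toy_baselines = {"game_a": (0.0, 100.0), "game_b": (10.0, 110.0), "game_c": (-20.0, 20.0)}
toy_scores = {"game_a": [10.0, 30.0], "game_b": [60.0, 60.0], "game_c": [-16.0, -12.0]}

result = aggregate(toy_scores, toy_baselines)
print(result)
```

If hand-computed aggregates come out far from published ones, two things worth checking are whether the random/human baselines match the ones used in the papers being compared against, and whether the scores were normalized per game before taking the mean or median.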