Urgent question about data aggregates
slerman12 opened this issue
Hi, we compiled the Atari 100k results from DrQ, CURL, and DER, and the mean and median human-normalized scores are well below those reported in prior work, including work by co-authors of the rliable paper.
The median human-normalized scores we get are all around 0.10 to 0.12.
Is this accurate? Of the three, DER (the oldest of the algorithms) has the highest mean human-normalized score.
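For reference, here is a minimal sketch of how we understand the aggregation, in case the discrepancy comes from the computation itself. It assumes the usual definition of the human-normalized score, (agent - random) / (human - random), averaged over runs per game and then aggregated across games. All numbers and names below are made-up placeholders, not the actual Atari 100k scores or our real pipeline.

```python
import numpy as np

def human_normalized(scores, random_scores, human_scores):
    """Human-normalized score: 0 = random policy, 1 = human reference."""
    return (scores - random_scores) / (human_scores - random_scores)

# Placeholder raw scores with shape (num_runs, num_games); NOT real Atari data.
agent_scores = np.array([
    [10.0, 30.0, 20.0],
    [20.0, 40.0, 20.0],
    [30.0, 50.0, 20.0],
])
random_scores = np.array([0.0, 10.0, 0.0])     # hypothetical per-game random baseline
human_scores = np.array([100.0, 110.0, 10.0])  # hypothetical per-game human baseline

# Normalize every run, then average over runs to get one score per game.
normalized = human_normalized(agent_scores, random_scores, human_scores)
per_game = normalized.mean(axis=0)

# Aggregate across games.
mean_hns = per_game.mean()
median_hns = np.median(per_game)

print("per-game:", per_game)
print("mean:", mean_hns, "median:", median_hns)
```

In this toy example a single game with a large normalized score pulls the mean well above the median, which is one way the two aggregates can diverge.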