bootstrapped ci (shows no variance) vs std (shows high variance)
MarcoMeter opened this issue
Hey folks!
I frequently follow rliable's guidelines to plot sample efficiency curves. I have now come across results where the 5 seeds of one experiment show a large variance, but the bootstrapped confidence interval suggests little to no variance. Here are two plots to visualize my issue:
The number of bootstrap replications is set to 50000.
Here is a colab notebook to reproduce these plots:
https://colab.research.google.com/drive/1hFtmCX-TLUcPuDKZZlTPq34R7bDz_NWI?usp=sharing
It would be great to hear your intuitions about this. Do you think this is just a coincidence or a bug?
edit:
- Lowering the reps to 3000 did not affect the plot
- Reshaping from (750 episodes, 101 checkpoints) to (5 runs, 150 episodes, 101 checkpoints) did not affect the plot
I'd have to take a closer look sometime next week, but usually this issue happens due to not bootstrapping over the correct axis (the README specifies the expected shape of the data). I think you want to switch the task and seed axes to fix this.
If you have a single task, you can set task_bootstrap=True so that you don't have to worry about shape-related issues. https://github.com/google-research/rliable/blob/master/rliable/library.py#L215
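To illustrate the point about the resampling axis, here is a minimal sketch in plain Python. It does not call rliable and is not rliable's implementation; the toy data (5 runs with strongly different means, 150 episodes each) and all function names are made up for illustration. Resampling whole runs gives a wide interval, whereas drawing a run independently for every episode column averages the run-to-run variance away and gives a very narrow one, which is roughly the symptom described above.

```python
import random
import statistics


def percentile_interval(samples, alpha=0.05):
    """Return the (alpha/2, 1 - alpha/2) percentile interval of the samples."""
    ordered = sorted(samples)
    last = len(ordered) - 1
    return ordered[int(alpha / 2 * last)], ordered[int((1 - alpha / 2) * last)]


def make_toy_scores(rng, run_means=(0.1, 0.3, 0.5, 0.7, 0.9),
                    num_episodes=150, noise=0.05):
    """Toy data (assumption): scores[run][episode], runs differ a lot."""
    return [[rng.gauss(mean, noise) for _ in range(num_episodes)]
            for mean in run_means]


def ci_resampling_whole_runs(scores, rng, reps=1000):
    """Bootstrap over runs: each replicate resamples entire runs."""
    run_means = [statistics.fmean(run) for run in scores]
    n_runs = len(run_means)
    stats = []
    for _ in range(reps):
        resampled = [run_means[rng.randrange(n_runs)] for _ in range(n_runs)]
        stats.append(statistics.fmean(resampled))
    return percentile_interval(stats)


def ci_resampling_run_per_episode(scores, rng, reps=1000):
    """Bootstrap that picks runs independently for every episode column.

    This mimics what happens when the episode axis is treated as the
    stratification axis: the run-to-run differences average out.
    """
    n_runs, n_episodes = len(scores), len(scores[0])
    stats = []
    for _ in range(reps):
        total = 0.0
        for episode in range(n_episodes):
            for _ in range(n_runs):
                total += scores[rng.randrange(n_runs)][episode]
        stats.append(total / (n_runs * n_episodes))
    return percentile_interval(stats)


rng = random.Random(0)
scores = make_toy_scores(rng)
wide = ci_resampling_whole_runs(scores, rng)
narrow = ci_resampling_run_per_episode(scores, rng)
print("resampling whole runs:     ", wide)
print("resampling run per episode:", narrow)
```

Both intervals are centered near the same mean; only the width differs, because the second scheme effectively treats 750 values as independent samples instead of 5 runs.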
Thanks for your reply @agarwl
My current take is to have the data in the shape of (5 runs, 150 episodes, 101 checkpoints).
Compared to your terms: checkpoints = frames, episodes = games, runs = training repetitions = tasks
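The shape bookkeeping discussed here can be sketched in plain Python on a tiny example. This is only an illustration under assumptions: it assumes the flat episode rows are stored run by run (contiguously), and the helper names are hypothetical. It groups the flat rows by run, averages the episodes within each run to get one curve per run, and then computes the standard deviation across runs per checkpoint, which is the quantity the std plot reflects.

```python
import statistics


def group_by_run(flat_rows, num_runs):
    """Split a flat list of episode rows into num_runs contiguous chunks.

    Assumes the rows are ordered run by run.
    """
    if len(flat_rows) % num_runs != 0:
        raise ValueError("number of rows must be divisible by num_runs")
    per_run = len(flat_rows) // num_runs
    return [flat_rows[i * per_run:(i + 1) * per_run] for i in range(num_runs)]


def per_run_curves(grouped):
    """Average over the episodes of each run: one curve per run."""
    return [[statistics.fmean(column) for column in zip(*run)]
            for run in grouped]


# Toy data: 2 runs x 3 episodes x 2 checkpoints, flattened to 6 rows.
flat = [
    [1.0, 2.0], [1.0, 2.0], [1.0, 2.0],  # run 0
    [3.0, 6.0], [3.0, 6.0], [3.0, 6.0],  # run 1
]
grouped = group_by_run(flat, num_runs=2)
curves = per_run_curves(grouped)
# Sample standard deviation across runs, separately for each checkpoint.
spread = [statistics.stdev(column) for column in zip(*curves)]

print(curves)  # [[1.0, 2.0], [3.0, 6.0]]
print(spread)
```

With the real data, the same grouping would turn 750 rows into 5 runs of 150 episodes, and any bootstrap meant to reflect seed variance would then resample along the run axis.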
This is the result with task_bootstrap = False, and this is the result with task_bootstrap = True. The intervals with task bootstrapping are more pronounced.
With task_bootstrap = True and a shape of (750 episodes, 101 checkpoints), the CIs are messed up.
Given the same shape of (750, 101) and task_bootstrap = False, the plot seems equivalent to the first one.