bootstrapped ci (shows no variance) vs std (shows high variance)

Question

MarcoMeter opened this issue a year ago · comments

Hey folks!

I frequently follow rliable's guidelines to plot sample efficiency curves. I have now come across results where the 5 seeds of one experiment show large variance, yet the bootstrapped confidence interval suggests little to no variance. Here are two plots to visualize my issue:

The number of bootstrap replications is set to 50000.
Here is a colab notebook to reproduce these plots:
https://colab.research.google.com/drive/1hFtmCX-TLUcPuDKZZlTPq34R7bDz_NWI?usp=sharing

It would be great to hear your intuitions about this. Do you think this is just a coincidence or a bug?

edit:

Lowering the reps to 3000 did not affect the plot.
Reshaping the data from (750 episodes, 101 checkpoints) to (5 runs, 150 episodes, 101 checkpoints) did not affect the plot either.

Marco Pleines · Answer 1 · Fri Aug 11 2023 15:20:31 GMT+0800 (China Standard Time)

The grey curve is the most problematic one. The IQM already shows strong volatility, while the stratified bootstrapped confidence interval is very narrow.

Using less data or further lowering the reps does not seem to affect the intervals.
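
For readers unfamiliar with the metric, here is a minimal NumPy sketch of the interquartile mean (IQM): the mean of the scores left after dropping the lowest and the highest 25%. The helper name `iqm` and the example scores are made up for illustration; rliable's own metric may treat edge cases differently.

```python
import numpy as np


def iqm(scores):
    """Interquartile mean: mean after trimming 25% of the values from each end."""
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * flat.size)  # number of values trimmed from each end
    return flat[cut:flat.size - cut].mean()


# Made-up scores with one outlier.
scores = [0, 1, 2, 3, 4, 5, 6, 100]
iqm_value = iqm(scores)        # averages the middle values [2, 3, 4, 5]
plain_mean = np.mean(scores)   # pulled upwards by the outlier
print(iqm_value, plain_mean)
```

The outlier shifts the plain mean but not the IQM, which is why the IQM is a common point estimate for noisy RL scores.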

Rishabh Agarwal · Answer 2 · Mon Aug 14 2023 11:28:40 GMT+0800 (China Standard Time)

I'd have to take a closer look sometime next week, but this issue usually happens when the bootstrap is not applied over the correct axis (the README specifies the expected shape of the data). I think you want to switch the task and seed axes to fix this.

If you have a single task, you can set task_bootstrap=True so that you don't have to worry about shape-related issues. https://github.com/google-research/rliable/blob/master/rliable/library.py#L215
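
To illustrate why the resampling axis matters, here is a rough sketch in plain NumPy. It is not rliable's code, and it rests on several assumptions: the data is synthetic, the statistic is the mean rather than the IQM, and all variable names are hypothetical. It compares a percentile bootstrap where every episode column draws its run index independently with one where whole runs are resampled as units.

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs, n_episodes, reps = 5, 150, 2000

# Synthetic data (an assumption): runs differ a lot, episodes within a run barely.
run_level = np.array([0.0, 0.2, 0.5, 0.8, 1.0])
scores = run_level[:, None] + 0.01 * rng.normal(size=(n_runs, n_episodes))


def ci_width(stats):
    """Width of the 95% percentile interval of the bootstrap statistics."""
    low, high = np.percentile(stats, [2.5, 97.5])
    return high - low


# (a) Episodes treated as strata: every cell draws its own run index,
#     so the run-to-run differences average out across the 150 columns.
cell_idx = rng.integers(0, n_runs, size=(reps, n_runs, n_episodes))
cell_samples = scores[cell_idx, np.arange(n_episodes)]  # (reps, n_runs, n_episodes)
width_per_cell = ci_width(cell_samples.mean(axis=(1, 2)))

# (b) Whole runs resampled: all episodes of a drawn run stay together,
#     so the run-to-run differences remain visible in the interval.
run_idx = rng.integers(0, n_runs, size=(reps, n_runs))
run_samples = scores[run_idx]  # (reps, n_runs, n_episodes)
width_per_run = ci_width(run_samples.mean(axis=(1, 2)))

print(f"per-cell resampling:  CI width = {width_per_cell:.3f}")
print(f"whole-run resampling: CI width = {width_per_run:.3f}")
```

On this toy data the interval from (a) comes out far narrower than the one from (b), even though the spread between runs is large. That is the kind of mismatch between std and CI described in the question.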

Marco Pleines · Answer 3 · Mon Aug 14 2023 14:42:04 GMT+0800 (China Standard Time)

Thanks for your reply @agarwl

My current take is to have the data in the shape of (5 runs, 150 episodes, 101 checkpoints).
Mapped to your terms: checkpoints = frames, episodes = games, runs = training repetitions = tasks
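
A small NumPy sketch of this layout, using random placeholder scores (an assumption) in place of the real data. It shows how the spread across runs can be read off the 3-D array, and that the flat (750, 101) shape only maps back to runs if its rows are ordered run by run, which is also an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs, n_episodes, n_checkpoints = 5, 150, 101

# Placeholder scores (an assumption), indexed as [run, episode, checkpoint].
scores = rng.random((n_runs, n_episodes, n_checkpoints))

# One curve per run: average over the episode axis.
per_run_curves = scores.mean(axis=1)                   # shape (5, 101)

# Spread across runs at every checkpoint (sample standard deviation).
std_across_runs = per_run_curves.std(axis=0, ddof=1)   # shape (101,)

# Flattening runs and episodes drops the information which run a row came from.
flat = scores.reshape(-1, n_checkpoints)               # shape (750, 101)

# Reshaping back is only valid if the 750 rows are stored run by run.
restored = flat.reshape(n_runs, n_episodes, n_checkpoints)

print(per_run_curves.shape, std_across_runs.shape, flat.shape)
```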

This is the result if task_bootstrap = False

and if task_bootstrap = True. The intervals with task bootstrapping are more pronounced.

With task_bootstrap = True and a shape of (750 episodes, 101 checkpoints), the CIs are messed up.

Given the same shape of (750, 101) and task_bootstrap = False, the plot seems equivalent to the first one.