google-research / rliable

[NeurIPS'21 Outstanding Paper] Library for reliable evaluation on RL and ML benchmarks, even with only a handful of seeds.

Home Page: https://agarwl.github.io/rliable


bootstrapped ci (shows no variance) vs std (shows high variance)

MarcoMeter opened this issue · comments

Hey folks!

I frequently follow rliable's guidelines to plot sample efficiency curves. I have now come across results where the 5 seeds of one experiment show large variance, but the bootstrapped confidence interval suggests little to no variance. Here are two plots to visualize my issue:

[Image: comparison(1)]

The number of bootstrap replications is set to 50000.
Here is a colab notebook to reproduce these plots:
https://colab.research.google.com/drive/1hFtmCX-TLUcPuDKZZlTPq34R7bDz_NWI?usp=sharing

It would be great to hear your intuitions about this. Do you think this is just a coincidence or a bug?

edit:

  • Lowering the reps to 3000 did not affect the plot
  • Reshaping from 750 episodes, 101 checkpoints to 5 runs, 150 episodes, 101 checkpoints did not affect the plot

[Image: emm]

The grey curve is the most problematic one. The IQM already shows strong volatility, while the stratified bootstrapped confidence interval is very narrow.

Using less data or further lowering the reps does not seem to affect the intervals.

I'd have to take a closer look sometime next week, but this issue usually happens when the bootstrap is not done over the correct axis (the README specifies the expected shape of the data). I think you want to switch the task and seed axes to fix this.

If you have a single task, you can set task_bootstrap=True so that you don't have to worry about shape-related issues: https://github.com/google-research/rliable/blob/master/rliable/library.py#L215
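
To illustrate why the resampling axis matters, here is a minimal plain-Python sketch. It is not rliable's code: the function names, the synthetic data, and the use of the mean instead of the IQM are all assumptions made for illustration. It compares a percentile bootstrap that resamples whole runs with one that treats every episode index as its own stratum and resamples the 5 runs independently inside each stratum. In the second scheme the run-to-run differences average out across the 150 strata, so the interval comes out much narrower.

```python
import random
import statistics


def percentile_interval(estimates, alpha=0.05):
    """Return the (alpha/2, 1 - alpha/2) percentile interval of the estimates."""
    ordered = sorted(estimates)
    last = len(ordered) - 1
    return ordered[int(alpha / 2 * last)], ordered[int((1 - alpha / 2) * last)]


def ci_resampling_runs(scores, reps, rng):
    """Resample whole runs with replacement; the runs are the sampling unit."""
    run_means = [statistics.fmean(run) for run in scores]
    n_runs = len(run_means)
    estimates = [
        statistics.fmean(rng.choices(run_means, k=n_runs)) for _ in range(reps)
    ]
    return percentile_interval(estimates)


def ci_resampling_within_episodes(scores, reps, rng):
    """Treat each episode index as a stratum and resample runs inside it."""
    n_runs, n_episodes = len(scores), len(scores[0])
    estimates = []
    for _ in range(reps):
        total = 0.0
        for episode in range(n_episodes):
            # Independent draw of run indices for every episode stratum.
            drawn_runs = rng.choices(range(n_runs), k=n_runs)
            total += sum(scores[run][episode] for run in drawn_runs)
        estimates.append(total / (n_runs * n_episodes))
    return percentile_interval(estimates)


# Synthetic data (hypothetical): 5 runs that differ a lot from each other,
# 150 episodes per run with only a little noise around each run's level.
rng = random.Random(0)
run_levels = [0.1, 0.3, 0.5, 0.7, 0.9]
scores = [[level + rng.gauss(0.0, 0.02) for _ in range(150)] for level in run_levels]

runs_ci = ci_resampling_runs(scores, reps=300, rng=rng)
episodes_ci = ci_resampling_within_episodes(scores, reps=300, rng=rng)
runs_width = runs_ci[1] - runs_ci[0]
episodes_width = episodes_ci[1] - episodes_ci[0]
print("width when resampling runs:", runs_width)
print("width when resampling within episode strata:", episodes_width)
```

Whether this is what happens in the plots above depends on how the axes of the score array are interpreted, so treat it as an intuition aid rather than a diagnosis.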

Thanks for your reply @agarwl

My current take is to have the data in the shape of (5 runs, 150 episodes, 101 checkpoints).
Compared to your terms: checkpoints = frames, episodes = games, runs = training repetitions = tasks
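
For reference, the reshape from (750, 101) to (5, 150, 101) can be sketched as below. This assumes the 750 rows are stored run by run (rows 0 to 149 belong to run 0, and so on), which is an assumption and not something stated in this thread; the array contents are synthetic and the variable names are made up for illustration.

```python
import numpy as np

n_runs, n_episodes, n_checkpoints = 5, 150, 101

# Synthetic stand-in for the real scores: every row is filled with its run index,
# so it is easy to check which run each row ends up in after reshaping.
run_of_row = np.repeat(np.arange(n_runs), n_episodes)            # shape (750,)
flat = run_of_row[:, None] * np.ones((1, n_checkpoints))         # shape (750, 101)

# Contiguous blocks of 150 rows become one run each.
stacked = flat.reshape(n_runs, n_episodes, n_checkpoints)        # shape (5, 150, 101)
print(stacked.shape)
```

If the rows were interleaved instead (episode by episode across runs), the same call would silently mix runs together, and a reshape to (150, 5, 101) followed by a transpose of the first two axes would be needed.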

This is the result with task_bootstrap = False:

[Image: trxl_gt_ci_0]

And this is the result with task_bootstrap = True. The intervals with task bootstrapping are more pronounced.

[Image: trxl_gt_ci_1]

With task_bootstrap = True and a shape of (750 episodes, 101 checkpoints), the CIs are messed up.

[Image: trxl_gt_ci_2]

Given the same shape of (750, 101) and task_bootstrap = False, the plot seems equivalent to the first one.

[Image: trxl_gt_ci_3]