Spiky CRMDP Roadmap
jvmncs opened this issue
jvmncs commented
Main road
- Create toy environments
- Create new toy environments (@timorl's repo)
- Clean up toy environments for use with Gym API
- Add toy environments as a dependency (#38)
- Debug toy environments (david-lindner/safe-grid-gym#15)
- Refactor for use with Gym API (#32)
- Modify ai_safety_gridworlds_gym to fit our needs (@david-lindner's fork)
- Improve dependency management (#31)
- Switch all code referencing envs to use Gym env
- Improved tooling for hyperparameter tuning (e.g. Ray)
- Estimate compute costs and finalize logistics
- First guess for an upper bound: 1 agent x 4 environments x 3 experiments = 12 sets of hyperparameters to tune; 12 sets x ~30 training runs = 360 runs; 360 runs x ~2 hours ≈ 720 compute-hours
- Start experiments January 11
- Check if hparams tuned on Solver generalize to Cheater (vice versa too, but less important/rigorous)
- Investigate corrupt versions of harder environments
  - Maybe a bigger / more realistic boat race
  - Maybe a modified Atari env
  - Maybe a modified MuJoCo env
  - Maybe a modified BipedalWalker env
- Finish experiments February 15
- Deadline February 22
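The Gym refactor above standardizes every environment behind the classic `reset`/`step` interface. A minimal self-contained sketch of what a CRMDP env looks like behind that interface (no actual `gym` or `safe-grid-gym` dependency; the class name, corridor layout, and corruption values are all made up for illustration):

```python
class ToyCRMDPEnv:
    """Illustrative Gym-style CRMDP: a 1-D corridor where the reward the
    agent observes may be corrupted in certain cells. Not the real env."""

    def __init__(self, length=5, corrupt_cells=(4,)):
        self.length = length
        self.corrupt_cells = set(corrupt_cells)
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos  # observation

    def step(self, action):
        # action: 0 = move left, 1 = move right
        delta = 1 if action == 1 else -1
        self.pos = max(0, min(self.length - 1, self.pos + delta))
        true_reward = 1.0 if self.pos == self.length - 1 else 0.0
        # The agent only ever sees the (possibly corrupted) reward;
        # the true reward is tucked into `info` for evaluation only.
        observed = -1.0 if self.pos in self.corrupt_cells else true_reward
        done = self.pos == self.length - 1
        return self.pos, observed, done, {"true_reward": true_reward}


env = ToyCRMDPEnv()
obs = env.reset()
for _ in range(4):
    obs, reward, done, info = env.step(1)  # always move right
```

Everything that consumes environments can then be written against `reset`/`step` alone, which is what makes the switch to the Gym env a drop-in change.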
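The compute upper bound above is straight multiplication; spelling it out makes the total explicit (every figure is the roadmap's rough guess, not a measurement):

```python
# Rough upper-bound compute estimate from the roadmap (all figures are guesses).
agents = 1
environments = 4
experiments = 3  # Baseline, Cheater, Solver
hparam_sets = agents * environments * experiments  # sets to tune
runs_per_set = 30        # ~30 training runs per hyperparameter set
hours_per_run = 2        # rough guess at wall-clock per run
total_runs = hparam_sets * runs_per_set
total_hours = total_runs * hours_per_run
print(hparam_sets, total_runs, total_hours)
```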
Environments:
- TomatoWateringCRMDP
- TransitionBoatRaceCRMDP
- Toy environments
- corrupt corners (satisfies our assumptions for guaranteed learnability)
- corrupt path to goal (does not satisfy assumptions for guaranteed learnability)
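The two toy-environment variants above differ only in where the corruption sits relative to the optimal path. A schematic of the two reward maps (grid size, sign-flip corruption, and function names are illustrative assumptions, not the actual environments):

```python
def corrupt_corners_reward(true_reward, cell, size):
    """Flip the observed reward only in the grid's four corner cells.
    Corruption confined to a small known region off the optimal path is
    the variant that satisfies the assumptions for guaranteed
    learnability."""
    corners = {(0, 0), (0, size - 1), (size - 1, 0), (size - 1, size - 1)}
    return -true_reward if cell in corners else true_reward


def corrupt_path_reward(true_reward, cell, goal_row):
    """Flip the observed reward everywhere along the row leading to the
    goal, so the corruption lies on the path the agent must take and the
    learnability assumptions no longer hold."""
    row, _col = cell
    return -true_reward if row == goal_row else true_reward
```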
Experiments per env:
- Baseline (learns corrupt reward)
- Cheater (learns with access to true reward)
- Solver (learns intended behavior from corrupt reward)
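The three conditions differ only in which reward signal the learner optimizes. A minimal sketch (function name and values are made up; per the list above, the Solver receives the same corrupt signal as the Baseline and must recover the intended behavior by its algorithm, not by extra information):

```python
def training_signal(condition, observed_reward, true_reward):
    """Which scalar the agent trains on in each experimental condition."""
    if condition == "baseline":
        return observed_reward  # trains naively on the corrupt reward
    if condition == "cheater":
        return true_reward      # oracle access to the true reward
    if condition == "solver":
        # Same corrupt signal as baseline; the difference is in how the
        # agent handles it, which this sketch deliberately leaves out.
        return observed_reward
    raise ValueError(f"unknown condition: {condition}")
```

The Cheater gives an upper baseline on achievable performance, so the interesting comparison is how close the Solver gets to it relative to the Baseline.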