ASQL and ASAC: Average-Reward variants of SQL and SAC.
(Formerly EVAL, EigenVector-based Average-reward Learning).
Environments: Gridworlds, Gymnasium's classic control, and MuJoCo.
- fix visualizations in TODO
- add refs (Rawlik paper/thesis, prr)
- "bigger" tabular expt overrides: Add Heaven & Hell experiment in the tabular case
- Theory comparison overrides: Compare multilogu w/ and w/o the 1/A factor in chi calc.
- Batch theta overrides: Same w/ periodic updates of ref s,a,s'
- Implement LR schedule
- Create the logging folder when it is missing
- Support custom FPS for rendering envs (esp. Atari)
- Possibly use SB3 style: `:param train_freq: Update the model every train_freq steps. Alternatively pass a tuple of frequency and unit like (5, "step") or (2, "episode").` (see the sketch after this list)
- Prioritized Experience Replay
- Monitor FPS
- Monitor min/max of logu to watch for divergence
- Add learning rate decay through a scheduler
- Add "train_freq" rather than episodic training
- Add gradient clipping (also covered in the sketch after this list)
- More clever normalization to avoid logu divergence (currently just clamping)
- Merge Rawlik with U as an option, e.g. prior_update_interval=0 for no updates, otherwise use Rawlik iteration
- Switch to SB3 Replay Buffer
- Does stabilizing theta help stabilize logu? (i.e. fix theta to the ground-truth value)
- Test the use of clipping theta (min_reward, max_reward) and logu (no theoretical bounds, but -50/50 after norm. to avoid divergence)
- Which params most strongly affect logu oscillations?
- Which params most strongly affect logu divergence?
- Why does using off-policy (pi0) for exploration make logu diverge?
- Which activation function is best? softplus >> relu for u-learning
- Which aggregation of theta is best (min/mean/max)? Same for logu (min is suggested to help with over-optimistic behavior)
- Clipping theta
- Use target or online logu for exploration (greedy or not?)
- Smooth out theta learning
- Write more tests
- V learning with cloning
- UV learning
- Test UV learning with steady state from tabular
- Effective temperature tracking from Rawlik
- Rawlik scheme (PPI)
- Generate requirements
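A few of the items above (SB3-style `train_freq`, gradient clipping, and the theta/logu clamps) are easy to illustrate. Here is a minimal sketch, assuming PyTorch; every name in it (`should_train`, `clipped_update`, `clamp_estimates`, `max_grad_norm`, the clamp bounds) is hypothetical rather than taken from this repo:

```python
from typing import Tuple, Union

import torch
from torch import nn

TrainFreq = Union[int, Tuple[int, str]]


def should_train(train_freq: TrainFreq, steps: int, episodes: int) -> bool:
    """SB3-style schedule: an int means "every N steps"; a tuple like
    (5, "step") or (2, "episode") picks the counting unit explicitly."""
    if isinstance(train_freq, int):
        freq, unit = train_freq, "step"
    else:
        freq, unit = train_freq
    if unit not in ("step", "episode"):
        raise ValueError(f"Unknown train_freq unit: {unit!r}")
    count = steps if unit == "step" else episodes
    return count > 0 and count % freq == 0


def clipped_update(net: nn.Module, optimizer: torch.optim.Optimizer,
                   loss: torch.Tensor, max_grad_norm: float = 10.0) -> None:
    """One gradient step with gradient-norm clipping."""
    optimizer.zero_grad()
    loss.backward()
    nn.utils.clip_grad_norm_(net.parameters(), max_grad_norm)
    optimizer.step()


def clamp_estimates(theta: torch.Tensor, logu: torch.Tensor,
                    min_reward: float, max_reward: float,
                    logu_bound: float = 50.0) -> Tuple[torch.Tensor, torch.Tensor]:
    """theta is bounded by the reward range; logu has no theoretical bound,
    so clamp to +/- logu_bound (after normalization) to avoid divergence."""
    return theta.clamp(min_reward, max_reward), logu.clamp(-logu_bound, logu_bound)
```

A training loop would then call `should_train(train_freq, steps, episodes)` once per env step and run `clamp_estimates` after each update.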
I added these lines to the gymnasium/envs/__init__.py file:
```python
register(
    id="Simple-v0",
    entry_point="gymnasium.envs.classic_control.simple_env:SimpleEnv",
    max_episode_steps=10,
    reward_threshold=1.0,
)
```
I also placed `simple_env.py` in the classic control folder.
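For reference, Gymnasium also exposes `register` at the package top level, so the same registration can be done from this repo's own code without editing the installed package. A minimal sketch, assuming the `simple_env` module is importable at the entry point below:

```python
import gymnasium as gym

# Same registration, done at import time from user code instead of
# editing gymnasium/envs/__init__.py:
gym.register(
    id="Simple-v0",
    entry_point="gymnasium.envs.classic_control.simple_env:SimpleEnv",
    max_episode_steps=10,
    reward_threshold=1.0,
)

env = gym.make("Simple-v0")
obs, info = env.reset()
```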
- Life saver for MuJoCo setup: https://pytorch.org/rl/reference/generated/knowledge_base/MUJOCO_INSTALLATION.html
- Important line when facing a GL error: `export MUJOCO_GL="glfw"`
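The same variable can be set from Python, as long as it happens before MuJoCo is first imported or a renderer is created. A minimal sketch (the env id is just an example):

```python
import os

# Must be set before the first MuJoCo import or render call:
os.environ.setdefault("MUJOCO_GL", "glfw")

import gymnasium as gym

env = gym.make("HalfCheetah-v4", render_mode="rgb_array")
env.reset()
```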
Acrobot performance with logu (note the log-scale x-axis):
The same for SB3's DQN with their hyperparameters (HuggingFace):
Model-based ground truth comparisons with tabular algorithms:
Model-free ground truth comparisons:
For updating requirements.txt:

```
pip list --format=freeze > requirements.txt
```

The contents of requirements.txt are a bit sensitive for the GitHub Actions testing... (e.g. some conda-related entries have to be removed).
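Which entries actually break the CI run is environment-specific; below is a purely hypothetical post-processing sketch (the `conda`/`file:///` predicates are assumptions, not verified against this repo's workflow):

```python
from pathlib import Path

# Hypothetical filter: drop conda tooling and local file:// installs that
# won't resolve on CI. Adjust the predicates to whatever actually breaks.
path = Path("requirements.txt")
keep = [
    line for line in path.read_text().splitlines()
    if "file:///" not in line and not line.startswith("conda")
]
path.write_text("\n".join(keep) + "\n")
```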