JacobHA / u-chi-learning

A new average-reward entropy regularized RL algorithm: EVAL (EigenVector-based Average-reward Learning)

ASQL and ASAC: average-reward variants of Soft Q-Learning (SQL) and Soft Actor-Critic (SAC).

(Formerly EVAL, EigenVector-based Average-reward Learning).

Environments: gridworlds, Gymnasium's classic control suite, and MuJoCo.

TODOs:

  • Fix visualizations in todo
  • Add references (Rawlik paper/thesis, prr)
  • "Bigger" tabular experiment overrides: add the Heaven & Hell experiment in the tabular case
  • Theory comparison overrides: compare multilogu with and without the 1/A factor in the chi calculation
  • Batch theta overrides: same, with periodic updates of the reference s, a, s'
  • Implement an LR schedule (see the scheduler sketch after this list)
  • Create the logging folder when it is missing
  • Support a custom FPS for rendering envs (especially Atari)
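
A possible starting point for the LR schedule item above, as a minimal sketch: it assumes the agents train with a PyTorch optimizer, and the network, loss, and step counts here are placeholders rather than the repo's actual values.

import torch

net = torch.nn.Linear(4, 1)  # stand-in for the logu network
optimizer = torch.optim.Adam(net.parameters(), lr=3e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10_000, gamma=0.5)

for step in range(50_000):
    loss = net(torch.randn(32, 4)).pow(2).mean()  # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()  # halve the learning rate every 10k gradient steps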

New Features:

  • Possibly use the SB3 style: :param train_freq: Update the model every train_freq steps. Alternatively, pass a tuple of frequency and unit like (5, "step") or (2, "episode") (see the parsing sketch after this list).
  • Prioritized Experience Replay
  • Monitor FPS
  • Monitor min/max of logu to watch for divergence
  • Add learning rate decay through a scheduler
  • Add "train_freq" rather than episodic training
  • Add gradient clipping
  • Smarter normalization to avoid logu divergence (rather than just clamping)
  • Merge Rawlik with U as an option, e.g. prior_update_interval=0 for no updates, otherwise use Rawlik iteration
  • Switch to SB3 Replay Buffer
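
A rough sketch of the SB3-style train_freq handling mentioned above; the TrainFreq container mirrors SB3's convention but is re-declared here as an assumption, not imported from SB3.

from typing import NamedTuple, Tuple, Union

class TrainFreq(NamedTuple):
    frequency: int
    unit: str  # "step" or "episode"

def parse_train_freq(train_freq: Union[int, Tuple[int, str]]) -> TrainFreq:
    """Accept 5 (meaning every 5 steps) or a tuple like (2, "episode")."""
    if isinstance(train_freq, int):
        return TrainFreq(train_freq, "step")
    frequency, unit = train_freq
    if unit not in ("step", "episode"):
        raise ValueError(f"Unknown train_freq unit: {unit!r}")
    return TrainFreq(int(frequency), unit)

# e.g. parse_train_freq((2, "episode")) -> TrainFreq(frequency=2, unit="episode")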

Experimental questions:

  • Does stabilizing theta help stabilize logu? (i.e., fix theta to the ground-truth value)
  • Test clipping theta (to [min_reward, max_reward]) and logu (no theoretical bounds, but [-50, 50] after normalization to avoid divergence); see the clipping sketch after this list
  • Which params most strongly affect logu oscillations?
  • "..." affect logu divergence?
  • Why does using off-policy (pi0) for exploration make logu diverge?
  • Which activation function is best? (softplus >> ReLU for u-learning)
  • Which aggregation of theta is best (min/mean/max)? Same for logu (min is suggested to help with over-optimistic behavior)
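
A minimal sketch of the clipping question above; the bounds (min_reward, max_reward, the +/-50 cutoff) follow the list item, and everything else is a placeholder.

import torch

def clip_theta(theta: torch.Tensor, min_reward: float, max_reward: float) -> torch.Tensor:
    # The average-reward estimate should not leave the reward range.
    return theta.clamp(min_reward, max_reward)

def clip_logu(logu: torch.Tensor, bound: float = 50.0) -> torch.Tensor:
    # logu has no theoretical bounds; clamp after normalization only to
    # keep the network outputs from diverging numerically.
    return logu.clamp(-bound, bound)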

Features requiring experiments:

  • Clipping theta
  • Use the target or online logu for exploration (greedy or not?)
  • Smooth out theta learning (see the smoothing sketch after this list)
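
One way to smooth out theta learning is a Polyak-style exponential moving average; this is only a sketch, and the tau value and batch-estimate argument are assumptions.

def smooth_theta(theta: float, theta_batch_estimate: float, tau: float = 0.005) -> float:
    """Blend the running theta toward the latest batch estimate."""
    return (1.0 - tau) * theta + tau * theta_batch_estimate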

Long-Term TODOs:

  • Write more tests
  • V learning with cloning
  • UV learning
  • Test UV learning with steady state from tabular
  • Effective temperature tracking from Rawlik
  • Rawlik scheme (PPI)
  • Generate requirements

Notes:

I added this registration to the gymnasium/envs/__init__.py file:

register(
    id="Simple-v0",
    entry_point="gymnasium.envs.classic_control.simple_env:SimpleEnv",
    max_episode_steps=10,
    reward_threshold=1.0,
)

I also placed simple_env.py in the classic control folder.
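
With that registration in place (and simple_env.py in the classic control folder), the environment should be constructible like any other Gymnasium env, e.g.:

import gymnasium as gym

env = gym.make("Simple-v0")
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())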

Performance of logu on Acrobot (note the log-scale x-axis):

[figure: AUC curve]

And the same for SB3's DQN with their hyperparameters (from Hugging Face):

[figure: AUC curve]

Model-based ground truth comparisons with tabular algorithms:

[figure: eigenvector policy]

Model-free ground truth comparisons:

[figure: eigenvector policy]

For updating requirements.txt:

pip list --format=freeze > requirements.txt

The contents of requirements.txt are a bit sensitive for the GitHub Actions testing (e.g. some conda-specific entries have to be removed).
