RL-Sandbox

Selected algorithms and exercises from the book: Sutton, R. S. & Barto, A. G.: Reinforcement Learning: An Introduction. 2nd Edition, MIT Press, Cambridge, MA, 2018.

  • Results of each experiment are dumped to HDF5 files in the .dump directory.

  • By default, the gathered data are used to plot various charts for the experiment; a short sketch for inspecting the dump files directly is shown below.
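
Since h5py and pandas are among the dependencies, a dump can be inspected directly from Python. A minimal sketch, assuming an experiment has already written a file into .dump (the file name below is an assumption, not the repository's actual naming scheme):

import h5py

# Hypothetical path: adjust to whatever file your experiment wrote into .dump/.
dump_path = ".dump/narmedbandit.SampleAverage.h5"

with h5py.File(dump_path, "r") as f:
    # Walk the HDF5 hierarchy and print every dataset with its shape and dtype.
    def describe(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
    f.visititems(describe)

# If the file was written through pandas/PyTables, pandas.read_hdf(dump_path, key=...)
# can load a stored frame directly (the key name depends on how the data was dumped).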

Setup

Install

git clone https://github.com/ocraft/rl-sandbox.git
cd rl-sandbox
pip install -e .

Test

python setup.py test

Run

python -m rlbox.run --testbed=narmedbandit.SampleAverage
Command line parameters
--testbed   [required] Name of the testbed to use. Default: None
--start     Run the experiment using the chosen testbed. Default: true
--plot      Plot the data generated with the chosen testbed. Default: true
--help      Show a list of all flags. Default: false
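
For reference, flags like these would typically be declared with absl-py (one of the dependencies) roughly as below. This is an illustrative sketch, not the actual rlbox.run module; the flag names mirror the table above, everything else is assumed.

from absl import app, flags

FLAGS = flags.FLAGS

flags.DEFINE_string("testbed", None, "Name of the testbed to use.")
flags.DEFINE_bool("start", True, "Run the experiment using the chosen testbed.")
flags.DEFINE_bool("plot", True, "Plot the data generated with the chosen testbed.")
flags.mark_flag_as_required("testbed")

def main(argv):
    del argv  # unused
    if FLAGS.start:
        print(f"running testbed {FLAGS.testbed}")
    if FLAGS.plot:
        print(f"plotting results for {FLAGS.testbed}")

if __name__ == "__main__":
    app.run(main)

Boolean flags declared this way can be negated on the command line (e.g. --nostart or --plot=false), which should allow re-plotting previously dumped data without re-running the experiment.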

Requirements

  • python >= 3.6

  • absl-py >= 0.7.0

  • h5py >= 2.9.0

  • numba >= 0.42

  • numpy >= 1.15

  • matplotlib >= 3.0.2

  • pandas >= 0.24

  • tables >= 3.4

  • tqdm >= 4.31.1

Test dependencies

  • pytest-runner >= 4.2

  • pytest == 4.0.2

Solutions

Each entry gives the book section (with the example or exercise it covers) and the command that runs the corresponding testbed.

2.3 The 10-armed Testbed
    python -m rlbox.run --testbed=narmedbandit.SampleAverage

2.5 Tracking a Nonstationary Problem (Exercise 2.5)
    python -m rlbox.run --testbed=narmedbandit.WeightedAverage

2.6 Optimistic Initial Values
    python -m rlbox.run --testbed=narmedbandit.OptInitVal

2.7 Upper-Confidence-Bound Action Selection
    python -m rlbox.run --testbed=narmedbandit.Ucb

2.8 Gradient Bandit Algorithms
    python -m rlbox.run --testbed=narmedbandit.Gradient

2.10 Summary (parameter study)
    python -m rlbox.run --testbed=narmedbandit.ParamStudy

4.3 Policy Iteration (Example 4.2: Jack's Car Rental)
    python -m rlbox.run --testbed=car_rental_v1

4.3 Policy Iteration (Exercise 4.7)
    python -m rlbox.run --testbed=car_rental_v2

4.4 Value Iteration (Example 4.3: Gambler's Problem)
    python -m rlbox.run --testbed=gambler.0.4

4.4 Value Iteration (Exercise 4.9)
    python -m rlbox.run --testbed=gambler.0.25

4.4 Value Iteration (Exercise 4.9)
    python -m rlbox.run --testbed=gambler.0.55

5.7 Off-policy Monte Carlo Control (Exercise 5.12)
    python -m rlbox.run --testbed=racetrack

6.4 Sarsa: On-policy TD Control (Exercise 6.9)
    python -m rlbox.run --testbed=gridworld.windy

6.4 Sarsa: On-policy TD Control (Exercise 6.10)
    python -m rlbox.run --testbed=gridworld.windy_stochastic

7.2 n-step Sarsa
    python -m rlbox.run --testbed=gridworld.NStepSarsa

8.2 Dyna: Integrated Planning, Acting, and Learning
    python -m rlbox.run --testbed=maze.DynaQ

8.3 When the Model Is Wrong (Exercise 8.4)
    python -m rlbox.run --testbed=maze.DynaQ+

10.1 Episodic Semi-gradient Control (Example 10.1: Mountain Car Task)
    python -m rlbox.run --testbed=mountain_car.SemiGradientSarsa

12.7 Sarsa(λ)
    python -m rlbox.run --testbed=mountain_car.TrueSarsaLambda

13.5 Actor–Critic Methods
    python -m rlbox.run --testbed=mountain_car.ActorCritic

Experiments

All timings were measured on a PC with an i7-4770 CPU @ 3.4 GHz, 16 GB RAM, and a GeForce GTX 660, running CPython.

Each entry lists the testbed, its environment and run configuration, and the execution time ("Exe time") in seconds.

narmedbandit.SampleAverage

  • N-Armed Bandit [steps=1000, arms=10, stationary=True]

  • Runs: 2000

  • (smpl_avg, epsilon: 0.0)

  • (smpl_avg, epsilon: 0.01)

  • (smpl_avg, epsilon: 0.1)

Exe time: 11 s
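
For context, the sample-average testbeds follow the incremental update Q(a) <- Q(a) + (R - Q(a)) / N(a) with epsilon-greedy action selection (Sutton & Barto, Section 2.4). A minimal illustrative sketch of one run, not the repository's implementation:

import numpy as np

def sample_average_run(steps=1000, arms=10, epsilon=0.1, seed=0):
    # One run of an epsilon-greedy sample-average bandit (illustrative sketch).
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0.0, 1.0, arms)   # true action values (stationary testbed)
    Q = np.zeros(arms)                    # sample-average estimates
    N = np.zeros(arms)                    # action counts
    rewards = np.empty(steps)
    for t in range(steps):
        # Explore with probability epsilon, otherwise act greedily.
        a = rng.integers(arms) if rng.random() < epsilon else int(np.argmax(Q))
        r = rng.normal(q_true[a], 1.0)    # reward drawn around the true action value
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]         # incremental sample-average update
        rewards[t] = r
    return rewards

Averaging such runs over many randomly generated bandit problems (2000 in the testbed above) gives the learning curves the experiment plots.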

narmedbandit.WeightedAverage

  • N-Armed Bandit [steps: 10000, arms=10, stationary=False]

  • Runs: 2000

  • (smpl_avg, epsilon: 0.1)

  • (weight_avg, epsilon: 0.1, alpha: 0.2)

Exe time: 78 s

narmedbandit.OptInitVal

  • N-Armed Bandit [steps: 1000, arms=10, stationary=True]

  • Runs: 2000

  • (weight_avg, epsilon: 0.0, alpha: 0.1, bias: 5.0)

  • (weight_avg, epsilon: 0.1, alpha: 0.1, bias: 0.0)

Exe time: 7.51 s

narmedbandit.Ucb

  • N-Armed Bandit [steps: 1000, arms=10, stationary=True]

  • Runs: 2000

  • (smpl_avg, epsilon: 0.1)

  • (ucb, c: 2)

Exe time: 11.78 s

narmedbandit.Gradient

  • N-Armed Bandit [steps: 1000, arms=10, stationary=True, mean=4.0]

  • Runs: 2000

  • (gradient, alpha: 0.1, baseline: True)

  • (gradient, alpha: 0.4, baseline: True)

  • (gradient, alpha: 0.1, baseline: False)

  • (gradient, alpha: 0.4, baseline: False)

Exe time: 105 s

narmedbandit.ParamStudy

  • N-Armed Bandit [steps: 1000, arms=10, stationary=True]

  • Runs: 2000

  • (SMPL_AVG, epsilon: 1/128)

  • (SMPL_AVG, epsilon: 1/64)

  • (SMPL_AVG, epsilon: 1/32)

  • (SMPL_AVG, epsilon: 1/16)

  • (SMPL_AVG, epsilon: 1/8)

  • (SMPL_AVG, epsilon: 1/4)

  • (GRADIENT, alpha: 1/32)

  • (GRADIENT, alpha: 1/16)

  • (GRADIENT, alpha: 1/8)

  • (GRADIENT, alpha: 1/4)

  • (GRADIENT, alpha: 1/2)

  • (GRADIENT, alpha: 1)

  • (GRADIENT, alpha: 2)

  • (GRADIENT, alpha: 4)

  • (UCB, c: 1/16)

  • (UCB, c: 1/8)

  • (UCB, c: 1/4)

  • (UCB, c: 1/2)

  • (UCB, c: 1)

  • (UCB, c: 2)

  • (UCB, c: 4)

  • (WEIGHT_AVG, epsilon: 0.0, alpha: 0.1, bias: 1/4)

  • (WEIGHT_AVG, epsilon: 0.0, alpha: 0.1, bias: 1/2)

  • (WEIGHT_AVG, epsilon: 0.0, alpha: 0.1, bias: 1)

  • (WEIGHT_AVG, epsilon: 0.0, alpha: 0.1, bias: 2)

  • (WEIGHT_AVG, epsilon: 0.0, alpha: 0.1, bias: 4)

Exe time: 303 s

carrental.JackCarRentalV1

  • Jack’s Car Rental [max_move=5, max_cars=20, expct=[3, 4, 3, 2]]

  • gamma=0.9, epsilon=1.0

Exe time: 441 s (MDP generation) + 258 s (policy iteration)

carrental.JackCarRentalV2

  • Jack’s Car Rental [max_move=5, max_cars=20, expct=[3, 4, 3, 2], modified=True]

  • gamma=0.9, epsilon=1.0

Exe time: 440 s (MDP generation) + 219 s (policy iteration)

gambler.0.4

  • Gambler’s Problem [ph=0.4]

  • gamma=1.0, epsilon=1e-9

Exe time: 22 s

gambler.0.25

  • Gambler’s Problem [ph=0.25]

  • gamma=1.0, epsilon=1e-9

Exe time: 16 s

gambler.0.55

  • Gambler’s Problem [ph=0.55]

  • gamma=1.0, epsilon=0.01

Exe time: 11 s
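
The gambler testbeds above run value iteration on the gambler's problem, where V(s) is the probability of reaching the goal from capital s and sweeps continue until the largest value change falls below epsilon. A compact illustrative sketch of that algorithm, not the repository's code:

import numpy as np

def gambler_value_iteration(ph=0.4, goal=100, epsilon=1e-9):
    # V[s] = probability of reaching `goal` from capital s when heads occur with probability ph.
    V = np.zeros(goal + 1)
    V[goal] = 1.0
    while True:
        delta = 0.0
        for s in range(1, goal):
            # Stakes are limited by the current capital and the distance to the goal.
            stakes = range(1, min(s, goal - s) + 1)
            best = max(ph * V[s + a] + (1.0 - ph) * V[s - a] for a in stakes)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < epsilon:
            break
    return V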

racetrack

  • RaceTrack [steps=10000]

  • Runs: 50000

  • gamma=1.0

Exe time: 1091 s (episode generation) + 187 s (off-policy Monte Carlo learning)

gridworld.windy

  • WindyGridWorld [stochastic=False]

  • Runs: 200

  • gamma=1.0, alpha=0.5, epsilon=0.1

Exe time: 0.05 s

gridworld.windy_stochastic

  • WindyGridWorld [stochastic=True]

  • Runs: 200

  • gamma=1.0, alpha=0.5, epsilon=0.1

Exe time: 0.33 s

gridworld.NStepSarsa

  • WindyGridWorld [stochastic=False]

  • Runs: 200

  • n=3, gamma=1.0, alpha=0.5, epsilon=0.1

Exe time: 0.32 s
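
The windy-gridworld testbeds above are built on the one-step Sarsa update Q(S,A) <- Q(S,A) + alpha * [R + gamma * Q(S',A') - Q(S,A)]; the NStepSarsa testbed replaces the one-step return with an n-step return. A minimal sketch of the one-step case; the env interface (reset, step, actions) is an assumption, not the repository's API:

import random
from collections import defaultdict

def sarsa(env, episodes=200, gamma=1.0, alpha=0.5, epsilon=0.1):
    # Tabular one-step Sarsa (illustrative sketch). `env` is assumed to expose
    # reset() -> state, step(action) -> (next_state, reward, done), and actions (a list).
    Q = defaultdict(float)

    def policy(s):
        # Epsilon-greedy with respect to the current Q estimates.
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = policy(s2)
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q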

maze.DynaQ

  • Maze(maze_type=0)

  • Runs: 30

  • DYNA_Q, n=50, gamma=0.95, alpha=0.1, epsilon=0.1, episodes=50

  • DYNA_Q, n=5, gamma=0.95, alpha=0.1, epsilon=0.1, episodes=50

  • DYNA_Q, n=0, gamma=0.95, alpha=0.1, epsilon=0.1, episodes=50

Exe time: 18 s

maze.DynaQ+

  • Maze(maze_type=1)

  • Runs: 30

  • DYNA_Q, n=10, gamma=0.95, alpha=1.0, epsilon=0.1, episodes=50, kappa=0, steps=3000

  • DYNA_Q, n=10, gamma=0.95, alpha=1.0, epsilon=0.1, episodes=50, kappa=1e-4, steps=3000

  • DYNA_Q_V2, n=10, gamma=0.95, alpha=1.0, epsilon=0.1, episodes=50, kappa=1e-4, steps=3000

Exe time: 29 s
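
The maze testbeds are variants of tabular Dyna-Q: each real step is followed by n planning updates replayed from a learned deterministic model (Dyna-Q+ additionally adds an exploration bonus of kappa * sqrt(tau) to planning rewards for long-untried actions). A minimal sketch of the basic planning loop, reusing the same assumed env interface as the Sarsa sketch above; it is not the repository's implementation:

import random
from collections import defaultdict

def dyna_q(env, episodes=50, n_planning=10, gamma=0.95, alpha=0.1, epsilon=0.1):
    # Tabular Dyna-Q (illustrative sketch, not the repository's code).
    Q = defaultdict(float)
    model = {}  # (state, action) -> (reward, next_state) for visited pairs

    def policy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            # Direct reinforcement learning: one-step Q-learning update.
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in env.actions) - Q[(s, a)])
            # Model learning: assume a deterministic environment.
            model[(s, a)] = (r, s2)
            # Planning: n simulated updates from previously observed transitions.
            for _ in range(n_planning):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in env.actions) - Q[(ps, pa)])
            s = s2
    return Q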

mountain_car.SemiGradientSarsa

  • MountainCar()

  • Runs: 10

  • SEMIGRADIENT_SARSA, gamma=1.0, alpha=0.5, epsilon=0.0, episodes=500

Exe time: 52 s

mountain_car.TrueSarsaLambda

  • MountainCar()

  • Runs: 10

  • TRUE_SARSA_LAMBDA, gamma=1.0, alpha=0.5, epsilon=0.0, lmbda=0.9, episodes=500

Exe time: 51 s

mountain_car.ActorCriticLambda

  • MountainCar()

  • Runs: 10

  • ACTOR_CRITIC, gamma=1.0, alpha_w=0.2, alpha_theta=0.01, lambda_w=0.9, lambda_theta=0.9, episodes=500

Exe time: 115 s

Bibliography

  • Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press.

License: MIT License

