RL-Sandbox

Selected algorithms and exercises from the book: Sutton, R. S. & Barto, A. G.: Reinforcement Learning: An Introduction. 2nd Edition, MIT Press, Cambridge, MA, 2018.

  • Results of each experiment are dumped to HDF5 files in the .dump directory.

  • By default, the gathered data are used to plot various charts for the experiment; a short sketch for inspecting the dump files directly is shown below.
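
Since h5py and pandas are among the dependencies, a dump can be inspected directly from Python. A minimal sketch, assuming an experiment has already written a file into .dump (the file name below is an assumption, not the repository's actual naming scheme):

import h5py

# Hypothetical path: adjust to whatever file your experiment wrote into .dump/.
dump_path = ".dump/narmedbandit.SampleAverage.h5"

with h5py.File(dump_path, "r") as f:
    # Walk the HDF5 hierarchy and print every dataset with its shape and dtype.
    def describe(name, obj):
        if isinstance(obj, h5py.Dataset):
            print(f"{name}: shape={obj.shape}, dtype={obj.dtype}")
    f.visititems(describe)

# If the file was written through pandas/PyTables, pandas.read_hdf(dump_path, key=...)
# can load a stored frame directly (the key name depends on how the data was dumped).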

Setup

Install

git clone https://github.com/ocraft/rl-sandbox.git
cd rl-sandbox
pip install -e .

Test

python setup.py test

Run

python -m rlbox.run --testbed=narmedbandit.SampleAverage
Command line parameters
--testbed   [required] Name of the testbed to use. Default: None
--start     Run the experiment using the chosen testbed. Default: true
--plot      Plot the data generated with the chosen testbed. Default: true
--help      Show a list of all flags. Default: false
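
For reference, flags like these would typically be declared with absl-py (one of the dependencies) roughly as below. This is an illustrative sketch, not the actual rlbox.run module; the flag names mirror the table above, everything else is assumed.

from absl import app, flags

FLAGS = flags.FLAGS

flags.DEFINE_string("testbed", None, "Name of the testbed to use.")
flags.DEFINE_bool("start", True, "Run the experiment using the chosen testbed.")
flags.DEFINE_bool("plot", True, "Plot the data generated with the chosen testbed.")
flags.mark_flag_as_required("testbed")

def main(argv):
    del argv  # unused
    if FLAGS.start:
        print(f"running testbed {FLAGS.testbed}")
    if FLAGS.plot:
        print(f"plotting results for {FLAGS.testbed}")

if __name__ == "__main__":
    app.run(main)

Boolean flags declared this way can be negated on the command line (e.g. --nostart or --plot=false), which should allow re-plotting previously dumped data without re-running the experiment.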

Requirements

  • python >= 3.6

  • absl-py >= 0.7.0

  • h5py >= 2.9.0

  • numba >= 0.42

  • numpy >= 1.15

  • matplotlib >= 3.0.2

  • pandas >= 0.24

  • tables >= 3.4

  • tqdm >= 4.31.1

Test dependencies

  • pytest-runner >= 4.2

  • pytest == 4.0.2

Solutions

Each entry gives the book section (with the example or exercise it covers) and the command that runs the corresponding testbed.

2.3 The 10-armed Testbed
    python -m rlbox.run --testbed=narmedbandit.SampleAverage

2.5 Tracking a Nonstationary Problem (Exercise 2.5)
    python -m rlbox.run --testbed=narmedbandit.WeightedAverage

2.6 Optimistic Initial Values
    python -m rlbox.run --testbed=narmedbandit.OptInitVal

2.7 Upper-Confidence-Bound Action Selection
    python -m rlbox.run --testbed=narmedbandit.Ucb

2.8 Gradient Bandit Algorithms
    python -m rlbox.run --testbed=narmedbandit.Gradient

2.10 Summary (parameter study)
    python -m rlbox.run --testbed=narmedbandit.ParamStudy

4.3 Policy Iteration (Example 4.2: Jack's Car Rental)
    python -m rlbox.run --testbed=car_rental_v1

4.3 Policy Iteration (Exercise 4.7)
    python -m rlbox.run --testbed=car_rental_v2

4.4 Value Iteration (Example 4.3: Gambler's Problem)
    python -m rlbox.run --testbed=gambler.0.4

4.4 Value Iteration (Exercise 4.9)
    python -m rlbox.run --testbed=gambler.0.25

4.4 Value Iteration (Exercise 4.9)
    python -m rlbox.run --testbed=gambler.0.55

5.7 Off-policy Monte Carlo Control (Exercise 5.12)
    python -m rlbox.run --testbed=racetrack

6.4 Sarsa: On-policy TD Control (Exercise 6.9)
    python -m rlbox.run --testbed=gridworld.windy

6.4 Sarsa: On-policy TD Control (Exercise 6.10)
    python -m rlbox.run --testbed=gridworld.windy_stochastic

7.2 n-step Sarsa
    python -m rlbox.run --testbed=gridworld.NStepSarsa

8.2 Dyna: Integrated Planning, Acting, and Learning
    python -m rlbox.run --testbed=maze.DynaQ

8.3 When the Model Is Wrong (Exercise 8.4)
    python -m rlbox.run --testbed=maze.DynaQ+

10.1 Episodic Semi-gradient Control (Example 10.1: Mountain Car Task)
    python -m rlbox.run --testbed=mountain_car.SemiGradientSarsa

12.7 Sarsa(λ)
    python -m rlbox.run --testbed=mountain_car.TrueSarsaLambda

13.5 Actor–Critic Methods
    python -m rlbox.run --testbed=mountain_car.ActorCritic

Experiments

All timings were measured on a PC with an i7-4770 CPU @ 3.4 GHz, 16 GB RAM, and a GeForce GTX 660, running CPython.

Each entry lists the testbed, its environment and run configuration, and the execution time ("Exe time") in seconds.

narmedbandit.SampleAverage

  • N-Armed Bandit [steps=1000, arms=10, stationary=True]

  • Runs: 2000

  • (smpl_avg, epsilon: 0.0)

  • (smpl_avg, epsilon: 0.01)

  • (smpl_avg, epsilon: 0.1)

Exe time: 11 s
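
For context, the sample-average testbeds follow the incremental update Q(a) <- Q(a) + (R - Q(a)) / N(a) with epsilon-greedy action selection (Sutton & Barto, Section 2.4). A minimal illustrative sketch of one run, not the repository's implementation:

import numpy as np

def sample_average_run(steps=1000, arms=10, epsilon=0.1, seed=0):
    # One run of an epsilon-greedy sample-average bandit (illustrative sketch).
    rng = np.random.default_rng(seed)
    q_true = rng.normal(0.0, 1.0, arms)   # true action values (stationary testbed)
    Q = np.zeros(arms)                    # sample-average estimates
    N = np.zeros(arms)                    # action counts
    rewards = np.empty(steps)
    for t in range(steps):
        # Explore with probability epsilon, otherwise act greedily.
        a = rng.integers(arms) if rng.random() < epsilon else int(np.argmax(Q))
        r = rng.normal(q_true[a], 1.0)    # reward drawn around the true action value
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]         # incremental sample-average update
        rewards[t] = r
    return rewards

Averaging such runs over many randomly generated bandit problems (2000 in the testbed above) gives the learning curves the experiment plots.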

narmedbandit.WeightedAverage

  • N-Armed Bandit [steps: 10000, arms=10, stationary=False]

  • Runs: 2000

  • (smpl_avg, epsilon: 0.1)

  • (weight_avg, epsilon: 0.1, alpha: 0.2)

Exe time: 78 s

narmedbandit.OptInitVal

  • N-Armed Bandit [steps: 1000, arms=10, stationary=True]

  • Runs: 2000

  • (weight_avg, epsilon: 0.0, alpha: 0.1, bias: 5.0)

  • (weight_avg, epsilon: 0.1, alpha: 0.1, bias: 0.0)

Exe time: 7.51 s

narmedbandit.Ucb

  • N-Armed Bandit [steps: 1000, arms=10, stationary=True]

  • Runs: 2000

  • (smpl_avg, epsilon: 0.1)

  • (ucb, c: 2)

Exe time: 11.78 s

narmedbandit.Gradient

  • N-Armed Bandit [steps: 1000, arms=10, stationary=True, mean=4.0]

  • Runs: 2000

  • (gradient, alpha: 0.1, baseline: True)

  • (gradient, alpha: 0.4, baseline: True)

  • (gradient, alpha: 0.1, baseline: False)

  • (gradient, alpha: 0.4, baseline: False)

Exe time: 105 s

narmedbandit.ParamStudy

  • N-Armed Bandit [steps: 1000, arms=10, stationary=True]

  • Runs: 2000

  • (SMPL_AVG, epsilon: 1/128)

  • (SMPL_AVG, epsilon: 1/64)

  • (SMPL_AVG, epsilon: 1/32)

  • (SMPL_AVG, epsilon: 1/16)

  • (SMPL_AVG, epsilon: 1/8)

  • (SMPL_AVG, epsilon: 1/4)

  • (GRADIENT, alpha: 1/32)

  • (GRADIENT, alpha: 1/16)

  • (GRADIENT, alpha: 1/8)

  • (GRADIENT, alpha: 1/4)

  • (GRADIENT, alpha: 1/2)

  • (GRADIENT, alpha: 1)

  • (GRADIENT, alpha: 2)

  • (GRADIENT, alpha: 4)

  • (UCB, c: 1/16)

  • (UCB, c: 1/8)

  • (UCB, c: 1/4)

  • (UCB, c: 1/2)

  • (UCB, c: 1)

  • (UCB, c: 2)

  • (UCB, c: 4)

  • (WEIGHT_AVG, epsilon: 0.0, alpha: 0.1, bias: 1/4)

  • (WEIGHT_AVG, epsilon: 0.0, alpha: 0.1, bias: 1/2)

  • (WEIGHT_AVG, epsilon: 0.0, alpha: 0.1, bias: 1)

  • (WEIGHT_AVG, epsilon: 0.0, alpha: 0.1, bias: 2)

  • (WEIGHT_AVG, epsilon: 0.0, alpha: 0.1, bias: 4)

Exe time: 303 s

carrental.JackCarRentalV1

  • Jack’s Car Rental [max_move=5, max_cars=20, expct=[3, 4, 3, 2]]

  • gamma=0.9, epsilon=1.0

Exe time: 441 s (MDP generation) + 258 s (policy iteration)

carrental.JackCarRentalV2

  • Jack’s Car Rental [max_move=5, max_cars=20, expct=[3, 4, 3, 2], modified=True]

  • gamma=0.9, epsilon=1.0

Exe time: 440 s (MDP generation) + 219 s (policy iteration)

gambler.0.4

  • Gambler’s Problem [ph=0.4]

  • gamma=1.0, epsilon=1e-9

Exe time: 22 s

gambler.0.25

  • Gambler’s Problem [ph=0.25]

  • gamma=1.0, epsilon=1e-9

Exe time: 16 s

gambler.0.55

  • Gambler’s Problem [ph=0.55]

  • gamma=1.0, epsilon=0.01

Exe time: 11 s
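
The gambler testbeds above run value iteration on the gambler's problem, where V(s) is the probability of reaching the goal from capital s and sweeps continue until the largest value change falls below epsilon. A compact illustrative sketch of that algorithm, not the repository's code:

import numpy as np

def gambler_value_iteration(ph=0.4, goal=100, epsilon=1e-9):
    # V[s] = probability of reaching `goal` from capital s when heads occur with probability ph.
    V = np.zeros(goal + 1)
    V[goal] = 1.0
    while True:
        delta = 0.0
        for s in range(1, goal):
            # Stakes are limited by the current capital and the distance to the goal.
            stakes = range(1, min(s, goal - s) + 1)
            best = max(ph * V[s + a] + (1.0 - ph) * V[s - a] for a in stakes)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < epsilon:
            break
    return V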

racetrack

  • RaceTrack [steps=10000]

  • Runs: 50000

  • gamma=1.0

Exe time: 1091 s (episode generation) + 187 s (off-policy Monte Carlo learning)

gridworld.windy

  • WindyGridWorld [stochastic=False]

  • Runs: 200

  • gamma=1.0, alpha=0.5, epsilon=0.1

Exe time: 0.05 s

gridworld.windy_stochastic

  • WindyGridWorld [stochastic=True]

  • Runs: 200

  • gamma=1.0, alpha=0.5, epsilon=0.1

Exe time: 0.33 s

gridworld.NStepSarsa

  • WindyGridWorld [stochastic=False]

  • Runs: 200

  • n=3, gamma=1.0, alpha=0.5, epsilon=0.1

Exe time: 0.32 s
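
The windy-gridworld testbeds above are built on the one-step Sarsa update Q(S,A) <- Q(S,A) + alpha * [R + gamma * Q(S',A') - Q(S,A)]; the NStepSarsa testbed replaces the one-step return with an n-step return. A minimal sketch of the one-step case; the env interface (reset, step, actions) is an assumption, not the repository's API:

import random
from collections import defaultdict

def sarsa(env, episodes=200, gamma=1.0, alpha=0.5, epsilon=0.1):
    # Tabular one-step Sarsa (illustrative sketch). `env` is assumed to expose
    # reset() -> state, step(action) -> (next_state, reward, done), and actions (a list).
    Q = defaultdict(float)

    def policy(s):
        # Epsilon-greedy with respect to the current Q estimates.
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = policy(s)
        done = False
        while not done:
            s2, r, done = env.step(a)
            a2 = policy(s2)
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q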

maze.DynaQ

  • Maze(maze_type=0)

  • Runs: 30

  • DYNA_Q, n=50, gamma=0.95, alpha=0.1, epsilon=0.1, episodes=50

  • DYNA_Q, n=5, gamma=0.95, alpha=0.1, epsilon=0.1, episodes=50

  • DYNA_Q, n=0, gamma=0.95, alpha=0.1, epsilon=0.1, episodes=50

Exe time: 18 s

maze.DynaQ+

  • Maze(maze_type=1)

  • Runs: 30

  • DYNA_Q, n=10, gamma=0.95, alpha=1.0, epsilon=0.1, episodes=50, kappa=0, steps=3000

  • DYNA_Q, n=10, gamma=0.95, alpha=1.0, epsilon=0.1, episodes=50, kappa=1e-4, steps=3000

  • DYNA_Q_V2, n=10, gamma=0.95, alpha=1.0, epsilon=0.1, episodes=50, kappa=1e-4, steps=3000

Exe time: 29 s
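
The maze testbeds are variants of tabular Dyna-Q: each real step is followed by n planning updates replayed from a learned deterministic model (Dyna-Q+ additionally adds an exploration bonus of kappa * sqrt(tau) to planning rewards for long-untried actions). A minimal sketch of the basic planning loop, reusing the same assumed env interface as the Sarsa sketch above; it is not the repository's implementation:

import random
from collections import defaultdict

def dyna_q(env, episodes=50, n_planning=10, gamma=0.95, alpha=0.1, epsilon=0.1):
    # Tabular Dyna-Q (illustrative sketch, not the repository's code).
    Q = defaultdict(float)
    model = {}  # (state, action) -> (reward, next_state) for visited pairs

    def policy(s):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s2, r, done = env.step(a)
            # Direct reinforcement learning: one-step Q-learning update.
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in env.actions) - Q[(s, a)])
            # Model learning: assume a deterministic environment.
            model[(s, a)] = (r, s2)
            # Planning: n simulated updates from previously observed transitions.
            for _ in range(n_planning):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in env.actions) - Q[(ps, pa)])
            s = s2
    return Q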

mountain_car.SemiGradientSarsa

  • MountainCar()

  • Runs: 10

  • SEMIGRADIENT_SARSA, gamma=1.0, alpha=0.5, epsilon=0.0, episodes=500

Exe time: 52 s

mountain_car.TrueSarsaLambda

  • MountainCar()

  • Runs: 10

  • TRUE_SARSA_LAMBDA, gamma=1.0, alpha=0.5, epsilon=0.0, lmbda=0.9, episodes=500

Exe time: 51 s

mountain_car.ActorCriticLambda

  • MountainCar()

  • Runs: 10

  • ACTOR_CRITIC, gamma=1.0, alpha_w=0.2, alpha_theta=0.01, lambda_w=0.9, lambda_theta=0.9, episodes=500

Exe time: 115 s

Bibliography

  • Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction. 2nd ed. Cambridge, MA: MIT Press.

License: MIT License

