xiaojianzhang / DeepRL

Highly modularized implementation of popular deep RL algorithms in PyTorch

DeepRL

Highly modularized implementation of popular deep RL algorithms in PyTorch. My principle here is to reuse as many components as possible across different algorithms, use as few tricks as possible, and switch easily between classical control tasks like CartPole and Atari games with raw pixel inputs.

Implemented algorithms:

  • Deep Q-Learning (DQN)
  • Double DQN
  • Dueling DQN
  • (Async) Advantage Actor Critic (A3C / A2C)
  • Async One-Step Q-Learning
  • Async One-Step Sarsa
  • Async N-Step Q-Learning
  • Continuous A3C
  • Distributed Deep Deterministic Policy Gradient (Distributed DDPG, aka D3PG)
  • Parallelized Proximal Policy Optimization (P3O, similar to DPPO)
  • Action Conditional Video Prediction

Curves

Curves for CartPole are trivial, so I didn't include them here. No random seed is fixed.

DQN, Double DQN, Dueling DQN


The network and parameters here are exactly the same as in the DeepMind Nature paper. The training curve is smoothed with a window of size 100. All models are trained on a server with a Xeon E5-2620 v3 and a Titan X. For Breakout, a test is triggered every 1000 episodes with 50 repetitions; in total, 16M frames took about 4 days and 10 hours. For Pong, a test is triggered every 10 episodes with no repetition; in total, 4M frames took about 18 hours.
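For reference, here is a minimal PyTorch sketch of the dueling head and the Double DQN target on top of the Nature-paper convolutional body. The names (`DuelingDQN`, `online_net`, `target_net`) are illustrative, not the exact code in this repo.

```python
import torch
import torch.nn as nn

class DuelingDQN(nn.Module):
    """Nature-paper conv body with a dueling value/advantage head (sketch)."""
    def __init__(self, n_actions):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
        )
        self.value = nn.Linear(512, 1)               # state value V(s)
        self.advantage = nn.Linear(512, n_actions)   # advantages A(s, a)

    def forward(self, x):
        phi = self.body(x / 255.0)
        v, a = self.value(phi), self.advantage(phi)
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)
        return v + a - a.mean(dim=1, keepdim=True)

def double_dqn_target(online_net, target_net, reward, next_state, done, gamma=0.99):
    """Double DQN target: select the argmax action with the online net,
    evaluate it with the target net."""
    with torch.no_grad():
        best_action = online_net(next_state).argmax(dim=1, keepdim=True)
        next_q = target_net(next_state).gather(1, best_action).squeeze(1)
        return reward + gamma * (1 - done) * next_q
```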

I referred to this repo.

Discrete A3C


The network I used here is a smaller one with only 42 * 42 input; although the DQN network would also work here, it is quite slow.
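A minimal sketch of what such a small 42 * 42 actor-critic network can look like; the exact layer sizes and activations below are assumptions, not necessarily what this repo uses.

```python
import torch.nn as nn
import torch.nn.functional as F

class SmallA3CNet(nn.Module):
    """Small actor-critic network for 42 * 42 grayscale frames (sketch)."""
    def __init__(self, n_actions):
        super().__init__()
        # Four 3x3 stride-2 convolutions shrink 42x42 down to 3x3.
        self.convs = nn.ModuleList(
            [nn.Conv2d(1 if i == 0 else 32, 32, 3, stride=2, padding=1) for i in range(4)]
        )
        self.fc = nn.Linear(32 * 3 * 3, 256)
        self.policy = nn.Linear(256, n_actions)  # action logits
        self.value = nn.Linear(256, 1)           # state-value estimate

    def forward(self, x):
        for conv in self.convs:
            x = F.elu(conv(x))
        x = F.relu(self.fc(x.flatten(start_dim=1)))
        return self.policy(x), self.value(x)
```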

Training A3C took about 2 hours (16 processes) on a server with two Xeon E5-2620 v3 CPUs, while the other async methods took about 1 day. Those value-based async methods do work, but I don't know how to make them stable. This is the test curve; a test is triggered in a separate deterministic test process every 50K frames.

I referred to this repo for the parallelization.

Continuous A3C


For continuous A3C and DPPO, I use a fixed unit variance rather than a separate head, so the entropy weight is simply set to 0. Of course you can also use another head to output the variance; in that case, a good practice is to bound the mean while leaving the variance unbounded, which is also included in the implementation.
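A minimal sketch of such a Gaussian head, with a tanh-bounded mean, unit variance by default, and an optional separate head for an unbounded log standard deviation; the names are illustrative.

```python
import torch
import torch.nn as nn

class GaussianPolicyHead(nn.Module):
    """Gaussian policy head with a bounded mean (sketch).

    With the variance fixed to 1 the policy entropy is constant,
    so the entropy bonus can simply be weighted by 0."""
    def __init__(self, feature_dim, action_dim, learn_std=False):
        super().__init__()
        self.mean = nn.Linear(feature_dim, action_dim)
        # Optional separate head producing an unbounded log std.
        self.log_std = nn.Linear(feature_dim, action_dim) if learn_std else None

    def forward(self, phi):
        mean = torch.tanh(self.mean(phi))   # bound the mean to [-1, 1]
        if self.log_std is None:
            std = torch.ones_like(mean)     # fixed unit variance
        else:
            std = self.log_std(phi).exp()   # variance left unbounded
        return torch.distributions.Normal(mean, std)
```

The returned distribution gives `log_prob(a).sum(-1)` for the policy-gradient loss and `sample()` for acting.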

D3PG


Extra caution is necessary when computing gradients. The repo I referred to for DDPG computes the deterministic policy gradient incorrectly, at least at this commit. Theoretically I believe that implementation should work, but in practice it doesn't. Even though this is PyTorch, you still need to handle the gradients yourself in this case. DDPG is not very stable.
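One way to keep the deterministic policy gradient correct in PyTorch is to let autograd chain dQ/da through the critic by maximizing Q(s, mu(s)) with respect to the actor parameters. A minimal sketch, with illustrative names (`actor`, `critic`, `actor_opt`, `states`), not the exact code in this repo:

```python
import torch

def actor_update(actor, critic, actor_opt, states):
    """Deterministic policy gradient step: maximize Q(s, mu(s)) over actor params."""
    actions = actor(states)                       # a = mu(s), kept differentiable
    actor_loss = -critic(states, actions).mean()  # ascend on Q by descending on -Q
    actor_opt.zero_grad()
    actor_loss.backward()  # autograd chains dQ/da into the actor parameters;
                           # the critic also accumulates grads here, so zero them
                           # before the critic's own update
    actor_opt.step()
```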

Setting the number of workers to 1 reduces the implementation to plain DDPG. I had to adopt the most straightforward distribution method, as P3O- and A3C-style distribution doesn't work for DDPG. The figures were produced with 6 workers.

P3O


The differences between my implementation and DeepMind's DPPO are:

  1. PPO stands for different algorithms.
  2. I use a much simpler A3C-like synchronization protocol.

The body of PPO is based on this repo. However, that implementation has two critical bugs, at least at this commit: its computation of the clipped loss happens to be correct for one-dimensional actions but is wrong for high-dimensional actions, and its computation of the entropy is wrong in any case.
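For reference, a minimal sketch of a clipped surrogate loss and entropy that stay correct for high-dimensional diagonal-Gaussian actions; the function and argument names are illustrative.

```python
import math
import torch

def ppo_losses(new_mean, new_log_std, old_log_prob, actions, advantages, clip=0.2):
    """Clipped surrogate and entropy for a diagonal Gaussian policy (sketch).

    Log-probabilities are summed over the action dimensions *before* the
    ratio is formed, which is exactly where one-dimensional code silently
    breaks for high-dimensional actions."""
    dist = torch.distributions.Normal(new_mean, new_log_std.exp())
    new_log_prob = dist.log_prob(actions).sum(dim=-1)   # joint log-prob per sample
    ratio = (new_log_prob - old_log_prob).exp()
    surrogate = torch.min(ratio * advantages,
                          ratio.clamp(1 - clip, 1 + clip) * advantages)
    policy_loss = -surrogate.mean()
    # Entropy of a diagonal Gaussian: 0.5 * d * (1 + log(2*pi)) + sum_i log(sigma_i)
    entropy = (0.5 * (1 + math.log(2 * math.pi)) + new_log_std).sum(dim=-1).mean()
    return policy_loss, entropy
```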

I use 8 threads and a network with two tanh hidden layers, each with 64 units.

Action Conditional Video Prediction


Left: one-step prediction. Right: ground truth.

The prediction is sampled after 110K iterations; I only implemented one-step training.
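A minimal sketch of what the one-step objective amounts to: predict the next frame from the recent frames and the chosen action, and regress onto the ground-truth next frame with a pixel-wise loss. The `model(frames, action)` signature is hypothetical.

```python
import torch.nn.functional as F

def one_step_loss(model, frames, action, next_frame):
    """One-step training for action-conditional video prediction (sketch)."""
    predicted = model(frames, action)         # hypothetical encoder-action-decoder model
    return F.mse_loss(predicted, next_frame)  # L2 pixel reconstruction loss
```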

Dependency

Tested on macOS 10.12 and CentOS 6.8

Usage

dataset.py: generate the dataset for action-conditional video prediction

main.py: entry point for all other algorithms

References

License

Apache License 2.0

