Caffe2 Reinforcement Learning Models

A set of reinforcement learning reference implementations built on caffe2.

How would you teach a robot to balance a pole? Or safely land a space ship? Or even to walk?

Using reinforcement learning (RL), you wouldn't have to teach it how to do any of these things: only what to do. RL formalizes our intuitions about trial and error – agents take actions, experience feedback, and adjust their behavior accordingly.

An agent may start with awful performance: the cart drops the pole immediately; when the space ship careens left, it tilts further; the walker can't take one step without falling. But with experience from exploration and failure, it learns. Soon enough, the agent is behaving in a way you never explicitly told it to, and is achieving the goals you implicitly set forth. It takes actions that optimize for the reward system you designed, often coming up with solutions and employing strategies you hadn't thought of.

While historically, RL has been primarily used in the context of robotics and game-playing, it can be employed in a variety of problem spaces. At Facebook, we're working on using RL at scale: suggesting people you may know, notifying you about page updates, personalizing our video bitrate serving, and more.

Advances in RL theory, including the advent of Deep Q-Networks and deep actor-critic models, allow us to use function approximation to approach problems with large state and action spaces. This project, called BlueWhale, contains deep RL implementations built on caffe2. We provide support for running them inside OpenAI Gym.

Requirements

Recommended: Anaconda

For Mac users, we recommend using Anaconda instead of the system installation of Python. Install Anaconda and verify you are using Anaconda's version of Python before installing the other dependencies: which python should print an Anaconda path.

Caffe2

BlueWhale runs on any platform that supports caffe2. To install caffe2, follow this tutorial: Installing Caffe2.

You may need to override caffe2's cmake defaults to use Homebrew's protoc instead of Anaconda's protoc, and to use Anaconda's Python instead of the system Python.

Thrift

Thrift is Facebook's RPC framework.

brew install thrift

OpenAI Gym

Running models in OpenAI Gym environments requires platforms with OpenAI Gym support. Windows support for OpenAI Gym is being tracked here.

OpenAI Gym can be installed using pip, which comes with your Python installation on Linux and with Anaconda on macOS. To install the basic environments (classic control, toy text, and algorithmic), run:

pip install gym

To install all environments, run this instead:

pip install "gym[all]"

Installation and Setup

Clone from source:

git clone https://github.com/caffe2/BlueWhale

To generate the Python code for our Thrift definitions, run the following from within the root directory:

thrift --gen py --out . ml/rl/thrift/core.thrift

To access caffe2 and import our modules:

export PYTHONPATH=/usr/local:$PYTHONPATH

Running Unit Tests

From within the root directory, run all of our unit tests with:

python -m unittest discover

To run a specific unit test:

python -m unittest <path/to/unit_test.py>

Running Models in OpenAI Gym

You can run RL models of your specification on OpenAI Gym environments of your choice. Right now, we only support environments that supply Box(x, ) or Box(x, y, z) (image) state representations and require Discrete(y) action inputs.
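
If you're not sure whether an environment qualifies, you can inspect its spaces directly. A minimal sketch using the classic gym API (the environment name here is just an example):

import gym

# Inspect an environment's state and action spaces to check compatibility.
env = gym.make("CartPole-v0")

print(env.observation_space)  # e.g. Box(4,): a vector state representation, supported
print(env.action_space)       # e.g. Discrete(2): discrete action inputs, supported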

Quickstart

python ml/rl/test/gym/run_gym.py -p ml/rl/test/gym/maxq_cartpole_v0.json

The run_gym.py script will construct an RL model and run it in an OpenAI Gym environment, periodically reporting scores averaged over several trials. In general, you can run RL models in OpenAI Gym environments with:

python ml/rl/test/gym/run_gym.py -p <parameters_file> [-s <score_bar>] [-g <gpu_id>]
  • parameters_file: Path to your JSON parameters file
  • score_bar (optional): Scalar score you hope to achieve. Once your model's average score over 100 test trials reaches score_bar, training will stop and the script will exit. If left empty, training will continue until data has been collected from num_episodes episodes (see details on parameters in the next section)
  • gpu_id (optional): If set to your machine's GPU id (typically 0), the model will run its training and inference on your GPU. Otherwise, it will use your CPU
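
For intuition, the outer loop of such a script looks roughly like the sketch below. This is only an illustration using the classic gym API, with a random policy standing in for the trained model; it is not the actual run_gym.py implementation:

import gym

# Rough sketch of the episode loop and score reporting that a script like
# run_gym.py performs; a random policy stands in for the RL model here.
env = gym.make("CartPole-v0")
num_episodes = 100
scores = []

for episode in range(num_episodes):
    state = env.reset()
    done = False
    episode_reward = 0.0
    while not done:
        action = env.action_space.sample()  # the real script queries the RL model here
        state, reward, done, _ = env.step(action)
        episode_reward += reward
    scores.append(episode_reward)

print("Average score over %d episodes: %.2f" % (num_episodes, sum(scores) / len(scores)))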

Feel free to create your own parameter files to select different environments and change model parameters. The success criteria for different environments can be found here. We currently supply default parameter files for several environments; see the JSON files under ml/rl/test/gym/ (for example, maxq_cartpole_v0.json for CartPole-v0).

Feel free to try out image-based environments too! The parameters we supply for those will get you a model that runs and trains quickly, not one that performs well.

Modifying the parameters file

As an example, the CartPole-v0 default parameter file we supply specifies an RL model whose backing neural net has 5 layers:

{
    "env": "CartPole-v0",
    "rl": {
        "reward_discount_factor": 0.99,
        "target_update_rate": 0.1,
        "reward_burnin": 10,
        "maxq_learning": 1,
        "epsilon": 0.2
    },
    "training": {
        "layers": [-1, 256, 128, 64, -1],
        "activations": ["relu", "relu", "relu", "linear"],
        "minibatch_size": 128,
        "learning_rate": 0.005,
        "optimizer": "ADAM",
        "learning_rate_decay": 0.999
    },
    "run_details": {
        "num_episodes": 301,
        "train_every": 10,
        "train_after": 10,
        "test_every": 100,
        "test_after": 10,
        "num_train_batches": 100,
        "train_batch_size": 1024,
        "avg_over_num_episodes": 100,
        "render": 0,
        "render_every": 100
    }
}

You can supply a different JSON parameter file, modifying the fields to your liking.

  • env: The OpenAI gym environment to use
  • rl
    • reward_discount_factor: The discount factor applied to future rewards when constructing target Q-values
    • target_update_rate: A measure of how quickly the model's target network updates
    • reward_burnin: The iteration after which to use the model's target network to construct target values
    • maxq_learning: 1 for Q-learning, 0 for SARSA
    • epsilon: Fraction of the time the agent should select a random action during training
  • training
    • layers: An array whose ith entry specifies the number of nodes in the ith layer of the neural net. Use -1 for the input and output layers; our models will fill in the appropriate values based on your choice of environment (see the sketch after this list)
    • activations: An array whose ith entry specifies the activation function to use between the ith and (i+1)th layers. Valid choices are "linear" and "relu". Note that this array should have one fewer entry than layers
    • minibatch_size: The number of transitions to train the neural net on at a time. This will not affect the total number of datapoints supplied. In general, lower/higher minibatch sizes perform better with lower/higher learning rates
    • learning_rate: Learning rate for the neural net
    • optimizer: Neural net weight update algorithm. Valid choices are "SGD", "ADAM", "ADAGRAD", and "FTRL"
    • learning_rate_decay: Factor by which the learning rate decreases after each training minibatch
  • run_details (reading the code that uses these might be helpful: run_gym.py)
    • num_episodes: Number of episodes to run the model and collect new data over
    • train_every: Number of episodes between each training cycle
    • train_after: Number of episodes after which to enable training
    • test_every: Number of episodes between each test cycle
    • test_after: Number of episodes after which to enable testing
    • num_train_batches: Number of batches to train over each training cycle
    • train_batch_size: Number of transitions to include in each training batch. Note that these will each be further broken down into minibatches of size minibatch_size
    • avg_over_num_episodes: Number of episodes to run every test cycle. After each cycle, the script will report an average over the scores of the episodes run within it. The typical choice is 100, but this should be set according to the success criteria for your environment
    • render: Whether or not to render the OpenAI environment in training and testing episodes. Note that some environments don't support rendering
    • render_every: Number of episodes between each rendered episode
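
To illustrate how the -1 placeholders in layers are resolved: CartPole-v0 supplies a 4-dimensional state and two discrete actions, so the network above ends up with a 4-node input layer and a 2-node output layer (one Q-value per action). A small sketch in plain Python, not the project's actual code:

# Fill in the -1 placeholders from the "layers" array for CartPole-v0,
# whose state is a 4-vector and whose action space is Discrete(2).
layers = [-1, 256, 128, 64, -1]
activations = ["relu", "relu", "relu", "linear"]

state_dim, num_actions = 4, 2
layers[0], layers[-1] = state_dim, num_actions

assert len(activations) == len(layers) - 1  # one activation between each pair of layers
for i, activation in enumerate(activations):
    print("%d -> %d nodes, activation: %s" % (layers[i], layers[i + 1], activation))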

Supported Models

We use Deep Q-Network (DQN) implementations for our models. See dqn-Atari by DeepMind.

  1. Max-Q-Learning (as demonstrated in the paper):
    • input: state s, action a
    • output: scalar Q(s, a)
    • update target on transition {state, action, reward, next_state, next_action}:
      • Q_target(state, action) = reward + reward_discount_factor * max_{possible_next_action} Q(next_state, possible_next_action)
  2. SARSA:
    • input: state s, action a
    • output: scalar Q(s, a)
    • update target on transition {state, action, reward, next_state, next_action}:
      • Q_target(state, action) = reward + reward_discount_factor * Q(next_state, next_action)

Both of these accept discrete and parametric action inputs.
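
To make the difference concrete, here is a small numeric sketch of the two update targets on a single transition, using made-up Q-values for the next state (plain Python, not the project's code):

# Compare the Max-Q and SARSA update targets on one transition.
reward_discount_factor = 0.99
reward = 1.0
next_action = 0                    # the action the agent actually took next (used by SARSA)
q_next = {0: 3.2, 1: 4.1}          # hypothetical Q(next_state, a) for each possible action

maxq_target = reward + reward_discount_factor * max(q_next.values())
sarsa_target = reward + reward_discount_factor * q_next[next_action]

print("Max-Q target:", maxq_target)   # 1.0 + 0.99 * 4.1, roughly 5.06
print("SARSA target:", sarsa_target)  # 1.0 + 0.99 * 3.2, roughly 4.17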

Contact Us

If you identify any issues or have feedback, please file an issue.

Otherwise feel free to contact jjg@fb.com or nishadsingh@fb.com with questions.

License

BlueWhale is BSD-licensed. We also provide an additional patent grant.
