ivankunyankin / imitation_learning

PyTorch implementation of some reinforcement learning algorithms: A2C, PPO, Behavioral Cloning from Observation (BCO), GAIL.

PyTorch Reinforcement and Imitation Learning

This repository contains a parallel PyTorch implementation of several Reinforcement and Imitation Learning algorithms: A2C, PPO, BCO, GAIL, V-trace. Short description:

  • Advantage Actor Critic (A2C) - a synchronous variant of A3C
  • Proximal Policy Optimization (PPO) - one of the most popular RL algorithms; see PPO, Truly PPO, Implementation Matters, A Large-Scale Empirical Study of PPO
  • Behavioral Cloning from Observation (BCO) - a technique to clone expert behaviour into an agent using only expert states; see BCO
  • Generative Adversarial Imitation Learning (GAIL) - an algorithm to mimic an expert policy using a discriminator as the reward model; see GAIL

Each algorithm supports vector/image/dict observation spaces and discrete/continuous/tuple action spaces. Data gathering and training on the gathered data are controlled by separate processes; the parallelism scheme is described in cherry_rl/algorithms/parallel/readme.md. The code is written with a focus on on-policy algorithms; recurrent policies are also supported.
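
The toy sketch below only illustrates the idea of decoupling rollout gathering from optimization with torch.multiprocessing; the function and queue names are my own assumptions for illustration, not the repo's actual parallel modules.

import torch
import torch.multiprocessing as mp


def gather_rollouts(queue, n_rollouts, rollout_len, obs_dim):
    # Stand-in for environment interaction: push fake rollouts to the trainer.
    for _ in range(n_rollouts):
        queue.put({
            'observations': torch.randn(rollout_len, obs_dim),
            'actions': torch.randint(0, 2, (rollout_len,)),
            'rewards': torch.randn(rollout_len),
        })
    queue.put(None)  # sentinel: no more data


if __name__ == '__main__':
    mp.set_start_method('spawn')
    queue = mp.Queue(maxsize=4)
    worker = mp.Process(target=gather_rollouts, args=(queue, 10, 128, 4))
    worker.start()
    while (rollout := queue.get()) is not None:
        # an optimization step on the received rollout would go here
        print('received rollout of', rollout['observations'].shape[0], 'steps')
    worker.join()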

Current Functionality

Each algorithm supports discrete (Categorical, Bernoulli, GumbelSoftmax) and continuous (Beta, Normal, tanh(Normal)) policy distributions, and there is an additional 'Tuple' distribution which can be used to mix the distributions above. For continuous action spaces the Beta distribution worked best in my experiments (tested on the BipedalWalker and Humanoid environments).
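
For reference, a minimal Beta policy head for a continuous action space could look like the sketch below in plain PyTorch. This is an illustration only, not the repo's distribution implementation; note that actions sampled from Beta live in (0, 1) and must be rescaled to the environment's action bounds.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Beta


class BetaPolicyHead(nn.Module):
    # Maps an embedding to per-dimension Beta(alpha, beta) parameters.
    def __init__(self, embedding_dim, action_dim):
        super().__init__()
        self.alpha_beta = nn.Linear(embedding_dim, 2 * action_dim)

    def forward(self, embedding):
        # softplus + 1 keeps both concentration parameters > 1,
        # which makes each per-dimension density unimodal.
        params = F.softplus(self.alpha_beta(embedding)) + 1.0
        alpha, beta = params.chunk(2, dim=-1)
        return Beta(alpha, beta)


head = BetaPolicyHead(embedding_dim=64, action_dim=4)
dist = head(torch.randn(8, 64))
actions = dist.sample()              # in (0, 1); rescale to env bounds, e.g. 2 * a - 1
log_prob = dist.log_prob(actions).sum(-1)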

Environments with vector, image or dict observation spaces are supported. Recurrent policies are supported.

Several return-estimation algorithms are supported: 1-step, n-step, GAE and V-Trace (introduced in the IMPALA paper).
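
For illustration, a minimal GAE(lambda) estimator might look like the following generic sketch (not the repo's returns_estimator module):

import torch


def gae(rewards, values, last_value, not_done, gamma=0.99, lam=0.95):
    # rewards, values, not_done: tensors of shape (T,); last_value: bootstrap V(s_T).
    # Returns advantages A_t and value targets R_t = A_t + V(s_t).
    advantages = torch.zeros_like(rewards)
    next_value, gae_t = last_value, 0.0
    for t in reversed(range(rewards.shape[0])):
        delta = rewards[t] + gamma * next_value * not_done[t] - values[t]
        gae_t = delta + gamma * lam * not_done[t] * gae_t
        advantages[t] = gae_t
        next_value = values[t]
    return advantages, advantages + values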

As found in the Implementation Matters paper, PPO works largely because of "code-level" optimizations. Most of them are implemented here:

  • Value function clipping (in my experiments training works better without it)
  • Observation normalization & clipping
  • Reward normalization/scaling & clipping
  • Orthogonal initialization of neural network weights
  • Gradient clipping
  • Learning rate annealing (will be added... sometime)

In addition, I implemented the roll-back loss from the Truly PPO paper, which works well.
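
A compact sketch of the clipped surrogate with the roll-back term, as I read the Truly PPO paper, is shown below; with rollback_alpha = 0 it reduces to the standard PPO clipped loss. The exact coefficients and names are my own assumptions, not a copy of the repo's ppo.py.

import torch


def ppo_rollback_loss(log_prob, old_log_prob, advantage, clip_eps=0.2, rollback_alpha=0.05):
    # Clipped surrogate objective with the roll-back modification:
    # outside the clip range the (otherwise flat) objective gets a small
    # negative slope -rollback_alpha, pushing the ratio back toward 1.
    ratio = torch.exp(log_prob - old_log_prob)
    surrogate = ratio * advantage
    rollback = (
        -rollback_alpha * ratio
        + (1.0 + rollback_alpha) * torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    ) * advantage
    return -torch.min(surrogate, rollback).mean()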

How to use

Clone the repo and install the Python module:

git clone https://github.com/CherryPieSexy/imitation_learning.git
cd imitation_learning/
pip install -e .

Training example

Each experiment is described by a config file; have a look at the annotated config for an example. To run an experiment, execute the command:

python configs/cart_pole/cart_pole_ppo_annotated.py

Training results (including the training config, tensorboard logs and model checkpoints) will be saved in the log_dir folder.

Obtained policy:

cartpole

Testing example

The results of a trained policy can be shown with the cherry_rl/test.py script. To run it from any folder, execute:

python -m cherry_rl.test -f ${PATH_TO_LOG_DIR} -p ${CHECKPOINT_ID}

This script is able to:

  • just show how the policy acts in the environment
  • measure the mean reward and episode length over a requested number of episodes
  • record a demo file with trajectories

Execute python -m cherry_rl.test -h to see a detailed description of the available arguments.

Code structure

.
├── cherry_rl                        # folder with code
    ├── algorithms                      # algorithmic part of code
        ├── nn                          # folder with neural networks definitions.
            ├── agent_model.py          # special module for agent.
            └── ...                     # various nn models: actor-critics, convolutional & recurrent encoders.
        ├── optimizers                  # folder with RL optimizers; each builds on the base optimizer below.
            ├── model_optimizer.py      # base optimizer for all models.
            ├── actor_critic_optimizer.py
            └── ...                     # core algorithms: a2c.py, ppo.py, bco.py
        ├── parallel
            ├── readme.md               # description of used parallelism scheme
            └── ...                     # modules responsible for parallel rollout gathering and training.
        ├── returns_estimator.py        # special module for estimating returns. Supported estimators: 1-step, n-step, GAE, V-Trace.
        └── ...                         # all other algorithmic modules that do not fit in any other folder.
    ├── utils
        ├── vec_env.py                  # vector env (copy of OpenAI code, but w/o automatic resetting)
        └── ...                         # environment wrappers and other utils.
    └── test.py                         # script for watching trained agent and recording demo.
├── configs                             # subfolder name = environment, script name = algo
    ├── cart_pole
        ├── cart_pole_demo_10_ep.pickle  # demo file for training BCO or GAIL
        ├── cart_pole_a2c.py
        ├── cart_pole_ppo.py
        ├── cart_pole_ppo_gru.py        # recurrent policy
        ├── cart_pole_ppo_annotated.py  # ppo training script with comments
        ├── cart_pole_bco.py
        └── cart_pole_gail.py
    ├── bipedal                         # folder with similar scripts as cart_pole
    ├── humanoid
    └── car_racing

Modular neural network definition

Each agent has optional make_obs_encoder and obs_normalizer_size arguments. The observation encoder is a neural network (i.e. an nn.Module) applied directly to the observation, which is typically an image. The observation normalizer is a running mean-variance estimator that standardizes observations; it is applied before the encoder. Most of the time the actor-critic trains better on such zero-mean, unit-variance observations or embeddings.
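
A minimal running mean-variance normalizer could look like the sketch below (a generic parallel-variance update, not the repo's observation normalizer implementation):

import torch


class RunningMeanVar:
    # Tracks a running mean/variance of observations and standardizes them.
    def __init__(self, size, clip=10.0, eps=1e-8):
        self.mean = torch.zeros(size)
        self.var = torch.ones(size)
        self.count = eps
        self.clip = clip

    def update(self, batch):
        # batch: (batch_size, size); combine batch statistics with running ones.
        batch_mean, batch_var = batch.mean(0), batch.var(0, unbiased=False)
        batch_count = batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        new_var = (
            self.var * self.count + batch_var * batch_count
            + delta ** 2 * self.count * batch_count / total
        ) / total
        self.mean = self.mean + delta * batch_count / total
        self.var, self.count = new_var, total

    def __call__(self, obs):
        normalized = (obs - self.mean) / torch.sqrt(self.var + 1e-8)
        return torch.clamp(normalized, -self.clip, self.clip)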

To train your own neural network architecture, simply import or define it in a config, initialize it in the make_ac_model function, and pass it as the make_actor_critic argument to AgentModel.
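
As a rough illustration, such a custom network for a vector observation space and discrete actions might be a plain nn.Module like the one below; how exactly it is wrapped for make_actor_critic follows the repo's config conventions (see the annotated config), so treat the wiring as an assumption.

import torch
import torch.nn as nn


class MLPActorCritic(nn.Module):
    # Shared trunk with separate policy-logits and value heads (discrete actions).
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.policy_head = nn.Linear(hidden, n_actions)
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, observation):
        features = self.trunk(observation)
        return self.policy_head(features), self.value_head(features).squeeze(-1)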

Trained environments

GIFs of some of the results:

BipedalWalker-v3: mean reward ~333, 0 fails over 1000 episodes, config.

bipedal

Humanoid-v3: mean reward ~11.3k, 14 fails over 1000 episodes, config.

humanoid

The Humanoid experiments were done in MuJoCo v2, which has an integration bug that makes the environment easier. For academic purposes it is more correct to use MuJoCo v1.5.

CarRacing-v0: mean reward = 894 ± 32, 26 fails over 100 episodes (episode is considered failed if reward < 900), config.

car_racing

Further plans

  • Try the Motion Imitation algorithm from the DeepMimic paper
  • Add a self-play trainer with PPO as the backbone algorithm
  • ...
