andrinbuerli / rl-continuous-control

Continuous Control

Introduction

In this project, an agent is trained to interact with two different environments, both of which have a continuous action space.

Reacher

The first environment is called Reacher. The task is to control a double-jointed arm so that it moves to target locations. A reward of +0.1 is provided for each step that the agent's hand is in the goal location, so the goal of the agent is to maintain its position at the target location for as many time steps as possible. There are 20 identical agents, each with its own copy of the environment, interacting independently and synchronously.

Initial (random) policy vs. learned policy (animations).

The observation space consists of 33 variables corresponding to position, rotation, velocity, and angular velocities of the arm. Each action is a vector with four numbers, corresponding to torque applicable to two joints. Every entry in the action vector should be a number between -1 and 1. The environment is considered solved if a mean reward of 35 is reached over 100 consecutive episodes.
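As an illustration, here is a minimal sketch (not the project's training code) of sampling and clipping a batch of actions for the 20 parallel agents; the shapes follow the description above, and the Gaussian sampling is only a placeholder for whatever policy produces the actions.

    import numpy as np

    num_agents, action_size = 20, 4                      # 20 parallel arms, 4 torque values each
    actions = np.random.randn(num_agents, action_size)   # e.g. drawn from a Gaussian policy
    actions = np.clip(actions, -1.0, 1.0)                # every entry must lie in [-1, 1]
    # `actions` can now be passed to the environment's step function.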

Crawler

The second environment is called Crawler. In this continuous control environment, the goal is to teach a four-legged creature to walk forward without falling. The reward function is based on the following criteria:

  • Body velocity matching the goal velocity (normalized to (0, 1))
  • Head direction alignment with the goal direction (normalized to (0, 1))

The total reward is the product of these components, which encourages the agent to maximize all of them rather than only the easiest one. There are 12 identical agents, each with its own copy of the environment, interacting independently and synchronously.
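A minimal sketch of this multiplicative reward, assuming exactly the two normalized components listed above (the component names and normalization are assumptions, not the Unity source):

    def crawler_reward(velocity_match: float, heading_alignment: float) -> float:
        # Both components are assumed to lie in (0, 1); multiplying them means
        # that neglecting either one drives the overall reward towards zero,
        # so the agent cannot focus on the "easiest" component alone.
        return velocity_match * heading_alignment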

Initial (random) policy vs. learned policy (animations).

The observation space consists of 172 variables corresponding to position, rotation, velocity, and angular velocities of each limb plus the acceleration and angular acceleration of the body. Each action is a vector with 20 numbers, corresponding to target rotations for joints. The environment is considered solved if a mean reward of 3000 is reached over 100 consecutive episodes.
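For both environments, the "solved" criterion amounts to a rolling mean over the scores of the last 100 episodes. A minimal sketch of such a check (the threshold of 3000 applies to Crawler, 35 to Reacher; the helper below is illustrative, not the project's code):

    from collections import deque
    import numpy as np

    scores_window = deque(maxlen=100)   # scores of the last 100 episodes

    def solved(episode_score: float, threshold: float = 3000.0) -> bool:
        # Append the latest (agent-averaged) episode score and check whether
        # the mean over the last 100 episodes has reached the threshold.
        scores_window.append(episode_score)
        return len(scores_window) == 100 and np.mean(scores_window) >= threshold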

Getting started

  1. Install Docker

  2. If you want to log the training process, create a wandb account

  3. Replace the <API-KEY> tag in the docker-compose.yml file with your wandb API key

  4. Start the training with

    # train a ppo agent in reacher environment
    docker-compose up train_reacher_ppo
    # train a ddpg agent in reacher environment
    docker-compose up train_reacher_ddpg
    # train a ppo agent in crawler environment
    docker-compose up train_crawler_ppo
    # train a ddpg agent in crawler environment
    docker-compose up train_crawler_ddpg

If you want to watch a trained agent interacting with the environment, you can download the required files at:

Reacher:

Crawler:

Once the visual environments have been downloaded, you can follow the instructions in the notebooks Watch_Reacher.ipynb or Wach_crawler.ipynb.
