SafePO-Baselines
Safe Poicy Optimization (Safe PO) is a comprehensive benchmark based on safeRL. It provides RL research community a unified platform for processing and evaluating the ultramodern algorithms in various safe reinforcement learning environments. To better help the community study this problem, Safe PO are developed with the following key features:
- Comprehensive RL Benchmark: We offer many safe reinforcement learning algorithms both single agent and muti-agent, including cpo, pcpo, trpo_lagrangian, ppo_lagrangian, focops, macpo, macpo_lagrangian, hatrpo...
- More richer interfaces:In safePO, you can can modify the parameters of the algorithm according to your requirements. We provide customizable YAML files for each algorithm, and you can also pass in the parameters you want to change via argprase at the terminal.
- More fair, more effective:In the past, when comparing different algorithms, the number of interactions of each algorithm and the processing mode of buffer may be different. To solve this problem, we abstracted the most basic Policy Graident class and inherited all other algorithms from this class to ensure a more fair and reasonable performance comparison. In order to improve efficiency, we also support the parallelism of multi-core CPU, which greatly improves the efficiency of algorithm development and verification.
- More infos you can know:We provide a large number of data visualization interfaces. Reinforcement learning will involve many parameters in the training process. In order to better understand the changes of each parameter in the training process, we use log files and Tensorboard to visualize a large number of parameters in the training process. We think this helps developers tune each algorithm more efficiently.
Overview of Algorithms
Here we provide a table for Safe RL algorithms that the benchmark concludes.
Algorithm | Proceedings&Cites | Paper Links | Code URL | Official Code Repo | Official Code Framework | Official Code Last Update | Official Github Stars |
---|---|---|---|---|---|---|---|
PPO Lagrangian | ✗ | Openai | code | Official repo | Tensorflow 1 | ||
TRPO Lagrangian | ✗ | Openai | code | Official repo | Tensorflow 1 | ||
FOCOPS | Neurips 2020 (Cite: 27) | arxiv | code | Official repo | pytorch | ||
CPO | ICML 2017(Cite: 663) | arxiv | code | ✗ | ✗ | ✗ | ✗ |
PCPO | ICLR 2020(Cite: 67) | arxiv | code | Official repo | theano | ✗ | ✗ |
P3O | IJCAI 2022(Cite: 0) | arxiv | code | ✗ | ✗ | ✗ | ✗ |
MACPO | Preprint(Cite: 4) | arxiv | code | Official repo | pytorch | ||
MAPPO_Lagrangian | Preprint(Cite: 4) | arxiv | code | Official repo | pytorch | ||
HATRPO | ICLR 2022 (Cite: 10) | arxiv | code | Official repo | pytorch | ||
HAPPO (Purely reward optimisation) | ICLR 2022 (Cite: 10) | arxiv | code | Official repo | pytorch | ||
MAPPO (Purely reward optimisation) | Preprint(Cite: 98) | arxiv | code | Official repo | pytorch | ||
IPPO (Purely reward optimisation) | Preprint(Cite: 28) | arixiv | code | ✗ | ✗ | ✗ | ✗ |
Supported Environment
MuJoCo stands for Multi-Joint dynamics with Contact. It is a general purpose physics engine that aims to facilitate research and development in robotics, biomechanics, graphics and animation, machine learning, and other areas which demand fast and accurate simulation of articulated structures interacting with their environment. You can install from Mujoco github.
Task1: Velocity
Task Defination
The goal is for an agent to move along a straight line or a two dimensional plane, but the speed of the robot is constrained for safety purposes.
reward
The reward consists of three parts:
- alive bonus: Every timestep that the walker is alive, it gets a reward of 1,
- reward_forward: A reward of walking forward which is measured as (x-coordinate before action - x-coordinate after action)/dt. dt is the time between actions and is dependent on the frame_skip parameter (default is 4), where the dt for one frame is 0.002 - making the default dt = 4 * 0.002 = 0.008. This reward would be positive if the walker walks forward (right) desired.
- reward_control: A negative reward for penalising the walker if it takes actions that are too large. It is measured as -coefficient x sum(action2) where coefficient is a parameter set for the control and has a default value of 0.001
The total reward returned is reward = alive bonus + reward_forward + reward_control
cost
We obtain the velocity information as follows:
#next_obs, rew, done, info = env.step(act)
if 'y_velocity' not in info:
cost = np.abs(info['x_velocity'])
else:
cost = np.sqrt(info['x_velocity'] ** 2 + info['y_velocity'] ** 2)
Task2: Circle
Task Defination
The goal is for an agent to move along the circumference of a circle while remaining within a safety region smaller than the radius of the circle.
reward and cost
We set the circle cost as follows:
xy_position_before = mass_center(self.model, self.sim)
self.do_simulation(action, self.frame_skip)
xy_position_after = mass_center(self.model, self.sim)
xy_velocity = (xy_position_after - xy_position_before) / self.dt
x_v, y_v = xy_velocity
x_pos, y_pos = xy_position_after[0], xy_position_after[1]
d_xy = np.linalg.norm(xy_position_after, ord=2)
d_o = 10
x_lim = 2.5
# Get reward
reward = (- x_v * y_pos + y_v * x_pos) / (1 + np.abs(d_xy - d_o))
# Get cost
cost = np.float(np.abs(x_pos) > x_lim)
Safety Gym
PointGoal | PointButton | PointPush |
---|---|---|
CarGoal | CarButton | CarPush |
DoggoGoal | DoggoButton | DoggoPush |
More details
In each task:Goal, Button, Push, there are three levels of difficulty(with higher levels having more challenging constraints).
-
Safexp-{Robot}Goal0-v0
: A robot must navigate to a goal. -
Safexp-{Robot}Goal1-v0
: A robot must navigate to a goal while avoiding hazards. One vase is present in the scene, but the agent is not penalized for hitting it. -
Safexp-{Robot}Goal2-v0
: A robot must navigate to a goal while avoiding more hazards and vases. -
Safexp-{Robot}Button0-v0
: A robot must press a goal button. -
Safexp-{Robot}Button1-v0
: A robot must press a goal button while avoiding hazards and gremlins, and while not pressing any of the wrong buttons. -
Safexp-{Robot}Button2-v0
: A robot must press a goal button while avoiding more hazards and gremlins, and while not pressing any of the wrong buttons. -
Safexp-{Robot}Push0-v0
: A robot must push a box to a goal. -
Safexp-{Robot}Push1-v0
: A robot must push a box to a goal while avoiding hazards. One pillar is present in the scene, but the agent is not penalized for hitting it. -
Safexp-{Robot}Push2-v0
: A robot must push a box to a goal while avoiding more hazards and pillars.(To make one of the above, make sure to substitute
{Robot}
for one ofPoint
,Car
, orDoggo
.) If you want find more information about Safety Gym, you can check this.
Safe Bullet Gym
Agent
Ball | Car | Drone | Ant |
---|---|---|---|
Task
Circle | Gather | Reach | Run |
---|---|---|---|
More Description
-
Ball: A spherical shaped agent which can freely move on the xy-plane.
-
Car: A four-wheeled agent based on MIT's Racecar.
-
Drone: An air vehicle based on the AscTec Hummingbird quadrotor.
-
Ant: A four-legged animal with a spherical torso.
-
Circle: Agents are expected to move on a circle in clock-wise direction (as proposed by Achiam et al. (2017)). The reward is dense and increases by the agent's velocity and by the proximity towards the boundary of the circle. Costs are received when agent leaves the safety zone defined by the two yellow boundaries.
-
Gather Agents are expected to navigate and collect as many green apples as possible while avoiding red bombs (Duan et al. 2016). In contrast to the other tasks, agents in the gather tasks receive only sparse rewards when reaching apples. Costs are also sparse and received when touching bombs (Achiam et al. 2017).
-
Reach: Agents are supposed to move towards a goal (Ray et al. 2019). As soon the agents enters the goal zone, the goal is re-spawned such that the agent has to reach a series of goals. Obstacles are placed to hinder the agent from trivial solutions. We implemented obstacles with a physical body, into which agents can collide and receive costs, and ones without collision shape that produce costs for traversing. Rewards are dense and increase for moving closer to the goal and a sparse component is obtained when entering the goal zone.
-
Run: Agents are rewarded for running through an avenue between two safety boundaries (Chow et al. 2019). The boundaries are non-physical bodies which can be penetrated without collision but provide costs. Additional costs are received when exceeding an agent-specific velocity threshold.
If you want to find more information, you can check this.
Installation
Because you use this baselines, you need to install environments that you want test. You can check Mujoco, Safety_gym, Bullet_gym for more detail to install. Details regarding installation of IsaacGym can be found here. We currently support the Preview Release 3
version of IsaacGym.
Conda-Environment
conda create -n safe python=3.8
# because the cuda version, we recommend you install pytorch manual.
pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
pip install -e .
Machine Configuration
We test all algorithms and experiments in CPU: AMD Ryzen Threadripper PRO 3975WX 32-Cores and GPU: NVIDIA GeForce RTX 3090, Driver Version: 495.44.
Getting Started
Single Agent
Train
All algorithm codes are in file:Parallel_algorithm, for example, if you want to run ppo_lagrangian in safety_gym:Safexp-PointGoal1-v0, with cpu cores:4, seed:0,
python train.py --env_id Safexp-PointGoal1-v0 --algo ppo_lagrangian --cores 4
Configures
Argprase | default | info |
---|---|---|
--algo | required | the name of algorithm exec |
--cores | int | the number of cpu physical cores you use |
--seed | int | the seed you use |
--check_freq | int: 25 | check the snyc parameter |
--entropy_coef | float:0.01 | the parameter of entropy |
--gamma | float:0.99 | the value of dicount |
--lam | float: 0.95 | the value of GAE lambda |
--lam_c | float: 0.95 | the value of GAE cost lambda |
--max_ep_len | int: 1000 | unless environment have the default value else, we take 1000 as default value |
--max_grad_norm | float: 0.5 | the clip of parameters |
--num_mini_batches | int: 16 | used for value network tranining |
--optimizer | Adam | the optimizer of Policy other : SGD, other class in torch.optim |
--pi_lr | float: 3e-4 | the learning rate of policy |
--steps_per_epoch | int: 32000 | the number of interactor steps |
--target_kl | float: 0.01 | the value of trust region |
--train_pi_iterations | int: 80 | the number of policy learn iterations |
--train_v_iterations | int: 40 | the number of value network and cost value network iterations |
--use_cost_value_function | bool: False | use cost_value_function or not |
--use_entropy | bool:False | use entropy or not |
--use_reward_penalty | bool:False | use reward_penalty or not |
E.g. if we want use trpo_lagrangian in environment: with 10 cores and seed:0, we can run the following command:
python train.py --algo trpo_lagrangian --env_id Safexp-PointGoal1-v0 --cores 10 --seed 0
Mult-agent
About this repository
This repository provides a safe MARL baseline benchmark for safe MARL research on challenging tasks of safety DexterousHands (which is developed for MARL, named as Safe MAIG, for details, see Safe MAIG), in which the MACPO, MAPPO-lagrangian, MAPPO, HAPPO, IPPO are all implemented to investigate the safety and reward performance.
Installation
Details regarding installation of IsaacGym can be found here. We currently support the Preview Release 3
version of IsaacGym.
Pre-requisites
The code has been tested on Ubuntu 18.04 with Python 3.7. The minimum recommended NVIDIA driver
version for Linux is 470
(dictated by support of IsaacGym).
It uses Anaconda to create virtual environments. To install Anaconda, follow instructions here.
Ensure that Isaac Gym works on your system by running one of the examples from the python/examples
directory, like joint_monkey.py
. Follow troubleshooting steps described in the Isaac Gym Preview 2
install instructions if you have any trouble running the samples.
install this repo
Once Isaac Gym is installed and samples work within your current python environment, install marl package algos/marl
with:
pip install -e .
Running the benchmarks
To train your first policy, run this line in algos/marl
:
python train.py --task=ShadowHandOver --algo=macpo
Select an algorithm
To select an algorithm, pass --algo=ppo/mappo/happo/hatrpo
in algos/marl
as an argument:
python train.py --task=ShadowHandOver --algo=macpo
Select tasks
Source code for tasks can be found in dexteroushandenvs/tasks
.
Until now we only suppose the following environments:
Environments | ShadowHandOver | ShadowHandCatchUnderarm | ShadowHandTwoCatchUnderarm | ShadowHandCatchAbreast | ShadowHandOver2Underarm |
---|---|---|---|---|---|
Description | These environments involve two fixed-position hands. The hand which starts with the object must find a way to hand it over to the second hand. | These environments again have two hands, however now they have some additional degrees of freedom that allows them to translate/rotate their centre of masses within some constrained region. | These environments involve coordination between the two hands so as to throw the two objects between hands (i.e. swapping them). | This environment is similar to ShadowHandCatchUnderarm, the difference is that the two hands are changed from relative to side-by-side posture. | This environment is is made up of half ShadowHandCatchUnderarm and half ShadowHandCatchOverarm, the object needs to be thrown from the vertical hand to the palm-up hand |
Actions Type | Continuous | Continuous | Continuous | Continuous | Continuous |
Total Action Num | 40 | 52 | 52 | 52 | 52 |
Action Values | [-1, 1] | [-1, 1] | [-1, 1] | [-1, 1] | [-1, 1] |
Action Index and Description | detail | detail | detail | detail | detail |
Observation Shape | (num_envs, 2, 211) | (num_envs, 2, 217) | (num_envs, 2, 217) | (num_envs, 2, 217) | (num_envs, 2, 217) |
Observation Values | [-5, 5] | [-5, 5] | [-5, 5] | [-5, 5] | [-5, 5] |
Observation Index and Description | detail | detail | detail | detail | detail |
State Shape | (num_envs, 2, 398) | (num_envs, 2, 422) | (num_envs, 2, 422) | (num_envs, 2, 422) | (num_envs, 2, 422) |
State Values | [-5, 5] | [-5, 5] | [-5, 5] | [-5, 5] | [-5, 5] |
Rewards | Rewards is the pose distance between object and goal. You can check out the details here | Rewards is the pose distance between object and goal. You can check out the details here | Rewards is the pose distance between object and goal. You can check out the details here | Rewards is the pose distance between two object and two goal, this means that both objects have to be thrown in order to be swapped over. You can check out the details here | Rewards is the pose distance between object and goal. You can check out the details here |
Demo |
Demo
If you want to see some demo with our benchmark, you can check it Demo
The Team
The Baseline is a project contributed by MARL team at Peking University, please contact yaodong.yang@pku.edu.cn if you are interested to collaborate. We also thank the list of contributors from the following open source repositories: Spinning Up, Bullet-Safety-Gym, SvenG, Safety Gym.