The Unsupervised Reinforcement Learning Suite (URLS)

URLS provides a set of unsupervised reinforcement learning algorithms and experiments for researching the applicability of unsupervised reinforcement learning to a variety of paradigms.

The codebase is based upon URLB and ExORL; further details are provided in the papers accompanying those projects.

URLS is intended as a successor to URLB, allowing for an increased number of experiments and RL paradigms.

Prerequisites

Install MuJoCo if it is not already installed:

  • Download the MuJoCo binaries from the official site.
  • Unzip the downloaded archive into ~/.mujoco/.
  • Append the path of the MuJoCo bin subdirectory to the LD_LIBRARY_PATH environment variable, as shown below.
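A minimal sketch of the last step, assuming MuJoCo 2.1.0 was unzipped to ~/.mujoco/mujoco210 (adjust the directory name to match the release you downloaded):

# hypothetical install path; substitute your MuJoCo version directory
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/.mujoco/mujoco210/bin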

Install the following libraries:

sudo apt update
sudo apt install libosmesa6-dev libgl1-mesa-glx libglfw3 unzip

Install dependencies:

conda env create -f conda_env.yml
conda activate urls-env

Workflow

We provide the following workflows:

Unsupervised Reinforcement Learning

Pre-training: learn from the agent's intrinsic reward on a specific domain

python pretrain.py agent=UNSUPERVISED_AGENT domain=DOMAIN
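For example, to pre-train the ICM agent on the walker domain (agent commands and domain names are listed in the tables below):

python pretrain.py agent=icm domain=walker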

Fine-tuning: learn with the pre-trained agent on a specific task; the task-specific extrinsic reward is now used by the agent

python finetune.py pretrained_agent=UNSUPERVISED_AGENT task=TASK snapshot_ts=TS obs_type=OBS_TYPE
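A hypothetical fine-tuning run, assuming URLB-style task names such as walker_run and illustrative snapshot_ts and obs_type values (use the snapshot timestep and observation type from your own pre-training run):

python finetune.py pretrained_agent=icm task=walker_run snapshot_ts=100000 obs_type=states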

Offline Learning from Unsupervised Reinforcement Learning

Pre-training: learn from the agent's intrinsic reward on a specific domain

python pretrain.py agent=UNSUPERVISED_AGENT domain=DOMAIN

Sampling: sample demonstrations from the agent's replay buffer on a specific task

python sampling.py agent=UNSUPERVISED_AGENT task=TASK samples=SAMPLES snapshot_ts=TS obs_type=OBS_TYPE
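A hypothetical sampling run following the pre-training example above (the samples, snapshot_ts, and obs_type values are illustrative):

python sampling.py agent=icm task=walker_run samples=10000 snapshot_ts=100000 obs_type=states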

Offline learning: learn a policy using the offline data collected on the specific task.

python train_offline.py agent=OFFLINE_AGENT expl_agent=UNSUPERVISED_AGENT task=TASK
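For example, to train CQL offline on the data sampled from ICM (both commands are taken from the agent tables below):

python train_offline.py agent=cql expl_agent=icm task=walker_run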

Safe Reinforcement Learning

Pre-training: learn from the agent's intrinsic reward on a specific domain

python pretrain.py agent=UNSUPERVISED_AGENT domain=DOMAIN

Sampling: sample demonstrations, with constraints and images, from the agent's replay buffer

python sampling.py agent=UNSUPERVISED_AGENT task=TASK samples=SAMPLES snapshot_ts=TS obs_type=OBS_TYPE

Trajectories to images: create an image dataset from the sampled trajectories

python data_to_images.py --env=DOMAIN

Train VAE: train a variational autoencoder on the image dataset

python train_encoder.py --env=DOMAIN

Train MPC: train the LS3 safe model predictive controller on a specific domain

python train_mpc.py --env=DOMAIN
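Putting the safe RL steps together, a hypothetical end-to-end run on the SimplePointBot domain (the agent choice and all parameter values are illustrative, not defaults from this repository):

python pretrain.py agent=icm domain=SimplePointBot
python sampling.py agent=icm task=goal samples=10000 snapshot_ts=100000 obs_type=pixels
python data_to_images.py --env=SimplePointBot
python train_encoder.py --env=SimplePointBot
python train_mpc.py --env=SimplePointBot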

Further details can be found here.


Unsupervised Agents

The following unsupervised reinforcement learning agents are available; replace UNSUPERVISED_AGENT with the entry in the Command column below. For example, to use DIAYN, set UNSUPERVISED_AGENT = diayn.

| Agent | Command | Type | Implementation Author(s) | Paper | Intrinsic Reward |
|---|---|---|---|---|---|
| ICM | icm | Knowledge | Denis | paper | $\lVert g(\mathbf{z}_{t+1} \mid \mathbf{z}_{t}, \mathbf{a}_{t}) - \mathbf{z}_{t+1} \rVert^{2}$ |
| Disagreement | disagreement | Knowledge | Catherine | paper | $\mathrm{Var}\{ g_{i}(\mathbf{z}_{t+1} \mid \mathbf{z}_{t}, \mathbf{a}_{t}) \}$ |
| RND | rnd | Knowledge | Kevin | paper | $\lVert g(\mathbf{z}_{t}, \mathbf{a}_{t}) - \tilde{g}(\mathbf{z}_{t}, \mathbf{a}_{t}) \rVert_{2}^{2}$ |
| APT(ICM) | icm_apt | Data | Hao, Kimin | paper | $\sum_{j \in \text{random}} \log \lVert \mathbf{z}_{t} - \mathbf{z}_{j} \rVert$ |
| APT(Ind) | ind_apt | Data | Hao, Kimin | paper | $\sum_{j \in \text{random}} \log \lVert \mathbf{z}_{t} - \mathbf{z}_{j} \rVert$ |
| ProtoRL | proto | Data | Denis | paper | $\sum_{j \in \text{random}} \log \lVert \mathbf{z}_{t} - \mathbf{z}_{j} \rVert$ |
| DIAYN | diayn | Competence | Misha | paper | $\log q(\mathbf{w} \mid \mathbf{z}) + \text{const}$ |
| APS | aps | Competence | Hao, Kimin | paper | $r_{t}^{\text{APT}}(\mathbf{z}) + \log q(\mathbf{z} \mid \mathbf{w})$ |
| SMM | smm | Competence | Albert | paper | $\log p^{*}(\mathbf{z}) - \log q_{\mathbf{w}}(\mathbf{z}) - \log p(\mathbf{w}) + \log d(\mathbf{w} \mid \mathbf{z})$ |

Offline Agents

The following five RL procedures are available to learn a policy offline from the unsupervised data; replace OFFLINE_AGENT with the entry in the Command column below. For example, to use behavioral cloning, set OFFLINE_AGENT = bc.

| Offline RL Procedure | Command | Paper |
|---|---|---|
| Behavior Cloning | bc | paper |
| CQL | cql | paper |
| CRR | crr | paper |
| TD3+BC | td3_bc | paper |
| TD3 | td3 | paper |

Environments

The following environments with specific domains and tasks are provided. We also provide a wrapper, based on DeepMind's acme wrapper, that converts Gym environments to the extended DMC time-step types.

| Environment Type | Domain | Task |
|---|---|---|
| DeepMind Control | walker | stand, walk, run, flip |
| DeepMind Control | quadruped | walk, run, stand, jump |
| DeepMind Control | jaco | reach_top_left, reach_top_right, reach_bottom_left, reach_bottom_right |
| DeepMind Control | cheetah | run |
| Gym Box2D | BipedalWalker-v3 | walk |
| Gym Box2D | CarRacing-v1 | race |
| Gym Classic Control | MountainCarContinuous-v0 | goal |
| Safe Control | SimplePointBot | goal |

License

The majority of URLS, including the ExORL- and URLB-based code, is licensed under the MIT license; however, portions of the project are available under separate license terms: the DeepMind code is licensed under the Apache 2.0 license.
