- Overview
- Getting Started
- Supported Environments
- System Implementations
- Usage
- Installation
- Debugging
- Roadmap
- Contributing
- Troubleshooting and FAQ
Mava is a library for building multi-agent reinforcement learning (MARL) systems. Mava provides useful components, abstractions, utilities and tools for MARL, and allows simple scaling to multi-process system training and execution while providing a high level of flexibility and composability.
👷‍♀️ NOTICE: Our release of Mava is foremost to benefit the wider community and make it easier for researchers to work on MARL. However, we consider this release a Beta version of Mava. As with many frameworks, Mava is (and will probably always remain) a work in progress and there is much more the team aims to provide and improve in future releases, from incorporating the latest research and innovations to making the framework more stable, robust and well tested. We are committed to keeping everything working and making the experience of using Mava as pleasant as possible. During Beta development, breaking changes and significant design changes may occur (if we feel they could greatly improve the usability of the framework), but these will be clearly communicated before being incorporated into the codebase. It is also inevitable that there may be bugs we are not aware of and that things might break from time to time. We will do our best to fix these bugs and address any issues as quickly as possible.
At the core of the Mava framework is the concept of a system. A system refers to a full multi-agent reinforcement learning algorithm consisting of the following specific components: an Executor, a Trainer and a Dataset.

The Executor is the part of the system that interacts with the environment, takes actions for each agent and observes the next state as a collection of observations, one for each agent in the system. Essentially, executors are the multi-agent version of the Actor class in Acme and are themselves constructed by feeding the executor a dictionary of policy networks. The Trainer is responsible for sampling data from the Dataset originally collected by the executor and updating the parameters of every agent in the system. Trainers are therefore the multi-agent version of the Learner class in Acme. The Dataset stores all of the information collected by the executors in the form of a collection of dictionaries for the actions, observations and rewards, with keys corresponding to the individual agent ids. The basic system design is shown on the left in the above figure.
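To make the data flow concrete, the sketch below illustrates the per-agent dictionary convention described above. The agent ids and values are made-up examples for illustration, not output from a real Mava system.

```python
# Illustrative per-agent dictionaries (made-up agent ids and values, not Mava output).
observations = {"agent_0": [0.1, 0.3], "agent_1": [0.7, 0.2]}  # one observation per agent
actions = {"agent_0": 2, "agent_1": 4}                         # one action per agent
rewards = {"agent_0": 0.5, "agent_1": -1.0}                    # one reward per agent

# The executor selects `actions` from `observations` using a dictionary of policy
# networks keyed by the same agent ids; the dataset stores these dictionaries; the
# trainer samples them to update the parameters of every agent.
```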
Several examples of system implementations can be viewed here.
Mava shares much of the design philosophy of Acme for the same reason: to allow a high level of composability for novel research (i.e. building new systems) as well as making it possible to scale systems in a simple way, using the same underlying multi-agent RL system code. Mava uses Launchpad for creating distributed programs. In Mava, the system executor (which is responsible for data collection) is distributed across multiple processes, each with a copy of the environment. Each process collects and stores data which the Trainer uses to update the parameters of all the actor networks used within each executor. This approach to distributed system training is illustrated on the right in the figure above.

⚠️ NOTE: In the near future, Mava aims to support additional training setups, e.g. distributed training using multiple trainers to support Bayesian optimisation or population based training (PBT).
We have a Quickstart notebook that can be used to quickly create and train your first Multi-Agent System. For more information on how to use Mava, please view our usage section.
A given multi-agent system interacts with its environment via an EnvironmentLoop. This loop takes as input a system instance and a multi-agent environment instance which implements the DeepMind Environment API. Mava currently supports multi-agent environment loops and environment wrappers for the following environments and environment suites:
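As a rough illustration of what such a loop does, consider the sketch below. The method names (`select_actions`, `observe`, `update`) are hypothetical stand-ins used for illustration and are not Mava's actual environment-loop API.

```python
# Rough sketch of the environment-loop pattern (hypothetical method names, not Mava's API).
def run_environment_loop(environment, system, num_episodes: int) -> None:
    """Let a multi-agent system interact with a dm_env-style environment."""
    for _ in range(num_episodes):
        timestep = environment.reset()  # TimeStep carrying per-agent observations
        while not timestep.last():
            actions = system.select_actions(timestep.observation)  # one action per agent
            timestep = environment.step(actions)
            system.observe(actions, next_timestep=timestep)
        system.update()  # let the trainer learn from the newly collected data
```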
*MAD4PG on PettingZoo's Multi-Walker environment (left); VDN on the SMAC 3m map (right).*
Mava includes several system implementations. Below we list these together with an indication of the maturity of the system using the following keys: 🟩 -- Tested and working well, 🟨 -- Running and training on simple environments, but not extensively tested, and 🟥 -- Implemented but untested and yet to show clear signs of stable training.
- 🟩 - Multi-Agent Deep Q-Networks (MADQN).
- 🟩 - Multi-Agent Deep Deterministic Policy Gradient (MADDPG).
- 🟩 - Multi-Agent Distributed Distributional Deep Deterministic Policy Gradient (MAD4PG).
- 🟨 - Differentiable Inter-Agent Learning (DIAL).
- 🟨 - Multi-Agent Proximal Policy Optimisation (MAPPO).
- 🟨 - Value Decomposition Networks (VDN).
- 🟥 - Monotonic value function factorisation (QMIX).
Name | Recurrent | Continuous | Discrete | Centralised training | Communication | Multi Processing |
---|---|---|---|---|---|---|
MADQN | ✔️ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ |
DIAL | ✔️ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ |
MADDPG | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ |
MAD4PG | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ |
MAPPO | ❌ | ✔️ | ✔️ | ✔️ | ❌ | ✔️ |
VDN | ❌ | ❌ | ✔️ | ✔️ | ❌ | ✔️ |
QMIX | ❌ | ❌ | ✔️ | ✔️ | ❌ | ✔️ |
As we develop Mava further, we aim to have all systems well tested on a wide variety of environments.
To get a sense of how Mava systems are used we provide the following simplified example of launching a distributed MADQN system.
```python
# Mava imports
from mava.systems.tf import madqn
from mava.components.tf.architectures import DecentralisedPolicyActor
from . import helpers

# Launchpad imports
import launchpad

# Distributed program
program = madqn.MADQN(
    environment_factory=helpers.environment_factory,
    network_factory=helpers.network_factory,
    architecture=DecentralisedPolicyActor,
    num_executors=2,
).build()

# Launch
launchpad.launch(
    program,
    launchpad.LaunchType.LOCAL_MULTI_PROCESSING,
)
```
The first two arguments to the program are environment and network factory functions. These helper functions are responsible for creating the networks for the system, initialising their parameters on the different compute nodes and providing a copy of the environment to each executor. The next argument, `num_executors`, sets the number of executor processes to be run.

After building the program we feed it to Launchpad's `launch` function and specify the launch type to perform local multi-processing, i.e. running the distributed program on a single machine. Scaling up or down is simply a matter of adjusting the number of executor processes.
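To make the factory arguments concrete, a hypothetical `helpers` module might look roughly like the sketch below. The signatures and bodies are illustrative assumptions only and do not reflect Mava's exact factory interface.

```python
# Hypothetical helpers module: an illustrative sketch of the factory functions above.
# The signatures are assumptions, not Mava's exact interface.
from typing import Any, Dict


def environment_factory(evaluation: bool = False) -> Any:
    """Build a fresh copy of the multi-agent environment for one executor process."""
    # e.g. construct and wrap a PettingZoo environment here.
    ...


def network_factory(environment_spec: Any) -> Dict[str, Any]:
    """Build the networks for the system, e.g. one Q-network per agent for MADQN."""
    # Returned networks are typically keyed by agent id.
    ...
```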
For a deeper dive, take a look at the detailed working code examples found in our examples subdirectory which show how to instantiate a few MARL systems and environments.
Mava provides several components to support the design of MARL systems, such as different system architectures and modules. You can change the architecture to support a different form of information sharing between agents, or add a module to enhance system capabilities. Some examples of common architectures are given below.
In terms of components, you can for example update the above system code in MADQN to use a communication module by wrapping the architecture fed to the system as shown below.
```python
from mava.components.tf.modules import communication

...

# Wrap architecture in communication module
communication.BroadcastedCommunication(
    architecture=architecture,
    shared=True,
    channel_size=1,
    channel_noise=0,
)
```
All modules in Mava aim to work in this way.
We have tested `mava` on Python 3.6, 3.7 and 3.8.
- Build the docker image using the following make command:

  ```bash
  make build
  ```

  For Windows, before building the docker image, we recommend first installing the package manager chocolatey and running the following (to install make):

  ```bash
  choco install make
  ```
- Run an example:

  ```bash
  make run EXAMPLE=dir/to/example/example.py
  ```

  For example:

  ```bash
  make run EXAMPLE=examples/petting_zoo/sisl/multiwalker/feedforward/decentralised/run_mad4pg.py
  ```

  Alternatively, run bash inside a docker container with Mava installed using `make bash`, and from there examples can be run as follows: `python dir/to/example/example.py`.

  To run an example with tensorboard viewing enabled, you can run

  ```bash
  make run-tensorboard EXAMPLE=dir/to/example/example.py
  ```

  and navigate to http://127.0.0.1:6006/.
- Install the multi-agent StarCraft II environment [Optional]: To install the environment, please run the provided bash script, which is a slightly modified version of the script found here.

  ```bash
  ./install_sc2.sh
  ```

  Or optionally install through docker (each build downloads and installs StarCraft II, ~3.8 GB):

  ```bash
  make build
  make build_sc2
  ```
- Install the 2D RoboCup environment [Optional]: To install the environment, please run the RoboCup docker build command after running the Mava docker build command.

  ```bash
  make build
  make build_robocup
  ```
- If not using docker, we strongly recommend using a Python virtual environment to manage your dependencies in order to avoid version conflicts. Please note that since Launchpad only supports Linux-based OSes, using a Python virtual environment will only work in these cases:

  ```bash
  python3 -m venv mava
  source mava/bin/activate
  pip install --upgrade pip setuptools
  ```
- To install the core libraries, including Reverb (our storage dataset):

  ```bash
  pip install id-mava
  pip install id-mava[reverb]
  ```

  Or for nightly builds:

  ```bash
  pip install id-mava-nightly
  pip install id-mava-nightly[reverb]
  ```
- To install dependencies for tensorflow agents:

  ```bash
  pip install id-mava[tf]
  ```
- For distributed agent support:

  ```bash
  pip install id-mava[launchpad]
  ```
- To install example environments, such as PettingZoo:

  ```bash
  pip install id-mava[envs]
  ```
- NB: For the Flatland, OpenSpiel and SMAC environments, installations have to be done separately. Flatland can be installed using:

  ```bash
  pip install id-mava[flatland]
  ```

  and OpenSpiel, after ensuring that the right cmake and clang versions are installed as specified here, using:

  ```bash
  pip install id-mava[open_spiel]
  ```

  StarCraft II must be installed separately according to your operating system. To install the StarCraft II ML environment and associated packages, please follow the instructions on PySC2 to install the StarCraft II game files. Please ensure you have the required game maps (for both PySC2 and SMAC) extracted in the StarCraft II maps directory. Once this is done, you can install the packages for the single-agent case (PySC2) and the multi-agent case (SMAC):

  ```bash
  pip install pysc2
  pip install git+https://github.com/oxwhirl/smac.git
  ```
- For the 2D RoboCup environment, a local install has only been tested using the Ubuntu 18.04 operating system. The installation can be performed by running the RoboCup bash script while inside the Mava Python virtual environment:

  ```bash
  ./install_robocup.sh
  ```
We also have a list of optional installs for extra functionality such as the use of Atari environments, environment wrappers, GPU support and agent episode recording.
To test and debug new system implementations, we use a simplified version of the spread environment from the MPE suite.
Debugging in MARL can be very difficult and time consuming, therefore it is important to use a debugging environment that is small, simple and fast, while still clearly showing whether a system is able to learn. An illustration of the debugging environment is shown on the right. Agents start at random locations and are assigned specific landmarks which they attempt to reach in as few steps as possible. Rewards are given to each agent independently as a function of their distance to the landmark. The reward is normalised to be between 0 and 1, where 1 is given when the agent is directly on top of the landmark; the further an agent is from its landmark, the closer its reward is to 0. Collisions between agents result in a reward of -1 for the colliding agents. To test both discrete and continuous control systems we feature two versions of the environment. In the discrete version, the action space for each agent consists of the following five actions: left, right, up, down and stand-still. In the continuous case, the action space consists of real values bounded between -1 and 1 for the acceleration of the agent in the x and y directions. Several examples of running systems on the debugging environment can be found here. Below we show the results from some of our systems trained on the debugging environment.
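For concreteness, the two action spaces can be expressed as dm_env-style specs roughly as follows. This is an illustrative sketch only; the exact spec names and shapes used by the debugging environment may differ.

```python
# Illustrative dm_env-style action specs for the debugging environment.
# The exact names and shapes used by Mava's debugging environment may differ.
import numpy as np
from dm_env import specs

# Discrete version: five actions per agent (left, right, up, down, stand-still).
discrete_action_spec = specs.DiscreteArray(num_values=5, name="action")

# Continuous version: x and y acceleration, each bounded between -1 and 1.
continuous_action_spec = specs.BoundedArray(
    shape=(2,), dtype=np.float32, minimum=-1.0, maximum=1.0, name="acceleration"
)
```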
We have big ambitions for Mava! 🚀 But there is still much work that needs to be done. We have a clear roadmap and wish list for expanding our system implementations and associated modules, improving testing and robustness, and providing support for cross-machine training. Please visit them using the links below and feel free to add your own suggestions!

In the slightly longer term, the Mava team plans to release benchmarking results for several different systems and environments, and to contribute a MARL-specific behavioural environment suite (similar to bsuite for single-agent RL) specifically engineered to study aspects of MARL such as cooperation and coordination.
Please read our contributing docs for details on how to submit pull requests, our Contributor License Agreement and community guidelines.
Please read our troubleshooting and FAQs guide.
If you use Mava in your work, please cite the accompanying technical report:
```bibtex
@article{pretorius2021mava,
    title={Mava: A Research Framework for Distributed Multi-Agent Reinforcement Learning},
    author={Arnu Pretorius and Kale-ab Tessera and Andries P. Smit and Kevin Eloff
            and Claude Formanek and St John Grimbly and Siphelele Danisa and Lawrence Francis
            and Jonathan Shock and Herman Kamper and Willie Brink and Herman Engelbrecht
            and Alexandre Laterre and Karim Beguir},
    year={2021},
    journal={arXiv preprint arXiv:2107.01460},
    url={https://arxiv.org/pdf/2107.01460.pdf},
}
```