saiham6 / muzero-general-posterior

muzero-general with posterior sampling strategies implemented

Disclaimer

This repository has diverged from the original muzero-general repository by several commits, so it is maintained as a separate repository with Posterior Sampling Strategies implemented.

Please refer to the paper provided for the algorithm, results and improvements of the project.

Please refer to this repository for the Proof-of-Concept of Posterior Sampling Strategies of the project.

File Structure

  1. self_play_OG.py contains the original pUCT code for MuZero
  2. self_play_TS.py contains code for Bernoulli Thompson Sampling
  3. self_play.py contains code for Gaussian Thompson Sampling
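
The difference between the pUCT rule and Bernoulli Thompson Sampling can be sketched as follows. This is a minimal illustration, not the code in self_play_OG.py or self_play_TS.py: the function names, the dictionary layout of `children`, and the Beta(1, 1) prior are assumptions made for the example. The pUCT formula and its two constants follow the published MuZero pseudocode, with value normalization omitted for brevity.

```python
import math
import random


def puct_score(parent_visits, child_visits, child_value, prior,
               pb_c_base=19652, pb_c_init=1.25):
    """pUCT score: mean value plus a prior-weighted exploration bonus."""
    pb_c = math.log((parent_visits + pb_c_base + 1) / pb_c_base) + pb_c_init
    pb_c *= math.sqrt(parent_visits) / (child_visits + 1)
    return child_value + pb_c * prior


def select_puct(parent_visits, children):
    """children (hypothetical layout): action -> (visits, mean_value, prior)."""
    return max(children, key=lambda a: puct_score(parent_visits, *children[a]))


def select_bernoulli_ts(children, rng):
    """children (hypothetical layout): action -> (successes, failures).

    Draw one sample per action from its Beta posterior (uniform Beta(1, 1)
    prior assumed) and pick the action with the largest sample.
    """
    samples = {a: rng.betavariate(1 + s, 1 + f) for a, (s, f) in children.items()}
    return max(samples, key=samples.get)


if __name__ == "__main__":
    rng = random.Random(0)
    print(select_puct(10, {0: (10, 0.5, 0.1), 1: (0, 0.0, 0.9)}))
    print(select_bernoulli_ts({0: (100, 1), 1: (1, 100)}, rng))
```

pUCT is deterministic given the node statistics, whereas Thompson Sampling explores by sampling from a posterior, so repeated calls on the same statistics can return different actions.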

The project is limited to three games: simple_grid, tictactoe and gridworld.

Other games may not work with Gaussian Thompson Sampling unless the Fstd and g_std constants are declared for those games in the games folder.
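
To illustrate why Gaussian Thompson Sampling needs a per-game standard deviation, here is a minimal sketch. It is not the code in self_play.py: `select_gaussian_ts`, the `children` layout, and the single `prior_std` parameter are hypothetical, and how the repository actually uses its two constants is not described here. The sketch only assumes that each action's value is modeled as a Gaussian whose spread shrinks with the visit count.

```python
import math
import random


def select_gaussian_ts(children, prior_std, rng):
    """children (hypothetical layout): action -> (visits, mean_value).

    Sample a value for each action from N(mean_value, (prior_std^2) / (visits + 1))
    and return the action with the largest sample. prior_std stands in for a
    game-specific constant and is an assumption of this sketch.
    """
    best_action, best_sample = None, -math.inf
    for action, (visits, mean_value) in children.items():
        std = prior_std / math.sqrt(visits + 1)
        sample = rng.gauss(mean_value, std)
        if sample > best_sample:
            best_action, best_sample = action, sample
    return best_action


if __name__ == "__main__":
    rng = random.Random(0)
    print(select_gaussian_ts({0: (3, 0.2), 1: (1, 0.1)}, prior_std=1.0, rng=rng))
```

If the standard deviation is far too small for a game's reward scale, the samples collapse onto the means and exploration stops; if it is far too large, selection becomes nearly random. That is one plausible reason a game without tuned constants may not work.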

Installation

git clone https://github.com/saiham6/muzero-general-posterior.git
cd muzero-general-posterior

pip install -r requirements.txt

Run

python muzero.py

Monitoring

To visualize the training results, run in a new terminal:

tensorboard --logdir ./results

The following documentation is unchanged from the original muzero-general repository.

MuZero General

A commented and documented implementation of MuZero based on the Google DeepMind paper (Schrittwieser et al., Nov 2019) and the associated pseudocode. It is designed to be easily adaptable to any game or reinforcement learning environment (like gym). You only need to add a game file with the hyperparameters and the game class. Please refer to the documentation and the example. This implementation is primarily for educational purposes.
Explanatory video of MuZero

MuZero is a state-of-the-art RL algorithm for board games (Chess, Go, ...) and Atari games. It is the successor to AlphaZero but without any knowledge of the environment's underlying dynamics. MuZero learns a model of the environment and uses an internal representation that contains only the useful information for predicting the reward, value, policy and transitions. MuZero is also close to Value prediction networks. See How it works.
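
The learned model described above consists of three functions in the MuZero paper: a representation function, a dynamics function and a prediction function. The sketch below shows how they compose when unrolling a sequence of actions without querying the real environment. The `unroll` helper and the three toy functions are hypothetical stand-ins written for this example; in the actual implementation they are neural networks.

```python
def unroll(observation, actions, representation, dynamics, prediction):
    """Roll the learned model forward over a sequence of actions.

    representation(observation) -> hidden state
    prediction(state)           -> (policy, value)
    dynamics(state, action)     -> (reward, next hidden state)
    """
    state = representation(observation)
    trajectory = []
    for action in actions:
        policy, value = prediction(state)
        reward, state = dynamics(state, action)
        trajectory.append((action, reward, value, policy))
    return trajectory


# Toy stand-ins (assumptions for illustration only).
def toy_representation(observation):
    return [float(x) for x in observation]


def toy_dynamics(state, action):
    return float(action), [s + action for s in state]


def toy_prediction(state):
    return [0.5, 0.5], sum(state)


if __name__ == "__main__":
    for step in unroll([0, 1], [1, 0], toy_representation, toy_dynamics, toy_prediction):
        print(step)
```

Only the first step touches a real observation; every later state comes from the dynamics function, which is why the tree search can plan without access to the environment's rules.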

Features

  • Residual Network and Fully connected network in PyTorch
  • Multi-Threaded/Asynchronous/Cluster with Ray
  • Multi-GPU support for training and self-play
  • TensorBoard real-time monitoring
  • Model weights automatically saved at checkpoints
  • Single and two player mode
  • Commented and documented
  • Easily adaptable for new games
  • Examples of board games, Gym and Atari games (See list of implemented games)
  • Pretrained weights available
  • Windows support (Experimental / Workaround: Use the notebook in Google Colab)

Further improvements

Here is a list of features which could be interesting to add but which are not in MuZero's paper. We are open to contributions and other ideas.

Demo

Performance is tracked and displayed in real time in TensorBoard:

[Image: cartpole training summary]

Testing Lunar Lander:

[Image: lunarlander training preview]

Games already implemented

  • Cartpole (Tested with the fully connected network)
  • Lunar Lander (Tested in deterministic mode with the fully connected network)
  • Gridworld (Tested with the fully connected network)
  • Tic-tac-toe (Tested with the fully connected network and the residual network)
  • Connect4 (Slightly tested with the residual network)
  • Gomoku
  • Twenty-One / Blackjack (Tested with the residual network)
  • Atari Breakout

Tests are done on Ubuntu with 16 GB RAM / Intel i7 / GTX 1050Ti Max-Q. We check that training progresses and reaches a level showing that the agent has learned, but we do not systematically reach human-level play. For certain environments, we notice a regression after a certain time. The proposed configurations are certainly not optimal, and for now we do not focus on hyperparameter optimization. Any help is welcome.

Code structure

[Image: code structure]

Getting started

Installation

git clone https://github.com/werner-duvaud/muzero-general.git
cd muzero-general

pip install -r requirements.lock

Run

python muzero.py

To visualize the training results, run in a new terminal:

tensorboard --logdir ./results

Config

You can adapt the configurations of each game by editing the MuZeroConfig class of the respective file in the games folder.
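
As a rough sketch of what such an edit looks like, the example below defines a stripped-down stand-in for a game's configuration class and changes a few hyperparameters. The attribute names and default values shown are assumptions for illustration; check the actual MuZeroConfig in the game file you are editing for the real names and defaults.

```python
class MuZeroConfig:
    """Simplified stand-in for a game's configuration class (illustrative only)."""

    def __init__(self):
        # Assumed hyperparameters; the real class defines many more.
        self.num_simulations = 25     # tree search simulations per move
        self.training_steps = 10000   # total number of training steps
        self.lr_init = 0.003          # initial learning rate


config = MuZeroConfig()
# Adapting the configuration: more search per move, shorter training run.
config.num_simulations = 50
config.training_steps = 5000

if __name__ == "__main__":
    print(vars(config))
```

In the repository itself the values are changed directly inside the `__init__` of the game's MuZeroConfig rather than overridden after construction; the override here just keeps the sketch self-contained.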

Related work

  • EfficientZero (Weirui Ye, Shaohuai Liu, Thanard Kurutach, Pieter Abbeel, Yang Gao)
  • Sampled MuZero (Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Mohammadamin Barekatain, Simon Schmitt, David Silver)

Authors

Please use this bibtex if you want to cite this repository (master branch) in your publications:

@misc{muzero-general,
  author       = {Werner Duvaud and Aurèle Hainaut},
  title        = {MuZero General: Open Reimplementation of MuZero},
  year         = {2019},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/werner-duvaud/muzero-general}},
}

Getting involved
