Imagination-Augmented Agents for Deep Reinforcement Learning to Solve Rubik's Cubes

Jeju Deep Learning Camp 2018 [ Amlesh Sivanantham ]

This project aims to solve a Rubik's Cube environment with the model described in the paper Imagination-Augmented Agents for Deep Reinforcement Learning [Weber et al]. The I2A model generates observation predictions from a learned environment model. I2As learn to leverage multiple rollouts of these predictions to construct a better policy and value network for the agent.

Imagination-Augmented Agents (I2A) Details

Full I2A Model Architecture

The value and policy functions leverage information from a model-free path and a model-based path. The model-free path simply processes the current observation from the environment. The model-based path uses the observation to generate multiple rollout encodings, which are aggregated and fed to the rest of the network. The rollout encodings are generated by the Imagination Core, which uses a learned Environment Model to create the rollouts. These encodings can contain information about solving subproblems within the environment that may not yield any reward immediately but are beneficial for obtaining reward later on.
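As a rough illustration of how the two paths meet, here is a minimal NumPy sketch. It assumes the aggregator simply concatenates the rollout encodings (as in the paper) and uses plain linear heads; the function names, dimensions, and random weights are hypothetical and not taken from this repository.

```python
import numpy as np

def combine_paths(model_free_feat, rollout_encodings):
    """Join the model-free features with the aggregated rollout encodings."""
    # Aggregator (assumed): concatenate the per-rollout encodings into one vector.
    imagined = np.concatenate(rollout_encodings, axis=0)
    # Joint input shared by the policy and value heads.
    return np.concatenate([model_free_feat, imagined], axis=0)

def heads(joint, w_pi, w_vf):
    """Hypothetical linear policy and value heads on top of the joint features."""
    logits = joint @ w_pi        # one logit per action
    value = float(joint @ w_vf)  # scalar state-value estimate
    return logits, value

rng = np.random.default_rng(0)
n_actions, feat_dim, enc_dim = 12, 8, 4  # illustrative sizes only
model_free_feat = rng.normal(size=feat_dim)
rollout_encodings = [rng.normal(size=enc_dim) for _ in range(n_actions)]

joint = combine_paths(model_free_feat, rollout_encodings)
logits, value = heads(joint,
                      rng.normal(size=(joint.size, n_actions)),
                      rng.normal(size=joint.size))
print(joint.shape, logits.shape)
```

In the real model both heads would sit on top of learned layers; the point here is only that the policy and value outputs see the model-free features and the imagined information side by side.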

Environment Model

The Environment Model is a simple Convolutional Network that takes in the current observation of the state and an action, and outputs the predicted next state and reward that the real environment would produce. The input action is converted to a one-hot vector, which is then expanded into a one-hot channel representation (one constant-valued plane per action) and stacked alongside the other channels of the input observation. A more accurate model does improve the overall performance of the I2A model, but the paper shows that even with a poor model, the full architecture is able to disregard the inaccuracies and still attain better performance than a purely model-free counterpart.
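The action encoding described above can be sketched in a few lines of NumPy. This is an illustrative version only: the channels-last `(H, W, C)` layout and the function name `stack_action_channels` are assumptions, not the repository's actual code.

```python
import numpy as np

def stack_action_channels(obs, action, n_actions):
    """Append one-hot action planes to an observation.

    obs: array of shape (H, W, C); returns an array of shape (H, W, C + n_actions).
    """
    h, w, _ = obs.shape
    one_hot = np.zeros(n_actions, dtype=obs.dtype)
    one_hot[action] = 1
    # Broadcast each one-hot entry across a full H x W plane:
    # the chosen action's plane is all ones, every other plane is all zeros.
    action_planes = np.broadcast_to(one_hot, (h, w, n_actions))
    return np.concatenate([obs, action_planes], axis=-1)

obs = np.zeros((3, 3, 6), dtype=np.float32)  # hypothetical observation shape
model_input = stack_action_channels(obs, action=2, n_actions=12)
print(model_input.shape)
```

Tiling the action over every spatial position lets each convolutional filter see the action no matter where on the input it is applied.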

Imagination Core and Encoder

The Imagination Core is responsible for generating the trajectories and encoding them. We use a rollout policy that chooses which action to take given either the real observation or any imagined state generated by the Environment Model. The choice of rollout policy is open to experimentation, and the details are described in the next section. The rollout policy is used to generate n rollouts. Each rollout is then passed through a convolutional Encoder and an LSTM network, which output the encoding for that rollout.
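The rollout generation loop can be sketched as follows. This is a simplified illustration under stated assumptions: `env_model` and `rollout_policy` are hypothetical callables standing in for the learned networks, one rollout is generated per possible first action (the choice made in the paper), and the convolutional Encoder and LSTM are left out.

```python
import numpy as np

def imagine_rollout(obs, first_action, env_model, rollout_policy, depth):
    """Roll the environment model forward `depth` steps, starting with `first_action`."""
    states, rewards = [], []
    action = first_action
    for _ in range(depth):
        obs, reward = env_model(obs, action)  # predicted next observation and reward
        states.append(obs)
        rewards.append(reward)
        action = rollout_policy(obs)          # next action is chosen from the imagined state
    return np.stack(states), np.array(rewards)

def imagine_all(obs, n_actions, env_model, rollout_policy, depth):
    """Generate one rollout per possible first action."""
    return [imagine_rollout(obs, a, env_model, rollout_policy, depth)
            for a in range(n_actions)]

# Toy stand-ins (hypothetical): a fake "model" and a deterministic policy.
N_ACTIONS = 12
toy_env_model = lambda obs, action: ((obs + action + 1) % 6, 0.0)
toy_policy = lambda obs: int(obs.sum()) % N_ACTIONS

rollouts = imagine_all(np.zeros((3, 3), dtype=int), N_ACTIONS,
                       toy_env_model, toy_policy, depth=3)
print(len(rollouts), rollouts[0][0].shape, rollouts[0][1].shape)
```

Each element of `rollouts` holds the imagined states and rewards of one trajectory; in the full model this pair is what the Encoder and LSTM would consume.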

Rollout Policy

For the environments used in the paper, distilling the full I2A network into a smaller, entirely model-free network gave the best results as the rollout policy. This is because the rollout policy then becomes goal-oriented and thus produces rollouts that search for the goal.
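Distillation of this kind is commonly expressed as a cross-entropy loss between the I2A policy (the target) and the rollout policy. The NumPy sketch below illustrates that idea; it is not the loss code from this repository, and the function names are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(i2a_logits, rollout_logits):
    """Cross-entropy between the I2A policy (target) and the rollout policy."""
    target = softmax(i2a_logits)  # treated as a constant: no gradient flows into the I2A policy
    log_q = np.log(softmax(rollout_logits) + 1e-12)
    return float(-(target * log_q).sum(axis=-1).mean())

i2a_logits = np.array([[2.0, 0.0, -1.0]])
loss_matching = distillation_loss(i2a_logits, i2a_logits)
loss_different = distillation_loss(i2a_logits, np.array([[-1.0, 0.0, 2.0]]))
print(loss_matching < loss_different)
```

The loss is lowest when the rollout policy reproduces the I2A policy's action distribution, which is what pulls the small model-free network toward the goal-seeking behaviour of the full agent.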

The Environment

(Note that this is NOT the agent solving the environment)

The Rubik's Cube environment is actually a very simple environment with only 12 possible actions (not counting mid-turns). If we include the ability to change the orientation of the cube, we get a total of 18 actions. The 3x3x3 Rubik's Cube has 43,252,003,274,489,856,000 permutations, making it equivalent to a very large maze. Out of these permutations, only a single state yields reward: the solved state. Theoretically, regardless of how scrambled the cube is, it can always be solved in 26 moves or fewer (counting quarter turns).
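The permutation count quoted above follows from the standard counting argument for the 3x3x3 cube: the 8 corners can be arranged in 8! ways with 3 orientations each (the last one is forced), the 12 edges in 12! ways with 2 orientations each (the last one is forced), and a parity constraint rules out half of the combinations. A short check in Python:

```python
from math import factorial

# Corners: 8! arrangements, 3 orientations each, last orientation is determined.
corner_states = factorial(8) * 3**7
# Edges: 12! arrangements, 2 orientations each, last orientation is determined.
edge_states = factorial(12) * 2**11
# Corner and edge permutations must have the same parity, which halves the total.
total_states = corner_states * edge_states // 2

print(f"{total_states:,}")  # 43,252,003,274,489,856,000
```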

In our setup, the environment is implemented as a gym environment using the API that OpenAI's Gym framework provides. The reward is sparse: +1 on the solved state and 0 for all other states. While the environment supports cubes of any order n, testing will only be done on the cube-x2-v0 and cube-x3-v0 environments. We will also explore how effective the policy can become when the agent is allowed to control the orientation of the cube.
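To make the sparse reward concrete, here is a small sketch. The state representation (a `(6, n, n)` integer array of sticker colors) and the function names are assumptions made for illustration; the actual environment may store the cube differently.

```python
import numpy as np

def is_solved(faces):
    """faces: (6, n, n) integer array of sticker colors (assumed representation)."""
    # The cube is solved when every face shows a single color.
    return all(bool(np.all(face == face.flat[0])) for face in faces)

def sparse_reward(faces):
    """+1 on the solved state, 0 everywhere else."""
    return 1.0 if is_solved(faces) else 0.0

solved = np.stack([np.full((3, 3), color) for color in range(6)])
scrambled = solved.copy()
scrambled[0, 0, 0] = 5  # one sticker out of place (not a reachable state, just a test)

print(sparse_reward(solved), sparse_reward(scrambled))
```

Checking that each face is uniform, rather than comparing against one fixed arrangement, keeps the reward correct when the agent is also allowed to change the cube's orientation.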

Results (Work in Progress)

Full Usage

usage: main.py [-h] [--a2c] [--a2c-pd-test] [--em] [--vae] [--i2a]
               [--iters ITERS] [--env ENV] [--workers WORKERS]
               [--nsteps NSTEPS] [--scramble SCRAMBLE] [--maxsteps MAXSTEPS]
               [--noise NOISE] [--adaptive] [--spectrum] [--easy]
               [--no-orient-scramble] [--a2c-arch A2C_ARCH]
               [--a2c-load A2C_LOAD] [--lr LR] [--pg-coeff PG_COEFF]
               [--vf-coeff VF_COEFF] [--ent-coeff ENT_COEFF]
               [--em-arch EM_ARCH] [--em-load EM_LOAD] [--em-loss EM_LOSS]
               [--obs-coeff OBS_COEFF] [--rew-coeff REW_COEFF]
               [--vae-arch VAE_ARCH] [--vae-load VAE_LOAD]
               [--kl-coeff KL_COEFF] [--i2a-arch I2A_ARCH]
               [--i2a-load I2A_LOAD] [--exp-root EXP_ROOT] [--exppath]
               [--tag TAG] [--log-interval LOG_INTERVAL] [--cpu CPU]
               [--no-override] [--arch-help]

optional arguments:
  -h, --help            show this help message and exit
  --a2c                 Train the Actor-Critic Agent (default: False)
  --a2c-pd-test         Test the Actor-Critic Params on a single env and show
                        policy logits (default: False)
  --em                  Train the Environment Model (default: False)
  --vae                 Train the Variational AutoEncoder Model (default:
                        False)
  --i2a                 Train the Imagination Augmented Agent (default: False)
  --iters ITERS         Number of training iterations (default: 50000.0)
  --env ENV             Environment ID (default: cube-x3-v0)
  --workers WORKERS     Set the number of workers (default: 16)
  --nsteps NSTEPS       Number of environment steps per training iteration per
                        worker (default: 40)
  --scramble SCRAMBLE   Set the max scramble size. format: size (or)
                        initial:target:episodes (default: 1)
  --maxsteps MAXSTEPS   Set the max step size. format: size (or)
                        initial:target:episodes (default: 1)
  --noise NOISE         Set the noise for observations from the environment
                        (default: 0.0)
  --adaptive            Turn on the adaptive curriculum (default: False)
  --spectrum            Setup up a spectrum of environments with different
                        difficulties (default: False)
  --easy                Make the environment extremely easy; No orientation
                        change, only R scrabmle (default: False)
  --no-orient-scramble  Lets the environment scramble orientation as well
                        (default: False)
  --a2c-arch A2C_ARCH   Specify the policy architecture, [Look at --arch-help]
                        (default: c2d+:16:3:1_h:4096:2048_pi_vf)
  --a2c-load A2C_LOAD   Load Path for the Actor-Critic Weights (default: None)
  --lr LR               Specify the learning rate to use (default: 0.0007)
  --pg-coeff PG_COEFF   Specify the Policy Gradient Loss Coefficient (default:
                        1.0)
  --vf-coeff VF_COEFF   Specify the Value Function Loss Coefficient (default:
                        0.5)
  --ent-coeff ENT_COEFF
                        Specify the Entropy Coefficient (default: 0.05)
  --em-arch EM_ARCH     Specify the environment model architecture [Look at
                        --arch-help] (default: c2d:32:3:1_c2d:64:3:1_c2d:128:3
                        :1_h:4096:2048:1024_c2dT:128:4:1_c2dT:6:3:3)
  --em-load EM_LOAD     Load Path for the Environment-Model Weights (default:
                        None)
  --em-loss EM_LOSS     Specify the loss function for training the Env Model
                        [mse,ent] (default: mse)
  --obs-coeff OBS_COEFF
                        Specify the Predicted Observation Loss Coefficient
                        (default: 1.0)
  --rew-coeff REW_COEFF
                        Specify the Predicted Reward Loss Coefficient
                        (default: 1.0)
  --vae-arch VAE_ARCH   Specify the VAE model architecture [Look at --arch-
                        help] (default: c2d:32:3:1_c2d:64:3:1_c2d:128:3:1_z:32
                        :1024_c2dT:128:4:1_c2dT:6:3:3)
  --vae-load VAE_LOAD   Load Path for the Variational AutoEncoder Weights
                        (default: None)
  --kl-coeff KL_COEFF   Specify the KL-Divergence Coefficient (default: 0.5)
  --i2a-arch I2A_ARCH   Specify the I2A policy architecture [Look at --arch-
                        help] (default: NULL)
  --i2a-load I2A_LOAD   Load Path for the Imagination-Augmented Agents Weights
                        (default: None)
  --exp-root EXP_ROOT   Set the root path for all experiments (default:
                        ./experiments/)
  --exppath             Return the experiment folder under the specified
                        arguments (default: False)
  --tag TAG             Tag the current experiemnt (default: )
  --log-interval LOG_INTERVAL
                        Set the logging interval (default: 1000.0)
  --cpu CPU             Set the number of cpu cores available (default: 16)
  --no-override         Prevent loading arguments to override default settings
                        (default: False)
  --arch-help           Show the help dialogue for constructing model
                        architectures (default: False)
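The --scramble and --maxsteps options accept either a single size or an initial:target:episodes triple. How main.py interprets the triple is not documented here, so the sketch below is only one plausible reading: a linear ramp from the initial value to the target over the given number of episodes. The function names and the ramp itself are assumptions, not the program's actual behaviour.

```python
def parse_schedule(spec):
    """Parse 'size' or 'initial:target:episodes' into (initial, target, episodes)."""
    parts = [int(p) for p in spec.split(":")]
    if len(parts) == 1:
        # A single size means a constant value with no ramp.
        return parts[0], parts[0], 0
    if len(parts) == 3:
        return parts[0], parts[1], parts[2]
    raise ValueError(f"expected 'size' or 'initial:target:episodes', got {spec!r}")

def scheduled_value(spec, episode):
    """Assumed behaviour: move linearly from initial to target over `episodes`."""
    initial, target, episodes = parse_schedule(spec)
    if episodes <= 0 or episode >= episodes:
        return target
    return initial + (target - initial) * episode // episodes

print(parse_schedule("5"), parse_schedule("1:10:100"))
print(scheduled_value("1:11:100", 50))
```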

Miscellaneous Utilities

The miscellaneous scripts are found in misc. Their usage is documented here.

fuse_bucket.sh

By default, the main.py program saves experiments to the ./experiments directory. The necessary files will automatically be created in the local filesystem. But let's say you want to save these experiments to a Google Cloud bucket instead. This script mounts the GCP bucket at the ./experiments folder using gcsfuse. More details can be found in Google's Cloud Storage FUSE documentation.

kill_fuse.sh

Just a script to unmount the Google Cloud FUSE mentioned earlier.

Acknowledgements

Mentored by Kyoungmanlee, Game Contents AI Team Member at Netmarble.

This work was supported by Deep Learning Camp Jeju 2018, which was organized by the TensorFlow Korea User Group.

zamlz / dlcampjeju2018-I2A-cube