brade31919 / CEDL_Policy_gradient


Homework 2 - Policy Gradient

Please complete the homework as a team, and
mention in your report who contributed which parts.

Introduction

In this assignment, we will solve the classic control problem - CartPole.

CartPole is an environment in which a pole is attached by an un-actuated joint to a cart, and the goal is to prevent the pole from falling over. You can apply a force of +1 or -1 to the cart, and a reward of +1 is provided for every timestep that the pole remains upright.
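
For concreteness, here is a minimal random-agent loop against CartPole. This is a sketch assuming the gym API of this assignment's era (env.step returning a 4-tuple); it is not part of the assignment code:

    import gym

    env = gym.make('CartPole-v0')
    observation = env.reset()
    total_reward = 0
    for _ in range(200):
        action = env.action_space.sample()   # 0 or 1: push the cart left or right
        observation, reward, done, info = env.step(action)
        total_reward += reward               # +1 for every timestep the pole stays up
        if done:                             # the pole fell over or the cart left the track
            break
    print('episode reward:', total_reward)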

Setup

  • OpenAI gym
  • TensorFlow
  • Numpy
  • Scipy
  • IPython Notebook

If you already have some of the above libraries installed, manage the dependencies yourself.

If you are setting up a new (possibly virtual) environment, the preferred way to install the above dependencies is Anaconda, a Python distribution that includes many of the most popular Python packages for science, math, engineering, and data analysis.

  1. Install Anaconda: Follow the instructions on the Anaconda download site.
  2. Install TensorFlow: See the Anaconda section of the TensorFlow installation page.
  3. Install OpenAI gym: Follow the official installation documentation.
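
After installation, a quick sanity check is to import each dependency (a sketch; the __version__ attributes are standard for these packages):

    import gym
    import numpy
    import scipy
    import tensorflow

    print('gym:', gym.__version__)
    print('numpy:', numpy.__version__)
    print('scipy:', scipy.__version__)
    print('tensorflow:', tensorflow.__version__)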

Prerequisites

If you are unfamiliar with Numpy or IPython, you should read the tutorial materials from CS231n.

Also, knowing the basics of TensorFlow is required to complete this assignment.

For introductory material on TensorFlow, see the official TensorFlow tutorials.

Feel free to skip these materials if you are already familiar with these libraries.

How to Start

  1. Start IPython: After you clone this repository and install all the dependencies, start the IPython notebook server from the repository's home directory.
  2. Open the assignment: Open HW2_Policy_Graident.ipynb, and it will walk you through completing the assignment.

To-Do

  • [+20] Construct a 2-layer neural network to represent the policy (see the first sketch after this list)

  • [+30] Compute the surrogate loss (second sketch below)

  • [+20] Compute the accumulated discounted rewards at each timestep (third sketch below)

  • [+10] Use a baseline to reduce the variance (fourth sketch below)

  • [+10] Modify the code and write a report comparing the variance and performance before and after adding the baseline (figures are encouraged)

  • [+10] In the function process_paths of class PolicyOptimizer, why do we need to normalize the advantages? That is, what is the purpose of this line:

    p["advantages"] = (a - a.mean()) / (a.std() + 1e-8)

    Include the answer in your report.
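
For the policy network, here is a minimal sketch assuming TensorFlow 1.x-style graphs; the function and variable names are illustrative, not the notebook's exact API:

    import tensorflow as tf

    def build_policy(observation_dim, hidden_dim, action_dim):
        # Batch of observations fed in at sampling/training time.
        observations = tf.placeholder(tf.float32, [None, observation_dim])
        # Layer 1: fully connected with a tanh nonlinearity.
        hidden = tf.layers.dense(observations, hidden_dim, activation=tf.tanh)
        # Layer 2: fully connected to action logits, then softmax over actions.
        logits = tf.layers.dense(hidden, action_dim)
        action_probs = tf.nn.softmax(logits)
        return observations, action_probs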
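
For the surrogate loss, the standard REINFORCE objective maximizes the mean of log π(a_t|s_t) · A_t over sampled timesteps, so we minimize its negative. A sketch that continues the placeholder names from the previous block (again, all names are illustrative):

    actions = tf.placeholder(tf.int32, [None])        # actions actually taken
    advantages = tf.placeholder(tf.float32, [None])   # discounted returns (minus baseline)

    # Pick out pi(a_t | s_t) for each taken action, then take the log.
    indices = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
    log_probs = tf.log(tf.gather_nd(action_probs, indices) + 1e-8)

    # Negated because TensorFlow optimizers minimize, and we want to maximize.
    surrogate_loss = -tf.reduce_mean(log_probs * advantages)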
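
For the accumulated discounted rewards R_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ..., a common trick computes the whole vector at once with a first-order linear filter over the reversed reward sequence; a sketch:

    import numpy as np
    import scipy.signal

    def discount_cumsum(rewards, discount_rate):
        """Return [R_0, R_1, ...] where R_t = r_t + discount_rate * R_{t+1}."""
        rewards = np.asarray(rewards, dtype=np.float64)
        return scipy.signal.lfilter([1.0], [1.0, -discount_rate], rewards[::-1])[::-1]

For example, discount_cumsum([1, 1, 1], 0.99) returns [2.9701, 1.99, 1.0].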
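
For the baseline, one common choice is a state-dependent estimate of the return that is subtracted from the discounted rewards before they are used as advantages; subtracting a baseline leaves the policy-gradient estimate unbiased while reducing its variance. A least-squares linear baseline is sketched below (the feature choice here is an assumption, not the notebook's exact recipe):

    import numpy as np

    class LinearFeatureBaseline(object):
        """Least-squares fit from simple state features to observed returns."""

        def __init__(self):
            self.coeffs = None

        def _features(self, observations):
            obs = np.asarray(observations, dtype=np.float64)
            t = np.arange(len(obs), dtype=np.float64).reshape(-1, 1) / 100.0
            return np.concatenate([obs, obs ** 2, t, t ** 2, np.ones_like(t)], axis=1)

        def fit(self, observations, returns):
            feats = self._features(observations)
            self.coeffs = np.linalg.lstsq(feats, np.asarray(returns), rcond=None)[0]

        def predict(self, observations):
            if self.coeffs is None:
                return np.zeros(len(observations))
            return self._features(observations).dot(self.coeffs)

Advantages are then returns - baseline.predict(observations), computed per sampled path.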

Other

  • Office hour: 2-3 pm in 資電館 (the EECS building) with YenChen Lin.
  • Due on Oct. 17 before class.
