ECEN 743: Reinforcement Learning - Assignment 1

Overview

You have to submit a report in HTML and your code to Canvas.
Put all your files (HTML report and code) into a single compressed folder named Lastname_Firstname_A1.zip.

If you are using Jupyter Notebook, you can export it as HTML from the top toolbar:
```
"File -> Save and Export Notebook As... -> HTML"
```
If you are using Google Colab, you may need a few extra steps to produce an HTML report. Please search for "how to convert ipynb notebook to HTML in Google Colab?".
This homework is self-contained in one Jupyter notebook. In your zip, we expect only your HTML report and one Jupyter notebook.

If you wish to complete this assignment locally (not on Google Colab), you need to install Jupyter Notebook. You can install it with:
```
pip install jupyter notebook
```
In this assignment, you will play around with the famous FrozenLake environment. Please install Gymnasium (you can read more about Gymnasium here).
```
pip install gymnasium
```
It is strongly advised that you learn how to use a virtual environment for Python. A virtual environment is isolated from the system Python and from any other Python releases you have installed system-wide. It helps you manage Python packages cleanly and allows you to install only the packages needed for a particular project. One lightweight virtual environment module is venv (link). Your Python distribution is likely to include it by default. If not, for example on Ubuntu, you can install it with
```
sudo apt-get install python3-venv
```

In this assignment, you will implement planning (dynamic programming) algorithms on the FrozenLake environment from Gymnasium (Link).
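Planning algorithms need the transition model rather than sampled episodes. The sketch below shows one way to read that model from Gymnasium, assuming the default 4x4 map of `FrozenLake-v1`. The helper name `load_frozenlake_model` is hypothetical and not part of the assignment notebook.

```python
def load_frozenlake_model(is_slippery=True):
    """Return (P, n_states, n_actions) for the default 4x4 FrozenLake map.

    Hypothetical helper for illustration; the notebook may expose the
    model in a different way.
    """
    # Imported inside the function so the import cost (and any ImportError)
    # only occurs when the helper is actually called.
    import gymnasium as gym

    env = gym.make("FrozenLake-v1", is_slippery=is_slippery)
    # The tabular model lives on the unwrapped environment:
    # P[s][a] is a list of (probability, next_state, reward, terminated).
    P = env.unwrapped.P
    n_states = int(env.observation_space.n)
    n_actions = int(env.action_space.n)
    env.close()
    return P, n_states, n_actions
```

Calling `P, n_states, n_actions = load_frozenlake_model()` and printing `P[0][0]` lists the possible outcomes of taking action 0 in the start state.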

Q-Value Iteration (QVI): Implement Q-value iteration on the FrozenLake environment.
(a). What are the optimal policy and value function?
(b). Plot $U_k = \|Q_k - Q_{k-1}\|$, where $Q_k$ is the Q-value estimate at the $k^{\mathrm{th}}$ iteration.
(c). Use the fancy_visual function to plot the heat maps of the optimal policy and value function.
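As a rough illustration of the QVI update, here is a minimal sketch on a hypothetical three-state toy MDP, not on FrozenLake itself. It assumes the model is stored in the same format as Gymnasium's `P[s][a]`, a list of `(probability, next_state, reward, terminated)` tuples. The discount factor `gamma=0.9`, the stopping tolerance, and the choice of the max norm for $U_k$ are all assumptions, since the assignment does not fix them.

```python
def q_value_iteration(P, n_states, n_actions, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Return (Q, history), where history[k] is the max-norm change U_k."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    history = []
    for _ in range(max_iters):
        Q_new = [[0.0] * n_actions for _ in range(n_states)]
        for s in range(n_states):
            for a in range(n_actions):
                for prob, s_next, reward, terminated in P[s][a]:
                    # Do not bootstrap from a transition that ends the episode.
                    future = 0.0 if terminated else max(Q[s_next])
                    Q_new[s][a] += prob * (reward + gamma * future)
        # Max norm of the update (an assumed choice of norm).
        u_k = max(abs(Q_new[s][a] - Q[s][a])
                  for s in range(n_states) for a in range(n_actions))
        history.append(u_k)
        Q = Q_new
        if u_k < tol:
            break
    return Q, history


def greedy_policy(Q):
    """Pick the first maximizing action in every state."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]


# Hypothetical toy MDP: states 0 and 1 are ordinary, state 2 is terminal.
# Action 1 moves "forward" (0 -> 1 -> 2); reaching state 2 pays reward 1.
TOY_P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

Q, history = q_value_iteration(TOY_P, n_states=3, n_actions=2)
policy = greedy_policy(Q)
V = [max(row) for row in Q]
```

The list `history` is what part (b) asks you to plot against the iteration index, and `V` and `policy` are the quantities asked for in part (a), here for the toy problem only.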
Policy Evaluation: Consider the following policies: $(i)$ the optimal policy obtained from QVI, and $(ii)$ a uniformly random policy where each action is taken with equal probability. Compute the value of these policies by:
(a). Solving a linear system of equations.
(b). Using the iterative approach.
(c). Which method is better and why?
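Both evaluation methods come from the Bellman equation for a fixed policy, $V^\pi = r_\pi + \gamma P_\pi V^\pi$. The sketch below compares them on a hypothetical three-state toy MDP stored in Gymnasium's `P[s][a]` format. The function names, `gamma=0.9`, and the tolerance are assumptions for illustration, and NumPy is assumed to be installed.

```python
import numpy as np


def policy_model(P, policy, n_states, n_actions):
    """Build P_pi and r_pi for a stochastic policy given as policy[s][a]."""
    P_pi = np.zeros((n_states, n_states))
    r_pi = np.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            for prob, s_next, reward, terminated in P[s][a]:
                r_pi[s] += policy[s][a] * prob * reward
                # Transitions that end the episode carry no future value.
                if not terminated:
                    P_pi[s, s_next] += policy[s][a] * prob
    return P_pi, r_pi


def evaluate_by_linear_solve(P_pi, r_pi, gamma):
    """Solve (I - gamma * P_pi) V = r_pi directly."""
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)


def evaluate_iteratively(P_pi, r_pi, gamma, tol=1e-10, max_iters=100_000):
    """Repeat V <- r_pi + gamma * P_pi V until the max-norm change is below tol."""
    V = np.zeros_like(r_pi)
    for _ in range(max_iters):
        V_new = r_pi + gamma * (P_pi @ V)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical toy MDP: states 0 and 1 are ordinary, state 2 is terminal.
TOY_P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}
uniform = [[0.5, 0.5] for _ in range(3)]  # each action with equal probability

P_pi, r_pi = policy_model(TOY_P, uniform, n_states=3, n_actions=2)
V_direct = evaluate_by_linear_solve(P_pi, r_pi, gamma=0.9)
V_iter = evaluate_iteratively(P_pi, r_pi, gamma=0.9)
```

On this toy problem the two results agree to within the stopping tolerance. Which approach is preferable on larger problems is the subject of part (c).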
Policy Iteration (PI): Implement policy iteration on the FrozenLake environment.
(a). What are the optimal policy and value function?
(b). Compare the convergence of QVI and PI.
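Policy iteration alternates between evaluating the current policy and making it greedy with respect to that evaluation. The following is a minimal sketch on a hypothetical three-state toy MDP in Gymnasium's `P[s][a]` format. It assumes `gamma=0.9`, uses in-place iterative sweeps for the evaluation step (a linear solve would also work), and counts improvement rounds as one possible measure for comparing convergence with QVI.

```python
def backup(P, V, s, a, gamma):
    """Expected one-step return of taking action a in state s, then following V."""
    return sum(prob * (reward + (0.0 if terminated else gamma * V[s_next]))
               for prob, s_next, reward, terminated in P[s][a])


def policy_iteration(P, n_states, n_actions, gamma=0.9, eval_tol=1e-10):
    """Return (policy, V, n_rounds) for a deterministic policy."""
    policy = [0] * n_states  # arbitrary starting policy (an assumption)
    V = [0.0] * n_states
    n_rounds = 0
    while True:
        # Policy evaluation: in-place sweeps until values stop changing.
        while True:
            delta = 0.0
            for s in range(n_states):
                v = backup(P, V, s, policy[s], gamma)
                delta = max(delta, abs(v - V[s]))
                V[s] = v
            if delta < eval_tol:
                break
        # Policy improvement: switch only on a strict gain, so ties cannot cycle.
        stable = True
        for s in range(n_states):
            q = [backup(P, V, s, a, gamma) for a in range(n_actions)]
            best = max(range(n_actions), key=q.__getitem__)
            if q[best] > q[policy[s]] + 1e-12:
                policy[s] = best
                stable = False
        n_rounds += 1
        if stable:
            return policy, V, n_rounds


# Hypothetical toy MDP: states 0 and 1 are ordinary, state 2 is terminal.
TOY_P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}

pi_policy, pi_V, pi_rounds = policy_iteration(TOY_P, n_states=3, n_actions=2)
```

Note that one PI round includes a full policy evaluation, so round counts and QVI iteration counts are not directly equivalent units of work.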