Atul-Acharya-17 / Markov-Decision-Process

Solving a Markov Decision Process using Value Iteration, Policy Iteration, SARSA, Expected SARSA and Q-Learning


Markov Decision Process

Problem Description


[Figure: the 6x6 maze environment]

The maze environment is a 6x6 grid world containing walls, rewards and penalties. All green squares carry a reward of +1, all orange/red squares a penalty of -1, and all white squares a reward of -0.04. All grey squares are walls and are unreachable states, so their reward need not be defined; it has been set to 0 for this experiment.
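A minimal sketch of how the grid and its rewards could be encoded. The wall, green and red cell positions below are hypothetical placeholders for illustration, not the repo's actual maze layout:

import numpy as np

SIZE = 6
R_WHITE, R_GREEN, R_RED, R_WALL = -0.04, 1.0, -1.0, 0.0

# Hypothetical cell positions, for illustration only.
walls = {(1, 1), (2, 4)}
greens = {(0, 5), (3, 3)}
reds = {(1, 3), (4, 4)}

rewards = np.full((SIZE, SIZE), R_WHITE)
for cell in greens:
    rewards[cell] = R_GREEN
for cell in reds:
    rewards[cell] = R_RED
for cell in walls:
    rewards[cell] = R_WALL  # walls are unreachable, so this value is never used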


[Figure: the transition model]

The transition model is as follows: the intended outcome occurs with probability 0.8, and with probability 0.1 each the agent instead moves at a right angle to the intended direction. If the move would make the agent walk into a wall, the agent stays in the same place as before. The rewards are -0.04 for the white squares, +1 for the green squares, and -1 for the orange/red squares. Note that there are no terminal states; the agent’s state sequence is infinite.
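This transition model can be captured in a few lines. The sketch below is illustrative rather than the repo's implementation; moves that would leave the grid are treated the same way as moves into a wall:

ACTIONS = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
PERPENDICULAR = {'up': ('left', 'right'), 'down': ('left', 'right'),
                 'left': ('up', 'down'), 'right': ('up', 'down')}

def transitions(state, action, walls, size=6):
    """Return (probability, next_state) pairs for taking action in state."""
    def move(a):
        dr, dc = ACTIONS[a]
        r, c = state[0] + dr, state[1] + dc
        if not (0 <= r < size and 0 <= c < size) or (r, c) in walls:
            return state  # blocked by a wall or the grid edge: stay in place
        return (r, c)
    side_a, side_b = PERPENDICULAR[action]
    # 0.8 for the intended direction, 0.1 for each perpendicular slip
    return [(0.8, move(action)), (0.1, move(side_a)), (0.1, move(side_b))]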


Note

There is no terminal state. For some reinforcement learning algorithms, the episode therefore ends after a fixed number of steps.
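For illustration, here is how a step-capped episode could look for Q-learning, reusing ACTIONS, transitions() and rewards from the sketches above. The hyperparameters (alpha, gamma, epsilon, max_steps) are assumptions rather than the repo's settings, and the reward is assumed to be collected on entering the next state:

import random
from collections import defaultdict

def q_learning_episode(Q, start, rewards, walls,
                       alpha=0.1, gamma=0.9, epsilon=0.1, max_steps=100):
    """One Q-learning episode, truncated after max_steps in lieu of a terminal state."""
    state = start
    for _ in range(max_steps):
        # epsilon-greedy action selection
        if random.random() < epsilon:
            action = random.choice(list(ACTIONS))
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        # sample the next state from the 0.8 / 0.1 / 0.1 transition model
        outcomes = transitions(state, action, walls)
        next_state = random.choices([s for _, s in outcomes],
                                    weights=[p for p, _ in outcomes])[0]
        reward = rewards[next_state]
        # off-policy TD update toward the greedy target
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

# Usage: Q = defaultdict(float); q_learning_episode(Q, (0, 0), rewards, walls)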


Getting Started

Install Requirements

pip install -r requirements.txt

Run Code

python __main__.py --algorithm=value_iteration --display_policy=True --display_utilities=True

Choice of Algorithms

  • value_iteration (a sketch of its Bellman update follows this list)
  • policy_iteration
  • sarsa
  • expected_sarsa
  • q_learning
  • monte_carlo
  • dyna_q
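
As a reference point for the results below, this is a sketch of the value-iteration update U(s) = R(s) + gamma * max_a sum_s' P(s' | s, a) U(s'), reusing ACTIONS, transitions() and rewards from the sketches above. The discount factor and convergence threshold are assumptions, not the repo's settings:

import numpy as np

def value_iteration(rewards, walls, gamma=0.9, theta=1e-6, size=6):
    """Apply in-place Bellman updates until the largest utility change is below theta."""
    U = np.zeros((size, size))
    states = [(r, c) for r in range(size) for c in range(size) if (r, c) not in walls]
    while True:
        delta = 0.0
        for s in states:
            best = max(sum(p * U[s2] for p, s2 in transitions(s, a, walls, size))
                       for a in ACTIONS)
            new_u = rewards[s] + gamma * best
            delta = max(delta, abs(new_u - U[s]))
            U[s] = new_u
        if delta < theta:
            return U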

Results

Value Iteration

Policy

[Figure: value iteration policy]

Utilities

[Figure: value iteration utilities]

Plot

[Plot for value iteration]

Policy Iteration

Policy

[Figure: policy iteration policy]

Utilities

[Figure: policy iteration utilities]

Plot

[Plot for policy iteration]
