addy1997 / RL-Algorithms

This repository contains reinforcement-learning (RL) algorithms implemented in Python.

Home Page: http://Adwait1997.github.io/RL-Algorithms



Table of Contents

  • Algorithm


Theory

SARSA, or State-Action-Reward-State-Action, is an on-policy TD(0) control algorithm in reinforcement learning. It follows the Generalised Policy Iteration (GPI) strategy: the policy π is made greedy with respect to the current state-action value estimate, and that estimate is in turn improved toward the true values of π. Our aim is to estimate Qπ(s, a) for the current policy π and all state-action (s, a) pairs.

  • We learn the state-action value function Q(s, a) rather than the state-value function V(s).

  • Here, qπ(s, a) is the estimate under the current behaviour policy π for all state-action pairs (s, a).

  • Initialise a suitable starting state S (S must not be a terminal state).

  • Choose an action A from S under an ε-greedy (or ε-soft) policy derived from Q.

  • Take action A, then record the reward R and the next state S′; choose the next action A′ from S′ under the same policy.

  • Update the function -> Q(S, A) ← Q(S, A) + α[R + γQ(S′, A′) − Q(S, A)]

  • This loop runs until a terminal state is reached; for a terminal state, Q(S′, A′) is taken to be 0 (see the sketch below).

SARSA update rule

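Below is a minimal sketch of tabular SARSA in Python. The toy chain environment, the ε-greedy helper, and all hyperparameter values are illustrative assumptions for this sketch, not code from this repository:

```python
import numpy as np

# Toy chain environment (illustrative assumption, not from this repo):
# states 0..N-1, actions 0 = left, 1 = right; reaching state N-1 ends
# the episode with reward +1, every other step gives reward 0.
N_STATES, N_ACTIONS = 6, 2

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

def eps_greedy(Q, s, eps, rng):
    # Behaviour policy: random action with probability eps, else greedy.
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[s]))

def sarsa(episodes=500, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        s = 0                                  # initial (non-terminal) state
        a = eps_greedy(Q, s, eps, rng)         # choose A from S via ε-greedy
        done = False
        while not done:
            s2, r, done = step(s, a)           # take A; observe R and S'
            a2 = eps_greedy(Q, s2, eps, rng)   # choose A' from S' (same policy)
            target = r if done else r + gamma * Q[s2, a2]  # terminal Q is 0
            Q[s, a] += alpha * (target - Q[s, a])          # SARSA update
            s, a = s2, a2
    return Q
```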

Q-learning, similar to SARSA, is a TD(0) control method, but an off-policy one. Both algorithms estimate action values Q(s, a) for all the state-action pairs involved in the task; SARSA estimates Qπ(s, a) for the behaviour policy π, while Q-learning directly approximates the optimal action-value function regardless of the policy being followed.

Q-learning Algorithm

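A corresponding Q-learning loop, reusing `step`, `eps_greedy`, and the constants from the SARSA sketch above (again an illustrative sketch under the same assumptions, not this repository's implementation). The only structural change is that the update bootstraps from the greedy action in S′ rather than from the action actually taken next:

```python
def q_learning(episodes=500, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = eps_greedy(Q, s, eps, rng)      # behave ε-greedily...
            s2, r, done = step(s, a)
            # ...but bootstrap from the greedy (max) action in S'.
            target = r if done else r + gamma * np.max(Q[s2])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```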

Q-learning vs SARSA

The only difference is how the next action used in the update is chosen. In SARSA, the action a′ used to go from the current state to the next state is selected by the same policy π (the behavioural policy) and is the action actually taken. In Q-learning, the update instead uses the greedy action in the next state, i.e., maxₐ Q(s′, a), regardless of which action the behaviour policy actually takes next; there is thus less chance of a random action influencing the update, so the update leans more toward exploitation than exploration.
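The contrast is easiest to see in the two update targets side by side; the Q-table and transition values below are made-up numbers purely for illustration:

```python
import numpy as np

Q = np.array([[0.2, 0.5],
              [0.4, 0.1]])             # toy Q-table (assumed values)
r, gamma, s2, a2 = 1.0, 0.9, 1, 1      # a2: action the ε-greedy policy picked in s2

target_sarsa = r + gamma * Q[s2, a2]          # 1 + 0.9 * 0.1 = 1.09 (follows the policy)
target_qlearning = r + gamma * np.max(Q[s2])  # 1 + 0.9 * 0.4 = 1.36 (greedy over actions)
```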

Q-learning update rule

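In the notation used earlier, the Q-learning update pictured above is Q(S, A) ← Q(S, A) + α[R + γ maxₐ Q(S′, a) − Q(S, A)] (stated here as well in case the image does not render). As a quick worked instance with made-up numbers: for Q(S, A) = 0.5, α = 0.1, R = 1, γ = 0.9, and maxₐ Q(S′, a) = 0.8, the update gives Q(S, A) ← 0.5 + 0.1 × (1 + 0.72 − 0.5) = 0.622.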

Algorithm

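To tie the sketches together, a short usage example (still under the toy-environment assumptions above) that trains both agents and prints the greedy policy each one learns:

```python
if __name__ == "__main__":
    Q_sarsa = sarsa()
    Q_qlearn = q_learning()
    # Greedy action per state: 0 = left, 1 = right.
    print("SARSA greedy policy:     ", np.argmax(Q_sarsa, axis=1))
    print("Q-learning greedy policy:", np.argmax(Q_qlearn, axis=1))
```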