addy1997 / RL-Algorithms

This repository contains reinforcement-learning (RL) algorithms implemented in Python.

Home Page: http://Adwait1997.github.io/RL-Algorithms



Table of Contents

  • Algorithm


Theory

SARSA, or State-Action-Reward-State-Action, is an on-policy TD(0) control algorithm in reinforcement learning. It follows the Generalised Policy Iteration (GPI) strategy: the policy π is made greedy with respect to the current state-action value estimate, and that estimate is in turn improved toward the true values of π. Our aim is to estimate Qπ(s, a) for the current policy π and all state-action (s, a) pairs.

  • We learn the state-action value function Q(s, a) rather than the state-value function V(s).

  • Here, qπ(s, a) is the estimate under the current behaviour policy π for all state-action pairs (s, a).

  • Initialise a suitable starting state S (S must not be a terminal state).

  • Choose an action A from S under an ε-greedy (or ε-soft) policy derived from Q.

  • Take action A, then record the reward R and the next state S′; choose the next action A′ from S′ under the same policy.

  • Update the function -> Q(S, A) ← Q(S, A) + α[R + γQ(S′, A′) − Q(S, A)]

  • This loop runs until a terminal state is reached; for a terminal state, Q(S′, A′) is taken to be 0 (see the sketch below).

SARSA update rule

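Below is a minimal sketch of tabular SARSA in Python. The toy chain environment, the ε-greedy helper, and all hyperparameter values are illustrative assumptions for this sketch, not code from this repository:

```python
import numpy as np

# Toy chain environment (illustrative assumption, not from this repo):
# states 0..N-1, actions 0 = left, 1 = right; reaching state N-1 ends
# the episode with reward +1, every other step gives reward 0.
N_STATES, N_ACTIONS = 6, 2

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    done = (s2 == N_STATES - 1)
    return s2, (1.0 if done else 0.0), done

def eps_greedy(Q, s, eps, rng):
    # Behaviour policy: random action with probability eps, else greedy.
    if rng.random() < eps:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(Q[s]))

def sarsa(episodes=500, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        s = 0                                  # initial (non-terminal) state
        a = eps_greedy(Q, s, eps, rng)         # choose A from S via ε-greedy
        done = False
        while not done:
            s2, r, done = step(s, a)           # take A; observe R and S'
            a2 = eps_greedy(Q, s2, eps, rng)   # choose A' from S' (same policy)
            target = r if done else r + gamma * Q[s2, a2]  # terminal Q is 0
            Q[s, a] += alpha * (target - Q[s, a])          # SARSA update
            s, a = s2, a2
    return Q
```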

Q-learning, similar to SARSA, is a TD(0) control method, but an off-policy one. Both algorithms estimate action values Q(s, a) for all the state-action pairs involved in the task; SARSA estimates Qπ(s, a) for the behaviour policy π, while Q-learning directly approximates the optimal action-value function regardless of the policy being followed.

Q-learning Algorithm

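A corresponding Q-learning loop, reusing `step`, `eps_greedy`, and the constants from the SARSA sketch above (again an illustrative sketch under the same assumptions, not this repository's implementation). The only structural change is that the update bootstraps from the greedy action in S′ rather than from the action actually taken next:

```python
def q_learning(episodes=500, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = np.random.default_rng(seed)
    Q = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = eps_greedy(Q, s, eps, rng)      # behave ε-greedily...
            s2, r, done = step(s, a)
            # ...but bootstrap from the greedy (max) action in S'.
            target = r if done else r + gamma * np.max(Q[s2])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s2
    return Q
```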

Q-learning vs SARSA

The only difference is how the next action used in the update is chosen. In SARSA, the action a′ used to go from the current state to the next state is selected by the same policy π (the behavioural policy) and is the action actually taken. In Q-learning, the update instead uses the greedy action in the next state, i.e., maxₐ Q(s′, a), regardless of which action the behaviour policy actually takes next; there is thus less chance of a random action influencing the update, so the update leans more toward exploitation than exploration.
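The contrast is easiest to see in the two update targets side by side; the Q-table and transition values below are made-up numbers purely for illustration:

```python
import numpy as np

Q = np.array([[0.2, 0.5],
              [0.4, 0.1]])             # toy Q-table (assumed values)
r, gamma, s2, a2 = 1.0, 0.9, 1, 1      # a2: action the ε-greedy policy picked in s2

target_sarsa = r + gamma * Q[s2, a2]          # 1 + 0.9 * 0.1 = 1.09 (follows the policy)
target_qlearning = r + gamma * np.max(Q[s2])  # 1 + 0.9 * 0.4 = 1.36 (greedy over actions)
```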

Q-learning update rule

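In the notation used earlier, the Q-learning update pictured above is Q(S, A) ← Q(S, A) + α[R + γ maxₐ Q(S′, a) − Q(S, A)] (stated here as well in case the image does not render). As a quick worked instance with made-up numbers: for Q(S, A) = 0.5, α = 0.1, R = 1, γ = 0.9, and maxₐ Q(S′, a) = 0.8, the update gives Q(S, A) ← 0.5 + 0.1 × (1 + 0.72 − 0.5) = 0.622.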

Algorithm

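To tie the sketches together, a short usage example (still under the toy-environment assumptions above) that trains both agents and prints the greedy policy each one learns:

```python
if __name__ == "__main__":
    Q_sarsa = sarsa()
    Q_qlearn = q_learning()
    # Greedy action per state: 0 = left, 1 = right.
    print("SARSA greedy policy:     ", np.argmax(Q_sarsa, axis=1))
    print("Q-learning greedy policy:", np.argmax(Q_qlearn, axis=1))
```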