Solving the traditional CartPole problem in Reinforcement Learning with Deep NNs in the Deep Q Learning Framework
Let $S$ be the set of all possible states and $A$ the set of all possible actions.
The reward distribution defined on the set of all state-action pairs is denoted by $R$.
We want to find the optimal policy $\pi^{*}$.
Problem Statement
Find $\pi^{*} : S \rightarrow A$ such that $\pi^{*} = \arg\max_{\pi} \mathbb{E}\left[ \sum_{t \geq 0} \gamma^{t} R_{t} \mid \pi \right]$, where $\gamma \in (0, 1)$ is a discount factor and $R_{t}$ is the reward received at time step $t$.
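To make the objective concrete, here is a minimal sketch of the discounted return being maximised. The value $\gamma = 0.99$ and the five-step episode are illustrative; the only CartPole-specific fact used is that the environment pays a reward of $+1$ for every step the pole stays upright.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_{t >= 0} gamma^t * R_t for a finite reward sequence."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# CartPole pays +1 per surviving step, so a 5-step episode yields
print(discounted_return([1, 1, 1, 1, 1]))  # 1 + 0.99 + ... + 0.99^4 ≈ 4.901
```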
Given a state, the value function $V^{\pi} : S \rightarrow \mathbb{R}$ gives the expected cumulative discounted reward achieved by following the policy $\pi$ from that state: $V^{\pi}(s) = \mathbb{E}\left[ \sum_{t \geq 0} \gamma^{t} R_{t} \mid s_{0} = s, \pi \right]$.
The Q-value function $Q^{\pi} : S \times A \rightarrow \mathbb{R}$ takes this one step further, giving the expected cumulative discounted reward achieved by the policy after first taking a given action in a given state: $Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{t \geq 0} \gamma^{t} R_{t} \mid s_{0} = s, a_{0} = a, \pi \right]$.
We approximate the optimal Q-value function $Q^{*} = Q^{\pi^{*}}$ with a deep neural network (DNN), whose weights are learned from the agent's experience.
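As an illustration of what such an approximator might look like, here is a minimal sketch in PyTorch; the framework choice and the layer sizes are assumptions made for this sketch. CartPole's state is 4-dimensional (cart position, cart velocity, pole angle, pole angular velocity) and there are 2 actions (push left or right), so the network maps a state to one estimated Q-value per action.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """DNN approximation of the Q-value function: maps a state to
    one estimated Q-value per action."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state):
        return self.net(state)
```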
The strategy is as follows.
- Initialise the DNN Q-value function with random weights.
- Run the agent on the game, at each step taking the action that maximises the expected reward according to the current DNN Q-value function, and record the resulting transitions. (Of course, the DNN Q-value function is not yet the true Q-value function.)
- We fit the DNN Q-value function (the experience-replay method) according to the rewards actually observed in a sampled mini-batch of recorded transitions. In this way we train the DNN Q-value function to better approximate the true Q-value.
- Train the DNN Q-value function on mini-batches so that we get sufficient training data to make this approximation as accurate as possible (i.e. repeat the previous two steps; see the training-loop sketch after this list).
- Run our agent so that at every state it takes the action that maximises the learned Q-value function.
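Putting the steps together, the following is a condensed sketch of the whole loop, assuming PyTorch, the gymnasium `CartPole-v1` environment, and the `QNetwork` class sketched above; every hyperparameter value is illustrative. Two standard ingredients not spelled out in the list are made explicit: epsilon-greedy exploration (otherwise the agent never tries actions its current estimates undervalue) and the one-step TD target $r + \gamma \max_{a'} Q(s', a')$ that the mini-batch regression fits. The separate target network used in full DQN is omitted to keep the sketch short.

```python
import random
from collections import deque

import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn

# Illustrative hyperparameters; none of these values come from the text.
GAMMA, LR, BATCH_SIZE, EPSILON, EPISODES = 0.99, 1e-3, 64, 0.1, 300

env = gym.make("CartPole-v1")
q_net = QNetwork()                  # the network sketched earlier
optimizer = torch.optim.Adam(q_net.parameters(), lr=LR)
buffer = deque(maxlen=10_000)       # experience-replay memory

for episode in range(EPISODES):
    state, _ = env.reset()
    done = False
    while not done:
        # Act greedily w.r.t. the current DNN Q-value estimates,
        # taking a random action with probability EPSILON instead.
        if random.random() < EPSILON:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                q_values = q_net(torch.as_tensor(state, dtype=torch.float32))
            action = int(q_values.argmax())

        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        buffer.append((state, action, reward, next_state, terminated))
        state = next_state

        # Fit the DNN Q-value function on a replayed mini-batch of stored
        # transitions, regressing each Q(s, a) towards the one-step TD
        # target r + gamma * max_a' Q(s', a'), zeroed past terminal states.
        if len(buffer) >= BATCH_SIZE:
            batch = random.sample(buffer, BATCH_SIZE)
            s, a, r, s2, term = (
                torch.as_tensor(np.array(x), dtype=torch.float32)
                for x in zip(*batch)
            )
            q_pred = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
            with torch.no_grad():
                target = r + GAMMA * q_net(s2).max(dim=1).values * (1 - term)
            loss = nn.functional.mse_loss(q_pred, target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

After training, the final bullet amounts to running the same action-selection line with the exploration probability set to zero, i.e. always taking `int(q_values.argmax())`.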