CartpoleActorCritic

Actor Critic Implementation in CartPole-v1 environment from OpenAI Gym

This implementation steadily earns larger rewards, frequently reaching the maximum return of 500 in CartPole-v1. I use cosine annealing (without restarts) to decay the learning rate over time, which helps keep training stable. Gradients are also clipped to avoid large, abrupt changes to the policy or value function.

I found that using two smaller networks (one for the actor, one for the critic) worked better than a single large network. One explanation I found is that overparameterized networks in reinforcement learning tend to make training highly unstable. Another way to put it is that the critic's parameters can be quite noisy, especially early in training, so it feeds unstable, noisy value estimates to the actor, which produces an erratic policy.

Another crucial change was detecting when an episode ends because of gym's time limit rather than a policy failure. If these timeouts are not caught, the critic treats otherwise good states as terminal, and that bias pushes the policy toward suboptimal behavior. Sketches of these pieces follow below.
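As a rough sketch of these pieces (assuming a PyTorch-style setup; the layer sizes, learning rate, schedule length, and clipping threshold below are illustrative placeholders, not the repo's actual values), separate actor and critic networks with cosine annealing and gradient clipping might look like:

```python
# Minimal sketch, assuming PyTorch. Hyperparameter values are illustrative.
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Small policy network: state -> action logits."""
    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """Small value network: state -> scalar state value."""
    def __init__(self, obs_dim=4, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)

actor, critic = Actor(), Critic()
optimizer = torch.optim.Adam(
    list(actor.parameters()) + list(critic.parameters()), lr=3e-4
)
# Cosine annealing without restarts: the LR decays smoothly to eta_min over T_max updates.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=1000, eta_min=1e-5
)

def update(loss, max_grad_norm=0.5):
    optimizer.zero_grad()
    loss.backward()
    # Clip gradients so a single noisy update cannot swing the policy or value function too far.
    nn.utils.clip_grad_norm_(actor.parameters(), max_grad_norm)
    nn.utils.clip_grad_norm_(critic.parameters(), max_grad_norm)
    optimizer.step()
    scheduler.step()
```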
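The timeout handling can be sketched roughly as below (a hedged example: newer gym versions return separate `terminated` and `truncated` flags from `env.step`, while older versions report the timeout through `info["TimeLimit.truncated"]`; the function name and discount factor here are illustrative):

```python
# Minimal sketch of bootstrapping through time-limit truncations.
import torch

def td_target(reward, next_obs, terminated, truncated, critic, gamma=0.99):
    """Only a true failure gets a zero future value; timeouts still bootstrap."""
    with torch.no_grad():
        next_value = critic(next_obs)
    if terminated and not truncated:
        # The pole actually fell (or the cart left the track): no future reward.
        return torch.tensor(float(reward))
    # The episode ended only because of gym's 500-step time limit (or is still running):
    # treat the state as non-terminal and bootstrap from the critic's estimate.
    return reward + gamma * next_value
```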

I found that the most impactful hyperparameters to tune were the learning rate, the learning rate decay rate, the entropy coefficient, and the gradient clipping parameter.
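For reference, a hedged sketch of where the entropy coefficient fits into a combined actor-critic loss (the coefficient values shown are placeholders, not the tuned values from this repo):

```python
# Minimal sketch of an actor-critic loss with an entropy bonus, assuming PyTorch.
import torch

def actor_critic_loss(logits, action, value, target,
                      entropy_coef=0.01, value_coef=0.5):
    dist = torch.distributions.Categorical(logits=logits)
    advantage = (target - value).detach()          # advantage does not backprop into the critic here
    policy_loss = -dist.log_prob(action) * advantage
    value_loss = (target.detach() - value).pow(2)  # critic regresses toward the TD target
    entropy_bonus = dist.entropy()                 # encourages exploration, scaled by entropy_coef
    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
```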

Languages

Language: Python 100.0%