dldnxks12 / DRQN-Pytorch-CartPole-v1

Deep recurrent Q learning on CartPole-v1 environment

Deep Recurrent Q-learning (DRQN) with PyTorch

Reference: Hausknecht & Stone, "Deep Recurrent Q-Learning for Partially Observable MDPs" — https://arxiv.org/pdf/1507.06527.pdf

  1. PyTorch (1.5.0)
  2. OpenAI Gym (0.17.1)
  3. TensorBoard (2.1.0)

Training environment: OpenAI Gym (CartPole-v1)

POMDP

  • The CartPole-v1 observation consists of the cart's position & velocity and the pole's angle & angular velocity.

  • I set the partially observed state to the cart's position and the pole's angle; the agent has no information about the velocities (see the sketch below).
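
A minimal sketch of how such masking can be done (not necessarily the repo's exact code); it assumes gym 0.17.1's CartPole-v1 observation layout [cart position, cart velocity, pole angle, pole angular velocity], and the partial_obs helper is illustrative:

```python
import gym
import numpy as np

# Illustrative only: indices assume CartPole-v1's observation layout
# [cart position, cart velocity, pole angle, pole angular velocity].
POMDP_IDX = [0, 2]  # keep cart position and pole angle, drop both velocities

def partial_obs(full_obs):
    """Mask out the velocity components to create a partially observed state."""
    return np.asarray(full_obs)[POMDP_IDX]

env = gym.make("CartPole-v1")
obs = partial_obs(env.reset())                      # shape (2,) instead of (4,)
for _ in range(10):
    action = env.action_space.sample()
    next_obs, reward, done, info = env.step(action)  # gym 0.17 step API
    obs = partial_obs(env.reset()) if done else partial_obs(next_obs)
```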

Stable Recurrent Updates

1. Bootstrapped Sequential Updates

  • Episodes are selected randomly from the replay memory, and the update starts at the beginning of the episode. The targets at each timestep are generated from the target Q-network, and the RNN's hidden state is carried forward throughout the episode (a sketch follows below).
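
A rough sketch of this scheme, not the repo's code: it assumes a DRQN whose forward pass takes (obs, hidden) and returns (q_values, hidden), an init_hidden helper, and a replay memory that stores whole episodes as lists of (obs, action, reward, next_obs, done) tensors.

```python
import random
import torch
import torch.nn.functional as F

def sequential_update(q_net, target_net, memory, optimizer, gamma=0.99):
    """One bootstrapped sequential update over a single whole episode."""
    episode = random.choice(memory)                    # memory: list of episodes
    hidden = q_net.init_hidden(batch_size=1)           # RNN state reset only at t = 0
    target_hidden = target_net.init_hidden(batch_size=1)

    loss = torch.tensor(0.0)
    for obs, action, reward, next_obs, done in episode:
        q, hidden = q_net(obs.view(1, 1, -1), hidden)  # hidden carried through the episode
        with torch.no_grad():
            q_next, target_hidden = target_net(next_obs.view(1, 1, -1), target_hidden)
            target = reward + gamma * q_next.max() * (1.0 - done)
        loss = loss + F.smooth_l1_loss(q.view(-1)[action], target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```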

2. Bootstrapped Random Updates

  • Episodes are selected randomly from the replay memory, and the update starts at a random point in the episode and proceeds for only a fixed number of unroll timesteps (lookup_step). The targets at each timestep are generated from the target Q-network, and the RNN's initial hidden state is zeroed at the start of the update.

  • These parameters configure the DRQN setting: random_update chooses which of the two update schemes is used.
  • lookup_step sets how many timesteps are unrolled per update. I found that a longer lookup_step works better (see the sketch below).
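
A minimal sketch of the sampling side of bootstrapped random updates, assuming the same episodic replay memory as above; the parameter values shown are illustrative, not the repo's defaults.

```python
import random
import numpy as np

random_update = True   # False -> sequential updates over whole episodes
lookup_step = 20       # number of timesteps unrolled per update

def sample_sequence(memory, lookup_step):
    """Pick a random episode, then a random window of lookup_step transitions from it."""
    candidates = [ep for ep in memory if len(ep) >= lookup_step]
    episode = random.choice(candidates)
    start = np.random.randint(0, len(episode) - lookup_step + 1)
    return episode[start:start + lookup_step]

# The update itself is the same loop as in the sequential sketch above, except that
# it runs only over the sampled window and the RNN hidden state is zeroed at the
# start of that window rather than at the start of the episode.
```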

DQN with Fully Observed MDP vs. DQN with POMDP vs. DRQN with POMDP


  • (orange) DQN with the fully observed MDP reaches the highest reward.
  • (blue) DQN with the POMDP never reaches a high reward.
  • (red) DRQN with the POMDP reaches reasonable performance even though it can only observe the cart's position and the pole's angle.

TODO

  • Random update of DRQN
