Beronx86 / Deep-Reinforcement-Learning-Survey


Deep Reinforcement Learning survey

This paper list is a bit different from others: I'll add my own opinions and summaries to each entry. To fully understand a paper, though, you still have to read it yourself!
Of course, any pull requests or discussion are welcome!

Outline

Papers

  • Deep Reinforcement Learning with Double Q-learning [AAAI 2016]
    • Hado van Hasselt, Arthur Guez, David Silver
    • Deals with the overestimation of Q-values in standard Q-learning
    • Decouples the Q-network used for action selection from the one used for value estimation (see the sketch below)
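
A minimal sketch of the Double DQN target computation (the `online_q` and `target_q` callables and the array shapes are illustrative assumptions, not the authors' code):

```python
import numpy as np

def double_dqn_targets(rewards, next_states, dones, online_q, target_q, gamma=0.99):
    """Compute Double DQN targets for a batch of transitions."""
    # The online network selects the greedy action ...
    best_actions = np.argmax(online_q(next_states), axis=1)
    # ... while the target network evaluates that action,
    # which reduces the overestimation bias of plain max-Q targets.
    next_values = target_q(next_states)[np.arange(len(best_actions)), best_actions]
    return rewards + gamma * (1.0 - dones) * next_values
```
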
  • Playing Atari with Deep Reinforcement Learning [NIPS 2013 Deep Learning Workshop]
    • Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra
  • Human-level control through deep reinforcement learning [Nature 2015]
    • Most optimization algorithms assume that the samples are independently and identically distributed, while in reinforcement learning the data is a sequence of actions, which breaks this assumption.
    • Experience replay (off-policy); sketched below
    • Iteratively updates the Q-values toward targets from a periodically updated target network
    • Source code [Torch]
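
A minimal sketch of experience replay, assuming transitions are stored as `(state, action, reward, next_state, done)` tuples (class and parameter names are illustrative):

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # Uniform random sampling breaks the temporal correlation of
        # consecutive transitions, which is what violates the i.i.d.
        # assumption in the first place.
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones
```
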
  • Asynchronous Methods for Deep Reinforcement Learning [ICML 2016]
  • Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection [arXiv 2016]
  • Active Object Localization with Deep Reinforcement Learning [ICCV 2015]
  • Dueling Network Architectures for Deep Reinforcement Learning [ICML 2016]
    • Ziyu Wang, Tom Schaul, Matteo Hessel, Hado van Hasselt, Marc Lanctot, Nando de Freitas
    • Best Paper in ICML 2016
    • Poses the question: is a conventional CNN architecture suitable for RL tasks?
    • Two-stream network (state-value and advantage function); see the sketch below
    • Focusing on innovating a neural network architecture that is better suited for model-free RL
    • Torch blog - Dueling Deep Q-Networks
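
A minimal sketch of how the two streams are combined into Q-values, using the mean-advantage baseline described in the paper (numpy arrays stand in for the network outputs):

```python
import numpy as np

def dueling_q_values(value, advantages):
    """Combine the state-value and advantage streams.

    value:      shape (batch, 1)            -- V(s)
    advantages: shape (batch, num_actions)  -- A(s, a)
    """
    # Subtracting the mean advantage keeps the decomposition identifiable.
    return value + advantages - advantages.mean(axis=1, keepdims=True)
```
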
  • Memory-based control with recurrent neural networks [NIPS 2015 Deep Reinforcement Learning Workshop]
    • Nicolas Heess, Jonathan J Hunt, Timothy P Lillicrap, David Silver
    • Use RNN to solve partially-observed problem
  • Control of Memory, Active Perception, and Action in Minecraft [arXiv 2016]
    • Junhyuk Oh, Valliappa Chockalingam, Satinder Singh, Honglak Lee
    • Addresses problems involving partial observability
    • Proposes a set of Minecraft tasks
    • Memory Q-Network (MQN), Recurrent Memory Q-Network (RMQN), and Feedback Recurrent Memory Q-Network (FRMQN)
  • Continuous Control With Deep Reinforcement Learning [ICLR 2016]
    • Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, Daan Wierstra
    • Solves continuous control tasks and avoids the curse of dimensionality
    • Deep version of DPG (deterministic policy gradient)
    • When going deep, some issues arise: using a non-linear function approximator makes training unstable
    • The different components of the observation may have different physical units, and the ranges may vary across environments; this is handled with batch normalization
    • For exploration, noise is added to the actor policy: µ'(s_t) = µ(s_t | θ^µ_t) + N (sketched below)
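
A minimal sketch of the exploration rule above, with Ornstein-Uhlenbeck noise as used in the paper (the `actor` callable and the hyper-parameter values are illustrative):

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process: temporally correlated exploration noise."""
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2):
        self.mu, self.theta, self.sigma = mu, theta, sigma
        self.state = np.full(action_dim, mu)

    def sample(self):
        dx = self.theta * (self.mu - self.state) + self.sigma * np.random.randn(len(self.state))
        self.state = self.state + dx
        return self.state

def noisy_action(actor, state, noise):
    # mu'(s_t) = mu(s_t | theta^mu_t) + N
    return actor(state) + noise.sample()
```
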
  • Deterministic Policy Gradient Algorithms [ICML 2014]
    • D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. Riedmiller
    • Highly recommended for learning policy network, and actor-critic algorithms
    • In continuous action spaces, greedy policy improvement becomes problematic, requiring a global maximisation at every step. Instead, a simple and computationally attractive alternative is to move the policy in the direction of the gradient of Q, rather than globally maximising Q
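
In symbols, the deterministic policy gradient theorem from the paper gives the direction in which to move the policy parameters (notation lightly simplified):

```latex
\nabla_{\theta} J(\mu_{\theta})
  = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[
      \nabla_{\theta}\, \mu_{\theta}(s)\,
      \nabla_{a} Q^{\mu}(s, a)\big|_{a = \mu_{\theta}(s)}
    \right]
```
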
  • Mastering the game of Go with deep neural networks and tree search [Nature 2016]
    • David Silver, Aja Huang
    • First stage: supervised learning of policy networks, including a rollout policy and the SL policy network (which learns from human expert games)
      • The rollout policy is used for fast but relatively inaccurate move prediction
      • The SL policy network is used to initialize the RL policy network (which is then improved by policy gradient)
    • To prevent overfitting, samples are auto-generated from self-play (half) and mixed with the KGS dataset (half) for training
    • Uses Monte Carlo tree search together with the policy network and value network. To understand MCTS better, please refer to here
      • Selection: select the most promising action according to Q + u(P), down to depth L (see the selection sketch below)
      • Expansion: after L steps, a new child node is created
      • Evaluation: the leaf is evaluated by a mixture of the value network and a simulated rollout
      • Backup: compute and store Q(s,a) and N(s,a), which are used during Selection
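
A minimal sketch of the Selection rule Q + u(P) (the edge statistics and the exploration constant c_puct are illustrative assumptions; the bookkeeping follows the paper only loosely):

```python
import math

def select_action(edges, c_puct=5.0):
    """Pick the action maximising Q(s, a) + u(s, a).

    `edges` maps each action to an object with prior probability P,
    visit count N, and mean action value Q.
    """
    total_visits = sum(edge.N for edge in edges.values())

    def score(edge):
        # u(s, a) grows with the prior P and decays as the edge is visited.
        u = c_puct * edge.P * math.sqrt(total_visits) / (1 + edge.N)
        return edge.Q + u

    return max(edges, key=lambda action: score(edges[action]))
```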

Open Source

Python users [TensorFlow, Theano]

Lua users [Torch]

Course

Textbook

Misc
