A collection of paper notes (PDF + LaTeX) on reinforcement learning, with compact summaries and a focus on detailed mathematical derivations.


Derivative-Free Optimization

  • Szita et al., Learning Tetris using the Noisy Cross-Entropy Method

Evolution Strategies

  • Hansen, The CMA Evolution Strategy: A Tutorial
  • Graves et al., Parameter-exploring policy gradients
  • Wierstra et al., Natural Evolution Strategies
  • Tucker et al., Guided evolutionary strategies: escaping the curse of dimensionality in random search
  • Blundell et al., PathNet: Evolution Channels Gradient Descent in Super Neural Networks


  • Barto et al., Learning Parameterized Skills
  • Nachum et al., Bridging the Gap Between Value and Policy Based Reinforcement Learning
  • Schaul et al., Universal Value Function Approximators
  • Levine, Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review
  • Srouji et al., Structured Control Nets for Deep Reinforcement Learning
  • Dai et al., SBEED: Convergent Reinforcement Learning with Nonlinear Function Approximation
  • Bellemare et al., Increasing the Action Gap: New Operators for Reinforcement Learning
  • Bengio et al., Disentangling the independently controllable factors of variation by interacting with the world
  • Braun et al., A Minimum Relative Entropy Principle for Learning and Acting
  • Silver et al., Learning Continuous Control Policies by Stochastic Value Gradients
  • Silver et al., Continuous control with deep reinforcement learning

Theory in RL

  • Bartlett et al., Infinite-Horizon Policy-Gradient Estimation
  • Fazel et al., Global Convergence of Policy Gradient Methods for Linearized Control Problems
  • Pardo et al., Time Limits in Reinforcement Learning
  • Meger et al., Addressing Function Approximation Error in Actor-Critic Methods

Policy Gradients

  • Peters et al., Reinforcement learning of motor skills with policy gradients
  • Silver et al., Asynchronous methods for Deep Reinforcement Learning (A2C)
  • Precup et al., DRL that Matters
  • Schulman et al., High-dimensional continuous control using generalized advantage estimation
  • Schulman et al., Benchmarking deep reinforcement learning for continuous control
  • Kakade, Approximately Optimal Approximate Reinforcement Learning
  • Silver et al., DPG
  • Silver et al., DDPG
  • Schulman et al., TRPO
  • Schulman et al., PPO
  • Liu et al., Stein Variational Policy Gradient
  • Gruslys et al., The Reactor
  • Wang et al., Sample Efficient Actor-Critic with Experience Replay
  • Gu et al., Q-Prop
  • Gu et al., Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning
  • Kakade, A Natural Policy Gradient


  • Silver et al., Human-level control through deep reinforcement learning
  • Silver et al., Double Q-Learning
  • Silver et al., Dueling network architectures for deep reinforcement learning
  • Silver et al., Prioritized experience replay
  • Silver et al., Rainbow: Combining Improvements in DRL
  • Bellemare et al., A Distributional Perspective on Reinforcement Learning
  • Sutton et al., A Deeper Look at Experience Replay

Model-based RL & Planning

  • Doll et al., The ubiquity of model-based reinforcement learning
  • Tamar et al., Value Iteration Networks
  • Karkus et al., QMDP-Net: Deep Learning for Planning under Partial Observability
  • Tamar et al., Learning Generalized Reactive Policies using Deep Neural Networks
  • Tamar et al., Learning Plannable Representations with Causal InfoGAN
  • Singh et al., Value Prediction Networks
  • Lin et al., Value Propagation Networks
  • Lee et al., Gated Path Planning Networks
  • Salakhutdinov et al., LSTM Iteration Networks: An Exploration of Differentiable Path Finding
  • Wierstra et al., Learning Dynamic State Abstractions for Model-Based Reinforcement Learning
  • Gal et al., Improving PILCO with Bayesian Neural Network Dynamics Models
  • Meger et al., Synthesizing Neural Network Controllers with Probabilistic Model-based Reinforcement Learning
  • Levine et al., Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models
  • Wierstra et al., Learning model-based planning from scratch
  • Gu et al., Continuous Deep Q-Learning with Model-based Acceleration
  • Lecun et al., Model-Based Planning in Discrete and Continuous Actions
  • Silver et al., The Predictron: End-To-End Learning and Planning
  • Weber et al., Imagination-Augmented Agents for Deep Reinforcement Learning
  • Li et al., Iterative Linear Quadratic Regulator Design for Nonlinear Biological Movement Systems
  • Chockalingam et al., Differentiable Neural Planners with Temporally Extended Actions
  • Mishra et al., Prediction and Control with Temporal Segment Models
  • Metz et al., Discrete Sequential Prediction of Continuous Actions for Deep RL
  • Moerland et al., Learning Multimodal Transition Dynamics for Model-Based Reinforcement Learning
  • Chiappa et al., Recurrent Environment Simulators
  • Anthony et al., Thinking Fast and Slow with Deep Learning and Tree Search
  • Graves et al., Strategic Attentive Writer for Learning Macro-Actions
  • Sukhbaatar et al., Composable Planning with Attributes
  • Vinyals et al., Metacontrol for adaptive imagination-based optimization
  • Gerstner et al., Efficient Model-based Deep Reinforcement Learning with Variational State Tabulation
  • Whiteson et al., TreeQN and ATreeC: Differentiable Tree-Structured Models for Deep Reinforcement Learning
  • Abbeel et al., Universal Planning Networks
  • Dinh et al., Learning Awareness Models
  • Abbeel et al., Model-ensemble Trust-Region Policy Optimization
  • Levine et al., Model-based Value Expansion for Efficient Model-Free Reinforcement Learning
  • Levine et al., Recall Traces: Backtracking Models for Efficient Reinforcement Learning
  • Levine et al., Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning
  • Levine et al., Temporal Difference Models: Model-Free Deep RL for Model-Based Control
  • Gregor et al., Temporal Difference Variational Auto-Encoder
  • Abbeel et al., SOLAR: Deep Structured Latent Representations for Model-Based Reinforcement Learning
  • Scholkopf et al., Adaptive Skip Intervals: Temporal Abstraction for Recurrent Dynamical Models
  • Whiteson et al., Deep Variational Reinforcement Learning for POMDPs
  • Singh et al., Improving model-based RL with Adaptive Rollout using Uncertainty Estimation
  • Abbeel et al., Model-Based Reinforcement Learning via Meta-Policy Optimization

Exploration in RL

  • Osband et al., A Tutorial on Thompson Sampling (Journal version, 2018)
  • Osband et al., Deep Exploration via Bootstrapped DQN
  • Osband et al., (More) efficient reinforcement learning via posterior sampling
  • Osband et al., Why is Posterior Sampling Better than Optimism for Reinforcement Learning?
  • Abbeel et al., VIME: Variational Information Maximizing Exploration
  • Ostrovski et al., Count-Based Exploration with Neural Density Models
  • Tang et al., #Exploration: A Study of Count-based Exploration for Deep Reinforcement Learning
  • Fortunato et al., Noisy Networks for Exploration
  • Plappert et al., Parameter Space Noise for Exploration
  • Wierstra et al., Learning and Querying Fast Generative Models for Reinforcement Learning
  • Bellemare et al., Unifying Count-Based Exploration and Intrinsic Motivation
  • Levine et al., EX2: Exploration with Exemplar Models for Deep Reinforcement Learning
  • Rezende et al., Variational Information Maximisation for Intrinsically Motivated Reinforcement Learning
  • Agrawal et al., Curiosity-driven Exploration by Self-supervised Prediction
  • Moerland et al., The Potential of the Return Distribution for Exploration in RL
  • Pineau et al., Randomized Value Functions via Multiplicative Normalizing Flows
  • Abbeel et al., Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models

Hierarchical RL

  • Barto, Intrinsically motivated learning of hierarchical collections of skills
  • Sutton et al., Between MDPs and Semi-MDPs: Learning, planning, and representing knowledge at multiple temporal scales
  • Silver et al., FeUdal Networks for Hierarchical Reinforcement Learning
  • Levine et al., Data-Efficient Hierarchical Reinforcement Learning


  • Silver et al., Meta-Gradient Reinforcement Learning
  • Abbeel et al., Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments
  • Abbeel et al., Learning to Adapt: Meta-Learning for Model-Based Control
  • Schulman et al., On First-Order Meta-Learning Algorithms
  • Schaul et al., Learning to learn by gradient descent by gradient descent
  • Abbeel et al., Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
  • Botvinick et al., Learning to Reinforcement Learn
  • Abbeel et al., A Simple Neural Attentive Meta-Learner

Graph Networks in RL

  • Rezende et al., Interaction Networks for Learning about Objects, Relations and Physics
  • Wang et al., NerveNet: Learning Structured Policy with Graph Neural Networks
  • Riedmiller et al., Graph Networks as Learnable Physics Engines for Inference and Control

Curriculum/Multitask RL

  • Schmidhuber, PowerPlay: Training an increasingly general problem solver by continually searching for the simplest still unsolvable problem
  • Aljundi et al., Memory aware synapses: Learning what (not) to forget
  • French, Catastrophic forgetting in connectionist networks
  • Kirkpatrick et al., Overcoming catastrophic forgetting in neural networks
  • Rusu et al., Progressive Neural Networks
  • Rusu et al., Policy Distillation
  • Blundell et al., Memory-based Parameter Adaptation
  • Zenke et al., Continual Learning Through Synaptic Intelligence
  • Hadsell et al., Distral: Robust Multitask Reinforcement Learning
  • Silver et al., Unicorn: Continual Learning with a Universal, Off-policy Agent
  • Schulman et al., Teacher-Student Curriculum Learning
  • Abbeel et al., Reverse Curriculum Generation for Reinforcement Learning
  • Graves et al., Automated Curriculum Learning for Neural Networks
  • Bengio, Curriculum Learning
  • Clopath et al., Continual Reinforcement Learning with Complex Synapses
  • Lin et al., Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play
  • Masse et al., Alleviating catastrophic forgetting using context-dependent gating and synaptic stabilization
  • Oudeyer et al., Intrinsically Motivated Goal Exploration Processes with Automatic Curriculum Learning
  • Oudeyer et al., Unsupervised Learning of Goal Spaces for Intrinsically Motivated Goal Exploration
  • Oudeyer et al., Accuracy-based Curriculum Learning in Deep Reinforcement Learning

Neuroscience & Cognitive Science

  • Botvinick et al., The hippocampus as a predictive map
  • Botvinick, Hierarchical models of behavior and prefrontal function
  • Botvinick, Hierarchical reinforcement learning and decision making
  • Botvinick, A neural signature of hierarchical reinforcement learning
  • Botvinick, Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective
  • Botvinick et al., Reinforcement learning, efficient coding, and the statistics of natural tasks
  • Hassabis et al., Neuroscience-Inspired Artificial Intelligence
  • Tenenbaum et al., Building machines that learn and think like people
  • Doll et al., The ubiquity of model-based reinforcement learning
  • Morel et al., Linearization of excitatory synaptic integration at no extra cost
  • Lau et al., What is consciousness, and could machines have it?
  • Points et al., Artificial intelligence exploration of unstable protocells leads to predictable properties and discovery of collective behavior
  • Moser et al., Place cells, grid cells, and the brain's spatial representation system
  • Frank et al., Within- and across-trial dynamics of human EEG reveal cooperative interplay between reinforcement learning and working memory
  • Behrens et al., What is a cognitive map? Organising knowledge for flexible behaviour
  • Niv et al., Reinforcement learning in the brain

Deep learning

  • Dumoulin, A guide to convolution arithmetic for deep learning
  • Bottou, Stochastic Gradient Descent Tricks
  • Kingma et al., Auto-Encoding Variational Bayes
  • Hauser et al., Principles of Riemannian Geometry in Neural Networks

Optimization & Variational Inference

  • Bottou et al., Optimization Methods for Large-Scale Machine Learning
  • Martens et al., Optimizing Neural Networks with Kronecker-factored Approximate Curvature
  • Barber et al., Variational Optimization
  • Grathwohl et al., Backpropagation through the Void: Optimizing control variates for black-box gradient estimation
  • Grosse et al., Noisy Natural Gradient as Variational Inference
  • Gal et al., Fast and Scalable Bayesian Deep Learning by Weight-Perturbation in Adam
  • Nielsen et al., Variational Adaptive-Newton Method for Explorative Learning
  • Whiteson et al., DiCE: The Infinitely Differentiable Monte-Carlo Estimator
  • Maddison et al., The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables
  • Gu et al., Categorical Reparameterization with Gumbel-Softmax
  • Doucet et al., Hamiltonian Descent Methods
  • Bengio et al., On the Learning Dynamics of Deep Neural Networks
  • Barber et al., Stochastic Variational Optimization
  • Martens et al., New insights and perspectives on the natural gradient method

Causal inference & Reasoning & Causal RL

  • Scholkopf et al., Towards a Learning Theory of Cause-Effect Inference
  • Ziebart et al., Modeling Interaction via the Principle of Maximum Causal Entropy
  • Ziebart et al., Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy

Bayesian Neural Networks & Bayesian RL

  • Osband et al., Randomized Prior Functions for Deep Reinforcement Learning
  • Blundell et al., Weight Uncertainty in Neural Networks
  • Blundell et al., Bayesian Recurrent Neural Networks
  • Gal et al., What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?
  • Hernandez-Lobato et al., Black-Box alpha-Divergence Minimization
  • Blei et al., Variational Inference: A Review for Statisticians
  • Roeder et al., Sticking the Landing: Simple, Lower-Variance Gradient Estimators for Variational Inference
  • Lakshminarayanan et al., Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles
  • Henderson et al., Bayesian Policy Gradients via Alpha Divergence Dropout Inference
  • Vetrov et al., Structured Bayesian Pruning via Log-Normal Multiplicative Noise
  • Rezende et al., Neural Processes

Useful maths

  • Schon et al., Manipulating the Multivariate Gaussian Density

Waiting list

  • Tishby et al., A Unified Bellman Equation for Causal Information and Value in Markov Decision Processes
  • Botvinick et al., Learning to Share and Hide Intentions using Information Regularization
  • Walter et al., Gated Complex Recurrent Neural Networks
  • Bengio et al., Quaternion Recurrent Neural Networks
  • Bengio et al., Deep complex networks


