- A simple guide and collection of resources for studying RL/Deep RL in one to 2.5 months.
-
Introduction to Reinforcement Learning by Joelle Pineau, McGill University:
- Applications of RL.
- When to use RL?
- RL vs supervised learning
- What is an MDP (Markov Decision Process)?
- Components of an RL agent:
- States
- Actions (probabilistic effects)
- Reward function
- Initial state distribution
- Explanation of the Markov property.
- Maximizing utility in:
- Episodic tasks
- Continuing tasks
- The discount factor, gamma γ
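To make the discount factor concrete, here is a minimal sketch of the discounted return the agent maximizes; the reward sequence below is made up purely for illustration:

```python
# Discounted return G = sum_t gamma^t * r_t, accumulated back-to-front.
def discounted_return(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))  # 1 + 0.9 + 0.81 ≈ 2.71
```

For continuing tasks, gamma < 1 keeps this sum finite; for episodic tasks gamma = 1 is also allowed.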
- What is a policy and what do we do with it?
- A policy defines the action-selection strategy at every state:
- Value functions:
- The value-of-a-policy equations are (two forms of) Bellman’s equation.
- (This is a dynamic programming algorithm).
- Iterative Policy Evaluation:
- Main idea: turn Bellman equations into update rules.
- Optimal policies and optimal value functions.
- Finding a good policy: Policy Iteration (see the talk below by Pieter Abbeel)
- Finding a good policy: Value iteration
- Asynchronous value iteration:
- Instead of updating all states on every iteration, focus on important states.
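The value-iteration idea above (turning the Bellman optimality equation into an update rule) can be sketched on a tiny MDP; the transition model and rewards below are randomly generated and purely hypothetical:

```python
import numpy as np

# Value iteration on a small synthetic MDP: 3 states, 2 actions.
# P[a, s, s'] = transition probability, R[s, a] = immediate reward.
n_states, n_actions, gamma = 3, 2, 0.95
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # (A, S, S')
R = rng.standard_normal((n_states, n_actions))

V = np.zeros(n_states)
for _ in range(500):
    # Bellman optimality backup: V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
    Q = R + gamma * np.einsum("ast,t->sa", P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy policy w.r.t. the converged values
```

Asynchronous value iteration would apply the same backup to one (important) state at a time instead of sweeping all states per iteration.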
- Key challenges in RL:
- Designing the problem domain
- State representation, action choice, and the cost/reward signal
- Acquiring data for training: exploration vs. exploitation, high-cost actions, and time-delayed cost/reward signals
- Function approximation
- Validation / confidence measures
- The RL lingo.
- In large state spaces, we need function approximation:
- Fitted Q-iteration:
- Use supervised learning to estimate the Q-function from a batch of training data:
- Input, Output and Loss.
- e.g., the Arcade Learning Environment
- Deep Q-network (DQN) and tips.
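Fitted Q-iteration can be sketched with a simple least-squares regressor per action; the batch of transitions (s, a, r, s') below is synthetic, and one-hot state features are assumed so the regression is effectively tabular:

```python
import numpy as np

# Fitted Q-iteration on a synthetic batch of transitions.
rng = np.random.default_rng(1)
n, n_states, n_actions, gamma = 1000, 5, 2, 0.9
s = rng.integers(n_states, size=n)       # sampled states
a = rng.integers(n_actions, size=n)      # sampled actions
r = rng.standard_normal(n)               # observed rewards
s2 = rng.integers(n_states, size=n)      # sampled next states

X = np.eye(n_states)[s]                  # one-hot input features for s
X2 = np.eye(n_states)[s2]                # one-hot features for s'
W = np.zeros((n_actions, n_states))      # Q(s, a) ≈ W[a] @ one_hot(s)

for _ in range(300):
    # Regression targets: r + gamma * max_a' Q(s', a'); loss is squared error.
    targets = r + gamma * (X2 @ W.T).max(axis=1)
    for act in range(n_actions):
        m = a == act
        W[act], *_ = np.linalg.lstsq(X[m], targets[m], rcond=None)
```

DQN replaces the per-action linear regressors with one neural network (plus tricks such as replay buffers and target networks) but keeps the same regression target.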
-
- Why Policy Optimization?
- Cross Entropy Method (CEM) / Finite Differences / Fixing Random Seed
- Likelihood Ratio (LR) Policy Gradient
- Natural Gradient / Trust Regions (-> TRPO)
- Actor-Critic (-> GAE, A3C)
- Path Derivatives (PD) (-> DPG, DDPG, SVG)
- Stochastic Computation Graphs (generalizes LR / PD)
- Guided Policy Search (GPS)
- Inverse Reinforcement Learning
- Inverse RL vs. behavioral cloning
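The likelihood-ratio (REINFORCE-style) policy gradient from the list above can be sketched on a hypothetical two-armed bandit with a softmax policy; the arm means and learning rate below are made up for illustration:

```python
import numpy as np

# Likelihood-ratio policy gradient: grad J(theta) = E[ grad log pi(a; theta) * r ].
rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.8])        # hypothetical arm rewards; arm 1 is better
theta = np.zeros(2)                      # softmax policy parameters
alpha = 0.1                              # learning rate

for _ in range(2000):
    probs = np.exp(theta) / np.exp(theta).sum()
    a = rng.choice(2, p=probs)
    r = true_means[a] + 0.1 * rng.standard_normal()
    grad_log = -probs                    # d log pi(a; theta) / d theta for softmax
    grad_log[a] += 1.0
    theta += alpha * r * grad_log        # stochastic gradient ascent on E[r]

probs = np.exp(theta) / np.exp(theta).sum()
# after training, the policy should strongly prefer the better arm
```

Subtracting a baseline from r (as in actor-critic methods) reduces the variance of this estimator without biasing it.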
- Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.
- Algorithms for Reinforcement Learning.
- Reinforcement Learning and Dynamic Programming using Function Approximators.
-
Reinforcement Learning by David Silver.
- Lecture 1: Introduction to Reinforcement Learning
- Lecture 2: Markov Decision Processes
- Lecture 3: Planning by Dynamic Programming
- Lecture 4: Model-Free Prediction
- Lecture 5: Model-Free Control
- Lecture 6: Value Function Approximation
- Lecture 7: Policy Gradient Methods
- Lecture 8: Integrating Learning and Planning
- Lecture 9: Exploration and Exploitation
- Lecture 10: Case Study: RL in Classic Games
-
CS294 Deep Reinforcement Learning by John Schulman and Pieter Abbeel.
- Lecture 1: Intro, derivative-free optimization
- Lecture 2: Score function gradient estimation and policy gradients
- Lecture 3: Actor-critic methods
- Lecture 4: Trust region and natural gradient methods; open problems