Paper reading Sep 2019 [3]

Yes, 0 papers in August. Shame :(

In September I read the 3 AlphaGo papers.

AlphaGo

  • Combines tree search with policy and value networks
  • First, a supervised learning (SL) policy trained on expert moves to predict human moves
  • A fast rollout policy trained similarly, but with fewer features & a smaller network
  • Then a self-play policy trained with policy gradient, initialized from the SL policy and optimized for winning (see the policy-gradient sketch after this list)
  • Then a value network trained on the self-play dataset
  • Then from the root, pick actions with the RL policy, expand leaves with the SL policy; leaf value = weighted mean of the value network estimate and an MC estimate from the fast rollout policy; the actual move played = the child with the most visits (see the search sketch after this list)
  • Search runs asynchronously on CPUs, while network predictions run on GPUs
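
A minimal sketch of the self-play policy-gradient step as I understand it (REINFORCE on the game outcome); the linear softmax policy, the feature/move sizes, and names like `reinforce_update` are made up for illustration, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_moves = 32, 9   # made-up sizes; the real input is board planes, the real output is 19x19 moves
W = rng.normal(scale=0.01, size=(n_moves, n_features))  # in the paper this is initialized from the SL policy

def policy(features):
    """Softmax policy over moves given a feature vector."""
    logits = W @ features
    p = np.exp(logits - logits.max())
    return p / p.sum()

def reinforce_update(features, action, outcome, lr=0.01):
    """REINFORCE: push up the log-prob of moves from games the player won (outcome = +1/-1)."""
    global W
    p = policy(features)
    one_hot = np.zeros(n_moves)
    one_hot[action] = 1.0
    grad_log_p = np.outer(one_hot - p, features)  # gradient of log softmax for a linear policy
    W += lr * outcome * grad_log_p

# usage: after a self-play game, update on every (state, action) with the final outcome
feats = rng.normal(size=n_features)
reinforce_update(feats, action=3, outcome=+1)
```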
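
And a toy sketch of how the two evaluations are mixed at a leaf and how the final move is picked; `value_net` and `fast_rollout` are placeholder stubs, and `lam` is the mixing weight (0.5 in the paper, if I recall):

```python
import random

def value_net(state):
    # placeholder for the value network's estimate of winning from `state`
    return random.uniform(-1, 1)

def fast_rollout(state):
    # placeholder: play to the end with the fast rollout policy and return the +/-1 outcome
    return random.choice([-1, 1])

def evaluate_leaf(state, lam=0.5):
    # weighted mean of the value network estimate and the Monte Carlo rollout estimate
    return (1 - lam) * value_net(state) + lam * fast_rollout(state)

def choose_move(root_visit_counts):
    # the move actually played is the root child with the most visits, not the highest value
    return max(root_visit_counts, key=root_visit_counts.get)

print(evaluate_leaf("some board position"))
print(choose_move({"D4": 120, "Q16": 340, "C3": 55}))  # -> Q16
```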

AlphaGo Zero

  • No human knowledge
  • Starting from random play
  • Only black & white stone positions as features (AlphaGo used additional hand-crafted features)
  • Single network for both the move distribution and the winning prediction (unlike AlphaGo's separate policy & value networks): (p, v) = f(s); p = vector of p(a|s); v = probability of winning from the current position; f = network; s = board position and history (toy sketch after this list)
  • Inference: Simple tree search without MC rollouts
  • Training: match (p, v) with (π, z), where π is the move probability from MCTS and z is the winner of the sample; see equation (1) for the loss function (sketched after this list)
  • The MCTS search is the same as in the AlphaGo paper; π is proportional to the exponentiated visit count of each move
  • Evaluated residual vs. convolutional and separate vs. dual networks; residual + dual is best
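
A toy version of the single two-headed network (p, v) = f(s); the linear "body" and the sizes here are stand-ins for the deep residual conv net in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
board_size = 19
n_moves = board_size * board_size + 1  # every point plus pass

W_body = rng.normal(scale=0.01, size=(256, board_size * board_size))
W_policy = rng.normal(scale=0.01, size=(n_moves, 256))
W_value = rng.normal(scale=0.01, size=(1, 256))

def f(s):
    """s: board features flattened to a vector; returns (p, v)."""
    h = np.tanh(W_body @ s)          # shared representation
    logits = W_policy @ h
    p = np.exp(logits - logits.max())
    p /= p.sum()                     # p = vector of p(a|s)
    v = np.tanh(W_value @ h)[0]      # v in (-1, 1), predicted game outcome
    return p, v

s = rng.normal(size=board_size * board_size)
p, v = f(s)
print(p.shape, v)  # (362,) and a scalar
```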
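
And a sketch of the training target and loss as I read equation (1): π is the exponentiated-visit-count distribution with temperature τ, and the loss is l = (z - v)^2 - π·log p + c·||θ||^2; names like `visit_counts`, `tau`, `c` are mine, not the paper's code:

```python
import numpy as np

def pi_from_visits(visit_counts, tau=1.0):
    """pi(a) proportional to N(a)^(1/tau), the MCTS-improved move distribution."""
    counts = np.asarray(visit_counts, dtype=float) ** (1.0 / tau)
    return counts / counts.sum()

def loss(p, v, pi, z, theta=None, c=1e-4):
    value_term = (z - v) ** 2                      # match v to the game winner z
    policy_term = -np.sum(pi * np.log(p + 1e-12))  # cross-entropy between pi and p
    l2_term = c * np.sum(theta ** 2) if theta is not None else 0.0
    return value_term + policy_term + l2_term

pi = pi_from_visits([10, 50, 200, 5], tau=1.0)
p = np.array([0.1, 0.2, 0.6, 0.1])
print(pi, loss(p, v=0.3, pi=pi, z=1.0))
```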

AlphaZero

  • Generalized to other games.
  • Go: rotation/reflection invariant; binary outcome; particularly well suited to conv nets
  • AlphaZero: no data augmentation (see the symmetry sketch below); no hyperparameter tuning (AlphaGo Zero used Bayesian optimization)
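
A small sketch of the 8-fold board symmetry (rotations + reflections) that Go admits and AlphaGo Zero exploited for data augmentation; AlphaZero drops this, since chess and shogi are not symmetric in this way. The function name and the 3x3 stand-in board are mine:

```python
import numpy as np

def dihedral_symmetries(board):
    """Yield the 8 rotations/reflections of a square board array."""
    for k in range(4):
        rotated = np.rot90(board, k)
        yield rotated
        yield np.fliplr(rotated)

board = np.arange(9).reshape(3, 3)  # stand-in for a 19x19 position
print(len(list(dihedral_symmetries(board))))  # 8 symmetric copies of the same position
```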