Paper reading Sep 2019 [3]
xysun opened this issue · comments
Xiayun Sun commented
Yes 0 paper in August. Shame :(
In September I read the 3 AlphaGo papers.
AlphaGo
- Combine tree search with policy + value network
- First a supervised learning (SL) policy, trained on expert moves to predict human moves
- A fast rollout policy trained similarly, but with fewer features & a smaller network
- Then a self-play policy, initialized from the SL policy and trained with policy gradient to win
- Then a value network trained on the self-play dataset
- Then, during search: from the root, select actions by maximizing value plus a prior-based bonus (priors from the SL policy; the RL policy is only used to generate the value network's training data); leaf value = weighted mean of the value network's estimate and an MC estimate from the fast rollout policy; the actual move played = the child with the most visits
- Tree search runs asynchronously on CPUs, while network inference runs on GPUs
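The leaf-evaluation mixing above can be sketched in a few lines. This is a minimal illustration, not the paper's code; `evaluate_leaf` and its argument names are made up for clarity, and the mixing weight λ = 0.5 is the value reported in the AlphaGo paper.

```python
LAMBDA = 0.5  # mixing weight between value net and rollout (AlphaGo paper)

def evaluate_leaf(value_estimate: float, rollout_outcome: float,
                  lam: float = LAMBDA) -> float:
    """Leaf value V(s) = (1 - lam) * v_theta(s) + lam * z,
    where v_theta is the value network's estimate and z is the
    outcome of a fast-rollout playout from the same leaf."""
    return (1 - lam) * value_estimate + lam * rollout_outcome
```

With λ = 0, search trusts the value network alone; with λ = 1, it reduces to pure Monte Carlo rollouts.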
AlphaGo Zero
- No human knowledge
- Starting from random play
- Only black & white stones as features (AlphaGo used some other features)
- A single network outputs both the action distribution and the winning prediction (unlike AlphaGo's separate policy & value networks): (p, v) = f(s); p = vector of p(a|s); v = probability of winning from the current position; f = network; s = board position and history
- Inference: Simple tree search without MC rollouts
- Training: match (p, v) to (π, z), where π is the move probability distribution from MCTS and z is the winner of that game; see equation (1) for the loss function
- The MCTS selection rule is the same as in the AlphaGo paper; π is proportional to the exponentiated visit count of each move
- Evaluated residual vs. convolutional and separate vs. dual-headed networks; residual + dual-headed is best
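The training target and loss above are small enough to sketch directly. This is a hedged illustration with NumPy, not DeepMind's code: `azero_loss` implements equation (1) minus the L2 weight-decay term (which depends on network parameters), and `mcts_policy` builds π from visit counts with a temperature τ; both names are mine.

```python
import numpy as np

def azero_loss(p: np.ndarray, v: float, pi: np.ndarray, z: float) -> float:
    """AlphaGo Zero loss (eq. 1, without weight decay):
    l = (z - v)^2 - pi^T log p
    i.e. squared value error + cross-entropy between the MCTS
    policy pi and the network's policy p."""
    return (z - v) ** 2 - float(pi @ np.log(p))

def mcts_policy(visit_counts: np.ndarray, tau: float = 1.0) -> np.ndarray:
    """pi(a) proportional to N(a)^(1/tau): exponentiated visit counts."""
    exp_counts = visit_counts ** (1.0 / tau)
    return exp_counts / exp_counts.sum()
```

As τ → 0 the policy collapses onto the most-visited move (greedy play); τ = 1 was used for the early moves of each self-play game to encourage exploration.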
AlphaZero
- Generalized to other games.
- Go: rotation/reflection invariant, binary outcome; particularly well-suited to conv nets
- AlphaZero: no data augmentation; no per-game hyperparameter tuning (AlphaGo Zero used Bayesian optimization)
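The symmetry-based augmentation that AlphaGo Zero exploited and AlphaZero dropped is easy to make concrete: a Go board has 8 symmetries (4 rotations × optional reflection), the dihedral group D4. A minimal NumPy sketch, with a function name of my own choosing:

```python
import numpy as np

def dihedral_transforms(board: np.ndarray):
    """Yield all 8 symmetric variants of a square board array
    (4 rotations, each optionally mirrored). AlphaGo Zero used these
    to augment Go training data; AlphaZero drops this because chess
    and shogi are not rotation/reflection invariant."""
    for k in range(4):
        rotated = np.rot90(board, k)
        yield rotated
        yield np.fliplr(rotated)
```

For games without these symmetries, applying such transforms would corrupt the data (e.g. rotating a chess board changes pawn direction), which is one reason AlphaZero trains on raw positions only.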