Still-Rise / mlmi4-vcl

MLMI 4 - Team 1 implementation for variational continual learning


Variational Continual Learning

Original paper by Cuong V. Nguyen, Yingzhen Li, Thang D. Bui and Richard E. Turner

Part 1. Paper Summary

1. Introduction

  • Continual Learning
  • Challenge for Continual Learning
  • Variational Continual Learning

2. Continual Learning by Approximate Bayesian Inference

  • Online updating, derived from Bayes' rule

  • Posterior after the $T$-th dataset is proportional to the posterior after the $(T-1)$-th dataset multiplied by the likelihood of the $T$-th dataset: $p(\theta \mid \mathcal{D}_{1:T}) \propto p(\theta \mid \mathcal{D}_{1:T-1})\, p(\mathcal{D}_T \mid \theta)$ (a toy conjugate-Gaussian sketch of this recursion follows this list)
  • Projection Operation: approximation for intractable posterior (recursive)

  • This paper will use Online VI as it outperforms other methods for complex models in the static setting (Bui et al., 2016)
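The recursive update above can be made concrete with a toy conjugate example. The following sketch is illustrative only (not from the paper or this repository); it tracks a Gaussian posterior over a scalar mean with known noise variance, and checks that updating task by task recovers the same posterior as seeing all of the data at once.

```python
import numpy as np

def gaussian_posterior_update(prior_mu, prior_var, data, noise_var=1.0):
    """One step of p(theta | D_{1:t}) ∝ p(theta | D_{1:t-1}) p(D_t | theta)
    for a Gaussian mean with known observation noise (conjugate, so the update is exact)."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mu = post_var * (prior_mu / prior_var + np.sum(data) / noise_var)
    return post_mu, post_var

rng = np.random.default_rng(0)
tasks = [rng.normal(2.0, 1.0, size=50) for _ in range(3)]  # three sequentially observed datasets

# Sequential (continual) updates: the posterior after task t becomes the prior for task t+1.
mu, var = 0.0, 10.0  # prior p(theta)
for D_t in tasks:
    mu, var = gaussian_posterior_update(mu, var, D_t)

# A batch update on all the data at once gives the same answer, as Bayes' rule guarantees.
mu_batch, var_batch = gaussian_posterior_update(0.0, 10.0, np.concatenate(tasks))
assert np.allclose([mu, var], [mu_batch, var_batch])
```

For non-conjugate models such as Bayesian neural networks this exact update is intractable, which is where the projection (online VI) step comes in.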

2.1. VCL and Episodic Memory Enhancement

  • Projection Operation: KL divergence minimization over an approximating family $\mathcal{Q}$, with $q_0(\theta) = p(\theta)$: $q_t(\theta) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\left( q(\theta) \,\|\, \frac{1}{Z_t}\, q_{t-1}(\theta)\, p(\mathcal{D}_t \mid \theta) \right)$

  • $Z_t$: normalizing constant (not required when computing the optimum)
  • VCL becomes exact Bayesian inference if the family $\mathcal{Q}$ is rich enough to contain the true posterior
  • Potential Problems
  • Errors from repeated approximation → forget old tasks
  • Minimization at each step is also approximate → information loss
  • Solution: Coreset
  • Coreset: small representative set of data from previously observed tasks
  • Analogous to episodic memory (Lopez-Paz & Ranzato, 2017)
  • Coreset VCL: equivalent to a message-passing implementation of VI in which the coreset data point updates are scheduled after updating the other data
  • Coreset $C_t$: updated using $C_{t-1}$ and selected data points from $\mathcal{D}_t$ (e.g. random selection, K-center algorithm, ...)
  • K-center algorithm: return K data points that are spread throughout the input space (Gonzalez, 1985); a small sketch of this selection step follows the algorithm below
  • Variational Recursion: $p(\theta \mid \mathcal{D}_{1:t}) \propto p(\theta \mid \mathcal{D}_{1:t} \setminus C_t)\, p(C_t \mid \theta) \approx \tilde{q}_t(\theta)\, p(C_t \mid \theta)$

  • Algorithm
  • Step 1: Observe $\mathcal{D}_t$
  • Step 2: Update the coreset $C_t$ using $C_{t-1}$ and $\mathcal{D}_t$
  • Step 3: Update the variational distribution for the non-coreset data points (used for propagation): $\tilde{q}_t(\theta) = \mathrm{proj}\left(\tilde{q}_{t-1}(\theta)\, p(\mathcal{D}_t \cup C_{t-1} \setminus C_t \mid \theta)\right)$
  • Step 4: Compute the final variational distribution (used for prediction only, not propagation): $q_t(\theta) = \mathrm{proj}\left(\tilde{q}_t(\theta)\, p(C_t \mid \theta)\right)$
  • Step 5: Perform prediction at a test input $x^*$: $p(y^* \mid x^*, \mathcal{D}_{1:t}) = \int q_t(\theta)\, p(y^* \mid \theta, x^*)\, d\theta$
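The coreset update in Step 2 can use the greedy K-center heuristic mentioned above. Below is a minimal NumPy sketch of that selection step; it is an illustration (function name and sizes are made up), not this repository's implementation. In coreset VCL the new coreset $C_t$ would then be $C_{t-1}$ plus the points selected from $\mathcal{D}_t$.

```python
import numpy as np

def k_center_greedy(X, K, seed=0):
    """Greedy K-center selection (Gonzalez, 1985): repeatedly add the point that is
    farthest from the currently selected set. Returns indices into X."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]                # arbitrary first centre
    dists = np.linalg.norm(X - X[selected[0]], axis=1)    # distance to nearest selected point
    for _ in range(K - 1):
        idx = int(np.argmax(dists))                       # farthest remaining point
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(X - X[idx], axis=1))
    return np.array(selected)

# Example: pick a 20-point coreset from the current task's inputs.
X_t = np.random.default_rng(1).normal(size=(1000, 2))
coreset_idx = k_center_greedy(X_t, K=20)
```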

3. VCL in Deep Discriminative Models

  • Multi-head Networks
  • Standard architecture used for multi-task learning (Bakker & Heskes, 2003)
  • Share parameters close to the inputs / Separate heads for each output
  • More advanced model structures:
  • for continual learning (Rusu et al., 2016)
  • for multi-task learning in general (Swietojanski & Renals, 2014; Rebuffi et al., 2017)
  • automatic continual model building: adding new structure as new tasks are encountered
  • This paper assumes that the model structure is known a priori
  • Formulation
  • Model parameters $\theta = \{\theta^H_{1:T}, \theta^S\}$: task-specific head parameters plus shared parameters
  • Shared parameters $\theta^S$: updated constantly (for every task)
  • Head parameters $\theta^H_t$: remain at the prior at the beginning and are updated incrementally as each new task emerges
  • For simplicity, use a Gaussian mean-field approximate posterior: $q_t(\theta) = \prod_{d=1}^{D} \mathcal{N}\left(\theta_{t,d};\, \mu_{t,d},\, \sigma^2_{t,d}\right)$
  • Network Training
  • Maximize the negative online variational free energy (equivalently, the variational lower bound to the online marginal likelihood) with respect to the variational parameters: $\mathcal{L}^t_{\mathrm{VCL}}(q_t(\theta)) = \sum_{n=1}^{N_t} \mathbb{E}_{\theta \sim q_t(\theta)}\left[\log p\left(y_t^{(n)} \mid \theta, x_t^{(n)}\right)\right] - \mathrm{KL}\left(q_t(\theta) \,\|\, q_{t-1}(\theta)\right)$ (a minimal training sketch follows this list)
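As a concrete (and heavily simplified) illustration of the mean-field posterior and the objective above, here is a PyTorch-style sketch assuming a single variational linear layer; the multi-head structure is omitted, and the class and function names are assumptions rather than this repository's code.

```python
import torch
import torch.nn.functional as F
from torch import nn
from torch.distributions import Normal, kl_divergence

class MeanFieldLinear(nn.Module):
    """A linear layer with a factorised (mean-field) Gaussian posterior over its weights."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_mu = nn.Parameter(0.1 * torch.randn(d_out, d_in))
        self.w_logvar = nn.Parameter(torch.full((d_out, d_in), -6.0))
        self.b_mu = nn.Parameter(torch.zeros(d_out))
        self.b_logvar = nn.Parameter(torch.full((d_out,), -6.0))

    def posterior(self):
        """Current q(theta) as a single diagonal Gaussian over all weights and biases."""
        mu = torch.cat([self.w_mu.flatten(), self.b_mu])
        std = torch.cat([self.w_logvar.flatten(), self.b_logvar]).mul(0.5).exp()
        return Normal(mu, std)

    def forward(self, x):
        # Reparameterised weight sample: a one-sample Monte Carlo estimate
        # of the expected log-likelihood term.
        w = self.w_mu + self.w_logvar.mul(0.5).exp() * torch.randn_like(self.w_mu)
        b = self.b_mu + self.b_logvar.mul(0.5).exp() * torch.randn_like(self.b_mu)
        return F.linear(x, w, b)

def online_free_energy(layer, prev_posterior, x, y, n_task):
    """Negative online variational free energy for task t (to be maximised):
    E_q[log p(y | theta, x)] rescaled to the task size, minus KL(q_t || q_{t-1})."""
    log_lik = -F.cross_entropy(layer(x), y, reduction="mean") * n_task
    kl = kl_divergence(layer.posterior(), prev_posterior).sum()
    return log_lik - kl
```

Here `prev_posterior` would be the (detached) posterior learned after task $t-1$, or the prior for the first task, and `n_task` is the task size $N_t$ so that a minibatch estimate is rescaled to the full dataset.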

4. VCL in Deep Generative Models

  • Deep Generative Models
  • Formulation - VAE approach (batch learning)

  • $p(z)$: prior over latent variables, typically Gaussian
  • $p(x \mid z, \theta)$: defined by a DNN $f_\theta(z)$, where $\theta$ collects the weight matrices and bias vectors
  • Learning $\theta$: approximate maximum likelihood estimation, i.e. maximize the variational lower bound with respect to $\theta$ and the parameters $\phi$ of the encoder $q_\phi(z \mid x)$:
    $\mathcal{L}_{\mathrm{VAE}}(\theta, \phi) = \sum_{n=1}^{N} \mathbb{E}_{q_\phi(z^{(n)} \mid x^{(n)})}\left[\log \frac{p(x^{(n)} \mid z^{(n)}, \theta)\, p(z^{(n)})}{q_\phi(z^{(n)} \mid x^{(n)})}\right]$

  • This gives no parameter uncertainty estimates, which are what is used to weight the information learned from old data
  • Formulation - VCL approach (continual learning)
  • Approximate the full posterior over parameters: $q_t(\theta) \approx p(\theta \mid \mathcal{D}_{1:t})$
  • Maximize the full variational lower bound with respect to $q_t$ and $\phi$:
    $\mathcal{L}^t_{\mathrm{VCL}}(q_t, \phi) = \mathbb{E}_{q_t(\theta)}\left[\sum_{n=1}^{N_t} \mathbb{E}_{q_{\phi_t}(z_t^{(n)} \mid x_t^{(n)})}\left[\log \frac{p(x_t^{(n)} \mid z_t^{(n)}, \theta)\, p(z_t^{(n)})}{q_{\phi_t}(z_t^{(n)} \mid x_t^{(n)})}\right]\right] - \mathrm{KL}\left(q_t(\theta) \,\|\, q_{t-1}(\theta)\right)$
  • The encoder parameters $\phi_t$ are task-specific, although it is likely to be beneficial to share (parts of) these encoder networks across tasks
  • Model Architecture
  • Latent variables → Intermediate-level representations
  • Architecture 1: shared bottom network - suitable when data are composed of a common set of structural primitives (e.g. strokes)
  • Architecture 2: shared head network - information tends to be entirely encoded in the (task-specific) bottom networks (a small sketch of the task-specific / shared decoder split follows this list)
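To make the two generator layouts concrete, here is a small PyTorch sketch of a decoder split into a per-task component and a shared component (layer sizes and names are illustrative assumptions, not this repository's architecture); swapping which of the two parts is shared gives the two architectures described above.

```python
import torch
from torch import nn

latent_dim, hidden_dim, data_dim, n_tasks = 32, 256, 784, 5   # illustrative sizes

# Per-task components: one small network per task, added as new tasks arrive.
task_nets = nn.ModuleList([
    nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU()) for _ in range(n_tasks)
])
# Shared component: a single network reused by every task.
shared_net = nn.Sequential(nn.Linear(hidden_dim, data_dim), nn.Sigmoid())

def decode(z, task_id):
    """Generator for one task: latent z -> intermediate representation h -> data x.
    Swapping which of the two components is per-task and which is shared yields
    the two architectures listed above."""
    h = task_nets[task_id](z)
    return shared_net(h)

x = decode(torch.randn(8, latent_dim), task_id=0)   # a batch of 8 generated samples
```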

5. Related Work

  • Continual Learning for Deep Discriminative Models (regularized MLE)

  • ML Estimation - set the regularization term to zero (no penalty for deviating from previously learned parameters)
  • MAP Estimation - assume a Gaussian prior and use cross-validation to set its scale → catastrophic forgetting
  • Laplace Propagation (LP) (Smola et al., 2004) - recursion for the quadratic regularizer using Laplace's approximation: each new task adds the Hessian of its negative log-likelihood at the current estimate
  • Diagonal LP: retain only the diagonal terms to avoid computing and storing the full Hessian

  • Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) - a modified diagonal LP
  • Approximates the average Hessian of the likelihoods using the Fisher information
  • Regularization term: introduces a hyperparameter, removes the prior term, and regularizes toward the intermediate parameter estimates from each previous task (a sketch of this style of penalty follows the table below)

  • Synaptic Intelligence (SI) (Zenke et al., 2017) - computes the regularization strengths from a measure of each parameter's importance to each task
  • Approximate Bayesian training of neural networks (methods for approximating the weight posterior $p(\theta \mid \mathcal{D})$):

|Approach|References|
|-|-|
|extended Kalman filtering|Singhal & Wu, 1989|
|Laplace's approximation|MacKay, 1992|
|variational inference|Hinton & Van Camp, 1993; Barber & Bishop, 1998; Graves, 2011; Blundell et al., 2015; Gal & Ghahramani, 2016|
|sequential Monte Carlo|de Freitas et al., 2000|
|expectation propagation|Hernández-Lobato & Adams, 2015|
|approximate power EP|Hernández-Lobato et al., 2016|
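For comparison with VCL's KL term, the sketch below shows the diagonal quadratic penalty used by EWC-style regularization. It is illustrative only (the argument names are assumptions); the per-parameter importance weights would come from a diagonal Fisher-information estimate accumulated on earlier tasks.

```python
import torch

def ewc_penalty(params, old_params, importances, lam=1.0):
    """Quadratic regulariser 0.5 * lam * sum_i F_i * (theta_i - theta*_i)^2 pulling the
    current parameters toward values learned on earlier tasks, weighted by an estimate
    of each parameter's importance (e.g. a diagonal Fisher-information estimate)."""
    penalty = torch.zeros(())
    for p, p_old, f in zip(params, old_params, importances):
        penalty = penalty + (f * (p - p_old.detach()) ** 2).sum()
    return 0.5 * lam * penalty

# Usage sketch (names hypothetical):
# loss = task_nll + ewc_penalty(model.parameters(), saved_params, fishers)
```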

  • Continual Learning for Deep Generative Models
  • Naïve approach: apply a separate VAE to each new dataset with parameters initialized at the previous task's values → catastrophic forgetting
  • Alternative: add an EWC regularization term to the VAE objective, approximating the marginal likelihood by the variational lower bound
  • Similar approximations can be used for the Hessian matrices required by LP and for SI (the marginal likelihood can also be estimated by importance sampling: Burda et al., 2016)
