Still-Rise / mlmi4-vcl

MLMI 4 - Team 1 implementation for variational continual learning


Variational Continual Learning

Original paper by Cuong V. Nguyen, Yingzhen Li, Thang D. Bui and Richard E. Turner

Part 1. Paper Summary

1. Introduction

  • Continual Learning
  • Challenge for Continual Learning
  • Variational Continual Learning

2. Continual Learning by Approximate Bayesian Inference

  • Online updating, derived from Bayes' rule

  • Posterior after the $T$-th dataset is proportional to the posterior after the $(T-1)$-th dataset multiplied by the likelihood of the $T$-th dataset: $p(\theta \mid \mathcal{D}_{1:T}) \propto p(\theta \mid \mathcal{D}_{1:T-1})\, p(\mathcal{D}_T \mid \theta)$ (a toy conjugate-Gaussian sketch of this recursion follows this list)
  • Projection Operation: approximation for intractable posterior (recursive)

  • This paper will use Online VI as it outperforms other methods for complex models in the static setting (Bui et al., 2016)
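The recursive update above can be made concrete with a toy conjugate example. The following sketch is illustrative only (not from the paper or this repository); it tracks a Gaussian posterior over a scalar mean with known noise variance, and checks that updating task by task recovers the same posterior as seeing all of the data at once.

```python
import numpy as np

def gaussian_posterior_update(prior_mu, prior_var, data, noise_var=1.0):
    """One step of p(theta | D_{1:t}) ∝ p(theta | D_{1:t-1}) p(D_t | theta)
    for a Gaussian mean with known observation noise (conjugate, so the update is exact)."""
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mu = post_var * (prior_mu / prior_var + np.sum(data) / noise_var)
    return post_mu, post_var

rng = np.random.default_rng(0)
tasks = [rng.normal(2.0, 1.0, size=50) for _ in range(3)]  # three sequentially observed datasets

# Sequential (continual) updates: the posterior after task t becomes the prior for task t+1.
mu, var = 0.0, 10.0  # prior p(theta)
for D_t in tasks:
    mu, var = gaussian_posterior_update(mu, var, D_t)

# A batch update on all the data at once gives the same answer, as Bayes' rule guarantees.
mu_batch, var_batch = gaussian_posterior_update(0.0, 10.0, np.concatenate(tasks))
assert np.allclose([mu, var], [mu_batch, var_batch])
```

For non-conjugate models such as Bayesian neural networks this exact update is intractable, which is where the projection (online VI) step comes in.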

2.1. VCL and Episodic Memory Enhancement

  • Projection Operation: KL divergence minimization over an approximating family $\mathcal{Q}$, with $q_0(\theta) = p(\theta)$: $q_t(\theta) = \arg\min_{q \in \mathcal{Q}} \mathrm{KL}\left( q(\theta) \,\|\, \frac{1}{Z_t}\, q_{t-1}(\theta)\, p(\mathcal{D}_t \mid \theta) \right)$

  • $Z_t$: normalizing constant (not required when computing the optimum)
  • VCL becomes exact Bayesian inference if the family $\mathcal{Q}$ is rich enough to contain the true posterior
  • Potential Problems
  • Errors from repeated approximation → forget old tasks
  • Minimization at each step is also approximate → information loss
  • Solution: Coreset
  • Coreset: small representative set of data from previously observed tasks
  • Analogous to episodic memory (Lopez-Paz & Ranzato, 2017)
  • Coreset VCL: equivalent to a message-passing implementation of VI in which the coreset data point updates are scheduled after updating the other data
  • Coreset $C_t$: updated using $C_{t-1}$ and selected data points from $\mathcal{D}_t$ (e.g. random selection, K-center algorithm, ...)
  • K-center algorithm: return K data points that are spread throughout the input space (Gonzalez, 1985); a small sketch of this selection step follows the algorithm below
  • Variational Recursion: $p(\theta \mid \mathcal{D}_{1:t}) \propto p(\theta \mid \mathcal{D}_{1:t} \setminus C_t)\, p(C_t \mid \theta) \approx \tilde{q}_t(\theta)\, p(C_t \mid \theta)$

  • Algorithm
  • Step 1: Observe $\mathcal{D}_t$
  • Step 2: Update the coreset $C_t$ using $C_{t-1}$ and $\mathcal{D}_t$
  • Step 3: Update the variational distribution for the non-coreset data points (used for propagation): $\tilde{q}_t(\theta) = \mathrm{proj}\left(\tilde{q}_{t-1}(\theta)\, p(\mathcal{D}_t \cup C_{t-1} \setminus C_t \mid \theta)\right)$
  • Step 4: Compute the final variational distribution (used for prediction only, not propagation): $q_t(\theta) = \mathrm{proj}\left(\tilde{q}_t(\theta)\, p(C_t \mid \theta)\right)$
  • Step 5: Perform prediction at a test input $x^*$: $p(y^* \mid x^*, \mathcal{D}_{1:t}) = \int q_t(\theta)\, p(y^* \mid \theta, x^*)\, d\theta$
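The coreset update in Step 2 can use the greedy K-center heuristic mentioned above. Below is a minimal NumPy sketch of that selection step; it is an illustration (function name and sizes are made up), not this repository's implementation. In coreset VCL the new coreset $C_t$ would then be $C_{t-1}$ plus the points selected from $\mathcal{D}_t$.

```python
import numpy as np

def k_center_greedy(X, K, seed=0):
    """Greedy K-center selection (Gonzalez, 1985): repeatedly add the point that is
    farthest from the currently selected set. Returns indices into X."""
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(len(X)))]                # arbitrary first centre
    dists = np.linalg.norm(X - X[selected[0]], axis=1)    # distance to nearest selected point
    for _ in range(K - 1):
        idx = int(np.argmax(dists))                       # farthest remaining point
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(X - X[idx], axis=1))
    return np.array(selected)

# Example: pick a 20-point coreset from the current task's inputs.
X_t = np.random.default_rng(1).normal(size=(1000, 2))
coreset_idx = k_center_greedy(X_t, K=20)
```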

3. VCL in Deep Discriminative Models

  • Multi-head Networks
  • Standard architecture used for multi-task learning (Bakker & Heskes, 2003)
  • Share parameters close to the inputs / Separate heads for each output
  • More advanced model structures:
  • for continual learning (Rusu et al., 2016)
  • for multi-task learning in general (Swietojanski & Renals, 2014; Rebuffi et al., 2017)
  • automatic continual model building: adding new structure as new tasks are encountered
  • This paper assumes that the model structure is known a priori
  • Formulation
  • Model parameters $\theta = \{\theta^H_{1:T}, \theta^S\}$: task-specific head parameters plus shared parameters
  • Shared parameters $\theta^S$: updated constantly (for every task)
  • Head parameters $\theta^H_t$: remain at the prior at the beginning and are updated incrementally as each new task emerges
  • For simplicity, use a Gaussian mean-field approximate posterior: $q_t(\theta) = \prod_{d=1}^{D} \mathcal{N}\left(\theta_{t,d};\, \mu_{t,d},\, \sigma^2_{t,d}\right)$
  • Network Training
  • Maximize the negative online variational free energy (equivalently, the variational lower bound to the online marginal likelihood) with respect to the variational parameters: $\mathcal{L}^t_{\mathrm{VCL}}(q_t(\theta)) = \sum_{n=1}^{N_t} \mathbb{E}_{\theta \sim q_t(\theta)}\left[\log p\left(y_t^{(n)} \mid \theta, x_t^{(n)}\right)\right] - \mathrm{KL}\left(q_t(\theta) \,\|\, q_{t-1}(\theta)\right)$ (a minimal training sketch follows this list)
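As a concrete (and heavily simplified) illustration of the mean-field posterior and the objective above, here is a PyTorch-style sketch assuming a single variational linear layer; the multi-head structure is omitted, and the class and function names are assumptions rather than this repository's code.

```python
import torch
import torch.nn.functional as F
from torch import nn
from torch.distributions import Normal, kl_divergence

class MeanFieldLinear(nn.Module):
    """A linear layer with a factorised (mean-field) Gaussian posterior over its weights."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.w_mu = nn.Parameter(0.1 * torch.randn(d_out, d_in))
        self.w_logvar = nn.Parameter(torch.full((d_out, d_in), -6.0))
        self.b_mu = nn.Parameter(torch.zeros(d_out))
        self.b_logvar = nn.Parameter(torch.full((d_out,), -6.0))

    def posterior(self):
        """Current q(theta) as a single diagonal Gaussian over all weights and biases."""
        mu = torch.cat([self.w_mu.flatten(), self.b_mu])
        std = torch.cat([self.w_logvar.flatten(), self.b_logvar]).mul(0.5).exp()
        return Normal(mu, std)

    def forward(self, x):
        # Reparameterised weight sample: a one-sample Monte Carlo estimate
        # of the expected log-likelihood term.
        w = self.w_mu + self.w_logvar.mul(0.5).exp() * torch.randn_like(self.w_mu)
        b = self.b_mu + self.b_logvar.mul(0.5).exp() * torch.randn_like(self.b_mu)
        return F.linear(x, w, b)

def online_free_energy(layer, prev_posterior, x, y, n_task):
    """Negative online variational free energy for task t (to be maximised):
    E_q[log p(y | theta, x)] rescaled to the task size, minus KL(q_t || q_{t-1})."""
    log_lik = -F.cross_entropy(layer(x), y, reduction="mean") * n_task
    kl = kl_divergence(layer.posterior(), prev_posterior).sum()
    return log_lik - kl
```

Here `prev_posterior` would be the (detached) posterior learned after task $t-1$, or the prior for the first task, and `n_task` is the task size $N_t$ so that a minibatch estimate is rescaled to the full dataset.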

4. VCL in Deep Generative Models

  • Deep Generative Models
  • Formulation - VAE approach (batch learning)

  • $p(z)$: prior over latent variables, typically Gaussian
  • $p(x \mid z, \theta)$: defined by a DNN $f_\theta(z)$, where $\theta$ collects the weight matrices and bias vectors
  • Learning $\theta$: approximate maximum likelihood estimation, i.e. maximize the variational lower bound with respect to $\theta$ and the parameters $\phi$ of the encoder $q_\phi(z \mid x)$:
    $\mathcal{L}_{\mathrm{VAE}}(\theta, \phi) = \sum_{n=1}^{N} \mathbb{E}_{q_\phi(z^{(n)} \mid x^{(n)})}\left[\log \frac{p(x^{(n)} \mid z^{(n)}, \theta)\, p(z^{(n)})}{q_\phi(z^{(n)} \mid x^{(n)})}\right]$

  • This gives no parameter uncertainty estimates, which are what is used to weight the information learned from old data
  • Formulation - VCL approach (continual learning)
  • Approximate the full posterior over parameters: $q_t(\theta) \approx p(\theta \mid \mathcal{D}_{1:t})$
  • Maximize the full variational lower bound with respect to $q_t$ and $\phi$:
    $\mathcal{L}^t_{\mathrm{VCL}}(q_t, \phi) = \mathbb{E}_{q_t(\theta)}\left[\sum_{n=1}^{N_t} \mathbb{E}_{q_{\phi_t}(z_t^{(n)} \mid x_t^{(n)})}\left[\log \frac{p(x_t^{(n)} \mid z_t^{(n)}, \theta)\, p(z_t^{(n)})}{q_{\phi_t}(z_t^{(n)} \mid x_t^{(n)})}\right]\right] - \mathrm{KL}\left(q_t(\theta) \,\|\, q_{t-1}(\theta)\right)$
  • The encoder parameters $\phi_t$ are task-specific, although it is likely to be beneficial to share (parts of) these encoder networks across tasks
  • Model Architecture
  • Latent variables → Intermediate-level representations
  • Architecture 1: shared bottom network - suitable when data are composed of a common set of structural primitives (e.g. strokes)
  • Architecture 2: shared head network - information tends to be entirely encoded in the (task-specific) bottom networks (a small sketch of the task-specific / shared decoder split follows this list)
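To make the two generator layouts concrete, here is a small PyTorch sketch of a decoder split into a per-task component and a shared component (layer sizes and names are illustrative assumptions, not this repository's architecture); swapping which of the two parts is shared gives the two architectures described above.

```python
import torch
from torch import nn

latent_dim, hidden_dim, data_dim, n_tasks = 32, 256, 784, 5   # illustrative sizes

# Per-task components: one small network per task, added as new tasks arrive.
task_nets = nn.ModuleList([
    nn.Sequential(nn.Linear(latent_dim, hidden_dim), nn.ReLU()) for _ in range(n_tasks)
])
# Shared component: a single network reused by every task.
shared_net = nn.Sequential(nn.Linear(hidden_dim, data_dim), nn.Sigmoid())

def decode(z, task_id):
    """Generator for one task: latent z -> intermediate representation h -> data x.
    Swapping which of the two components is per-task and which is shared yields
    the two architectures listed above."""
    h = task_nets[task_id](z)
    return shared_net(h)

x = decode(torch.randn(8, latent_dim), task_id=0)   # a batch of 8 generated samples
```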

5. Related Work

  • Continual Learning for Deep Discriminative Models (regularized MLE)

  • ML Estimation - set the regularization term to zero (no penalty for deviating from previously learned parameters)
  • MAP Estimation - assume a Gaussian prior and use cross-validation to set its scale → catastrophic forgetting
  • Laplace Propagation (LP) (Smola et al., 2004) - recursion for the quadratic regularizer using Laplace's approximation: each new task adds the Hessian of its negative log-likelihood at the current estimate
  • Diagonal LP: retain only the diagonal terms to avoid computing and storing the full Hessian

  • Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) - a modified diagonal LP
  • Approximates the average Hessian of the likelihoods using the Fisher information
  • Regularization term: introduces a hyperparameter, removes the prior term, and regularizes toward the intermediate parameter estimates from each previous task (a sketch of this style of penalty follows the table below)

  • Synaptic Intelligence (SI) (Zenke et al., 2017) - computes the regularization strengths from a measure of each parameter's importance to each task
  • Approximate Bayesian training of neural networks (methods for approximating the weight posterior $p(\theta \mid \mathcal{D})$):

|Approach|References|
|-|-|
|extended Kalman filtering|Singhal & Wu, 1989|
|Laplace's approximation|MacKay, 1992|
|variational inference|Hinton & Van Camp, 1993; Barber & Bishop, 1998; Graves, 2011; Blundell et al., 2015; Gal & Ghahramani, 2016|
|sequential Monte Carlo|de Freitas et al., 2000|
|expectation propagation|Hernández-Lobato & Adams, 2015|
|approximate power EP|Hernández-Lobato et al., 2016|
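For comparison with VCL's KL term, the sketch below shows the diagonal quadratic penalty used by EWC-style regularization. It is illustrative only (the argument names are assumptions); the per-parameter importance weights would come from a diagonal Fisher-information estimate accumulated on earlier tasks.

```python
import torch

def ewc_penalty(params, old_params, importances, lam=1.0):
    """Quadratic regulariser 0.5 * lam * sum_i F_i * (theta_i - theta*_i)^2 pulling the
    current parameters toward values learned on earlier tasks, weighted by an estimate
    of each parameter's importance (e.g. a diagonal Fisher-information estimate)."""
    penalty = torch.zeros(())
    for p, p_old, f in zip(params, old_params, importances):
        penalty = penalty + (f * (p - p_old.detach()) ** 2).sum()
    return 0.5 * lam * penalty

# Usage sketch (names hypothetical):
# loss = task_nll + ewc_penalty(model.parameters(), saved_params, fishers)
```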

  • Continual Learning for Deep Generative Models
  • Naïve approach: apply a separate VAE to each new dataset with parameters initialized at the previous task's values → catastrophic forgetting
  • Alternative: add an EWC regularization term to the VAE objective, approximating the marginal likelihood by the variational lower bound
  • Similar approximations can be used for the Hessian matrices required by LP and for SI (the marginal likelihood can also be estimated by importance sampling: Burda et al., 2016)
