Deep-Generative-Models-for-Natural-Language-Processing

DGMs 4 NLP, Deep Generative Models for Natural Language Processing, resources, conference mapping and paper list.

Yao Fu, Columbia University, yao.fu@columbia.edu

**Update**: Advanced Neural Architectures; Monte-Carlo Gradient Estimation; Continuous Relaxations

**TODO**: Non-autoregressive Generation; Decoding methods; Score-based Generative Models; A* sampling; Contrastive Divergence; EBM; continuous relaxation of discrete structures; optimization for discrete structures; Langevin Dynamics; Posterior Regularization

Why do we want deep generative models? Because we want to learn the latent representations for language. Human language contains rich latent factors: the continuous ones might be emotion and intention, while the discrete or structural ones might be POS/NER tags or syntax trees. They are latent because we only observe the sentence. They are also generative: humans produce language based on the overall idea, the current emotion, the syntax, and all other things we can or cannot name.

How can we model them in a statistically principled way? Can we have a flexible framework that allows us to incorporate explicit supervision signals when we have labels, add distant supervision or logical/statistical constraints when we do not have labels but have other prior knowledge, or simply infer whatever makes the most sense when we have neither labels nor prior knowledge? Is it possible to exploit the modeling power of advanced neural architectures while still being mathematical and probabilistic? DGMs allow us to achieve these goals.

Let us begin the journey.

Citation:

@article{yao2019DGM4NLP,
  title   = "Deep Generative Models for Natural Language Processing",
  author  = "Yao Fu",
  year    = "2018",
  url     = "https://github.com/FranxYao/Deep-Generative-Models-for-Natural-Language-Processing"
}

Resources

Deep Generative Models

♦︎ John's DGM: Columbia STAT 8201, Deep Generative Models, by John Cunningham
- The DGM seminar at Columbia. The first part of this course focuses on VAEs and the second part focuses on GANs.
- The discussion about Wasserstein GANs is amazing. Do take a look.
♦︎ Sasha's tutorial: A Tutorial on Deep Latent Variable Models of Natural Language (link), EMNLP 18
- Yoon Kim, Sam Wiseman and Alexander M. Rush, Harvard
Wilker Aziz's DGM Landscape and their tutorial
- A great guidebook for VI: a graph over the VI literature that discusses the connections between different techniques.
Stanford CS 236, Deep Generative Models (link)
NYU Deep Generative Models (link)
U Toronto CS 2541 Differentiable Inference and Generative Models, CS 2547 Learning Discrete Latent Structures.
Berkeley CS294-158 Deep Unsupervised Learning.
Columbia STCS 8101 Representation Learning: A Probabilistic Perspective

Graphical Models Foundations

The foundation of DGMs is built upon probabilistic graphical models, so we take a look at the following resources.

Blei's Foundation of Graphical Models course, STAT 6701 at Columbia (link)
- Foundation of probabilistic modeling, graphical models, and approximate inference.
Xing's Probabilistic Graphical Models, 10-708 at CMU (link)
- A really heavy course with extensive materials.
- 5 modules in total: exact inference, approximate inference, DGMs, reinforcement learning, and nonparametrics.
- All the lecture notes, video recordings, and homework assignments are publicly available.
♦︎ Collins' Natural Language Processing, COMS 4995 at Columbia (link)
- Many inference methods for structured models are introduced. Also take a look at related notes from Collins' homepage
- Also check out bilibili

Textbooks and PhD Theses

♦︎ Pattern Recognition and Machine Learning. Christopher M. Bishop. 2006
- Probably the most classical textbook
- In my own understanding, the core of this book is chapters 8 to 13, especially chapter 10, which introduces variational inference.
Machine Learning: A Probabilistic Perspective. Kevin P. Murphy. 2012
- Compared with the PRML Bishop book, this book may be used as a super-detailed handbook for various graphical models and inference methods.
Deep Generative Models for Natural Language Processing. (link)
- Yishu Miao, Oxford, 2017
Deep Latent Variable Models for Natural Language (link)
- Yoon Kim, Harvard, 2020

NLP Side

We will focus on two topics, generation and structural inference, and on the advanced neural network architectures for them. We start with generation.

Generation

♦︎ Generating Sentences from a Continuous Space, CoNLL 15
- Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Jozefowicz, Samy Bengio
- Seems to be the first paper using VAEs for NLP
- An important point of this paper is the posterior collapse problem, which has many follow-ups
Neural variational inference for text processing, ICML 16
- Yishu Miao, Lei Yu, Phil Blunsom, Deepmind
Learning Neural Templates for Text Generation. EMNLP 2018
- Sam Wiseman, Stuart M. Shieber, Alexander Rush. Harvard
Residual Energy Based Models for Text Generation. ICLR 20
- Yuntian Deng, Anton Bakhtin, Myle Ott, Arthur Szlam, Marc'Aurelio Ranzato. Harvard and FAIR
♦︎ Cascaded Text Generation with Markov Transformers. Arxiv 20
- Yuntian Deng and Alexander Rush
Paraphrase Generation with Latent Bag of Words. NeurIPS 2019.
- Yao Fu, Yansong Feng, and John P. Cunningham. Columbia
- Learns a bag of words as discrete latent variables, with differentiable subset sampling via Gumbel-top-k reparameterization.
♦︎ Stochastic Beams and Where to Find Them: The Gumbel-Top-k Trick for Sampling Sequences Without Replacement. ICML 19
- Wouter Kool, Herke van Hoof, Max Welling
- Gumbel topk, stochastic differentiable beam search
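
The Gumbel-top-k trick mentioned in the two entries above can be sketched in a few lines: perturb each log-probability with independent Gumbel(0, 1) noise and keep the k largest, which yields k distinct items sampled without replacement. This is a minimal stdlib-only illustration, not the code of either paper; the function name `gumbel_top_k` and the toy distribution `probs` are assumptions made for this example.

```python
import math
import random


def gumbel_top_k(log_probs, k, rng):
    """Sample k distinct indices without replacement via the Gumbel-top-k trick."""
    perturbed = []
    for i, lp in enumerate(log_probs):
        # Standard Gumbel noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = rng.random()
        while u == 0.0:  # guard against log(0)
            u = rng.random()
        perturbed.append((lp - math.log(-math.log(u)), i))
    # The indices of the k largest perturbed log-probabilities form the sample.
    perturbed.sort(reverse=True)
    return [i for _, i in perturbed[:k]]


rng = random.Random(0)
probs = [0.5, 0.3, 0.15, 0.05]  # toy categorical distribution (hypothetical)
sample = gumbel_top_k([math.log(p) for p in probs], k=2, rng=rng)
print(sample)
```

With k = 1 this reduces to the Gumbel-max trick, so the single returned index is an exact sample from the categorical distribution.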

Structured Prediction

Structured prediction covers the so-called core NLP tasks such as chunking, tagging, and parsing.

A good starting point is Sasha's library, Torch-Struct, as it integrates multiple core and advanced techniques.

♦︎♦︎ Torch-Struct: Deep Structured Prediction Library
- Alexander M. Rush. Cornell University
- github, paper, documentation
- Instantiates different CRFs with different semirings. The backward pass of the inference algorithms is implemented with autograd. Sasha implemented all of this alone, including the CUDA code.
♦︎ An introduction to Conditional Random Fields. Charles Sutton and Andrew McCallum. 2012
- Linear-chain CRFs. Modeling, inference and parameter estimation
♦︎ Inside-Outside and Forward-Backward Algorithms Are Just Backprop. Jason Eisner. 2016.
- The relationships between CRF inference and Autograd.
♦︎ Structured Attention Networks. ICLR 2017
- Yoon Kim, Carl Denton, Luong Hoang, Alexander M. Rush
- Structured attention w. linear chain and tree crfs.
Differentiable Dynamic Programming for Structured Prediction and Attention. Arthur Mensch and Mathieu Blondel. ICML 2018
- To differentiate the max operator in dynamic programming.
Recurrent Neural Network Grammars. NAACL 16
- Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah Smith.
- A transition-based generative model of the joint prob of trees and sentences.
- Inference: use importance sampling to calculate the sentence marginal prob. Use the discriminative model as the proposal dist.
Unsupervised Recurrent Neural Network Grammars, NAACL 19
- Yoon Kim, Alexander Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and Gabor Melis
Semantic Parsing with Semi-Supervised Sequential Autoencoders. 2016
- Tomas Kocisky, Gabor Melis, Edward Grefenstette, Chris Dyer, Wang Ling, Phil Blunsom, Karl Moritz Hermann
Differentiable Perturb-and-Parse: Semi-Supervised Parsing with a Structured Variational Autoencoder, ICLR 19
- Caio Corro, Ivan Titov, Edinburgh
- Reparameterizes sampling from a CRF using Gumbel perturbation and a continuous relaxation of the Eisner algorithm.
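
To make the semiring idea from the Torch-Struct entry concrete, here is a small sketch of the forward recursion for a linear-chain model, written so that swapping the semiring "plus" changes what is computed: log-sum-exp gives the log-partition function, and max gives the Viterbi score. This is a stdlib-only illustration and not Torch-Struct's API; `chain_forward`, `sequence_score`, and the toy potentials are hypothetical names and values chosen for this example.

```python
import itertools
import math


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def chain_forward(unary, pairwise, plus):
    """Forward recursion for a linear-chain model in log space.

    unary[t][y] scores label y at position t; pairwise[a][b] scores a -> b.
    `plus` is the semiring addition: logsumexp yields the log-partition
    function, max yields the score of the best label sequence (Viterbi).
    Semiring multiplication is addition of log-scores in both cases.
    """
    n_labels = len(unary[0])
    alpha = list(unary[0])
    for t in range(1, len(unary)):
        alpha = [
            plus([alpha[a] + pairwise[a][b] for a in range(n_labels)]) + unary[t][b]
            for b in range(n_labels)
        ]
    return plus(alpha)


# Toy potentials (hypothetical): 3 positions, 2 labels.
unary = [[0.2, -0.1], [0.0, 0.5], [-0.3, 0.4]]
pairwise = [[0.1, -0.2], [0.3, 0.0]]

log_z = chain_forward(unary, pairwise, logsumexp)
best_score = chain_forward(unary, pairwise, max)


def sequence_score(labels):
    emit = sum(unary[t][y] for t, y in enumerate(labels))
    trans = sum(pairwise[a][b] for a, b in zip(labels, labels[1:]))
    return emit + trans


# Brute-force enumeration of all 2^3 label sequences, as a sanity check.
all_scores = [sequence_score(seq) for seq in itertools.product(range(2), repeat=3)]
print(log_z, logsumexp(all_scores))
print(best_score, max(all_scores))
```

The same recursion with a different `plus` is all that separates marginal inference from MAP decoding, which is the design choice the semiring abstraction captures.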

Advanced Neural Network Architectures

THUNLP: Pre-trained Language Model paper list (link)
- Xiaozhi Wang and Zhengyan Zhang, Tsinghua University
Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks
- Yikang Shen, Shawn Tan, Alessandro Sordoni, Aaron Courville. Mila, MSR
♦︎ Cascaded Text Generation with Markov Transformers. Arxiv 20
- Yuntian Deng and Alexander Rush

ML Side

Now the ML side. Before discussing VAEs, GANs, and flows, we first review MCMC and VI, the two most widely used approximate inference methods.

Sampling Methods

♦︎ Probabilistic inference using Markov chain Monte Carlo methods. 1993
- Radford M Neal
- Markov Chains; Gibbs Sampling; Metropolis-Hastings
Elements of Sequential Monte Carlo (link)
- Christian A. Naesseth, Fredrik Lindsten, Thomas B. Schön
A Conceptual Introduction to Hamiltonian Monte Carlo (link)
- Michael Betancourt
Candidate Sampling (link)
- Google Tensorflow Blog
Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. AISTATS 2010
- Michael Gutmann, Aapo Hyvärinen. University of Helsinki

Variational Inference, VI

♦︎ Cambridge Variational Inference Reading Group (link)
- Sam Power. University of Cambridge
♦︎ Variational Inference: A Review for Statisticians.
- David M. Blei, Alp Kucukelbir, Jon D. McAuliffe.
- Mean-field variational family; coordinate ascent algorithm; Bayesian mixture of Gaussians; VI w. exponential families.
Stochastic Variational Inference
- Matthew D. Hoffman, David M. Blei, Chong Wang, John Paisley
- Natural gradient of the ELBO; stochastic optimization; Bayesian nonparametrics for the hierarchical Dirichlet process.
Variational Bayesian Inference with Stochastic Search. ICML 12
- John Paisley, David Blei, Michael Jordan. Berkeley and Princeton

VAEs

♦︎ Auto-Encoding Variational Bayes, ICLR 14
- Diederik P. Kingma, Max Welling
Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ICML 14
- Danilo Jimenez Rezende, Shakir Mohamed, Daan Wierstra
- Reparameterization w. deep gaussian models.
Semi-amortized variational autoencoders, ICML 18
- Yoon Kim, Sam Wiseman, Andrew C. Miller, David Sontag, Alexander M. Rush, Harvard
Adversarially Regularized Autoencoders, ICML 18
- Jake (Junbo) Zhao, Yoon Kim, Kelly Zhang, Alexander M. Rush, Yann LeCun. NYU, Harvard, FAIR
- A wrapup of the major VAE/ GANs
- The presentation of this paper at the Columbia DGM seminar course.
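
As a small illustration of the reparameterization used in the VAE papers above, the sketch below draws a diagonal-Gaussian sample as a deterministic function of the parameters plus parameter-free noise, and evaluates the closed-form KL term to a standard normal prior that appears in the ELBO. It is a stdlib-only sketch under the assumption of a diagonal Gaussian posterior; the function names and the example parameters are hypothetical.

```python
import math
import random


def reparameterized_sample(mu, log_var, rng):
    """Draw z ~ N(mu, diag(exp(log_var))) as z = mu + sigma * eps, eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, diag(exp(log_var))) || N(0, I))."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu, log_var = [1.0, -2.0], [0.0, math.log(0.25)]  # hypothetical encoder outputs
z = reparameterized_sample(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
print(z, kl)
```

Because the randomness enters only through `eps`, the sample is a differentiable function of `mu` and `log_var`, which is what lets an autodiff framework pass gradients through the sampling step.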

Reparameterization

More on reparameterization: how to reparameterize Gaussian mixtures, permutation matrices, and rejection samplers (Gamma and Dirichlet).

Stochastic Backpropagation through Mixture Density Distributions, Arxiv 16
- Alex Graves
- To reparameterize Gaussian Mixture
Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms. AISTATS 2017
- Christian A. Naesseth, Francisco J. R. Ruiz, Scott W. Linderman, David M. Blei
Implicit Reparameterization Gradients. NeurIPS 2018.
- Michael Figurnov, Shakir Mohamed, and Andriy Mnih
- Really smart way to reparameterize many complex distributions.
Categorical Reparameterization with Gumbel-Softmax. ICLR 2017
- Eric Jang, Shixiang Gu, Ben Poole
♦︎ The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. ICLR 2017
- Chris J. Maddison, Andriy Mnih, and Yee Whye Teh
Invertible Gaussian Reparameterization: Revisiting the Gumbel-Softmax. 2020
- Andres Potapczynski, Gabriel Loaiza-Ganem, John P. Cunningham
Reparameterizable Subset Sampling via Continuous Relaxations. IJCAI 2019
- Sang Michael Xie and Stefano Ermon
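
The Gumbel-Softmax / Concrete relaxation listed above can be sketched as follows: add Gumbel(0, 1) noise to the logits, divide by a temperature, and apply a softmax, so that the result is a point on the simplex that approaches a one-hot vector as the temperature goes to zero. This is a minimal stdlib-only sketch, not the code of either paper; `gumbel_softmax_sample` and the toy logits are names and values assumed for the example.

```python
import math
import random


def gumbel_softmax_sample(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        u = rng.random()
        while u == 0.0:  # guard against log(0)
            u = rng.random()
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
logits = [math.log(0.7), math.log(0.2), math.log(0.1)]  # toy distribution
soft = gumbel_softmax_sample(logits, temperature=1.0, rng=rng)
sharp = gumbel_softmax_sample(logits, temperature=0.01, rng=rng)
print(soft)
print(sharp)
```

The argmax of the relaxed sample is an exact categorical sample at any temperature; the temperature only controls how close the relaxed vector is to that one-hot vertex, trading bias for gradient variance.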

GANs

Generative Adversarial Networks, NIPS 14
- Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
- GAN origin
- At the optimal discriminator, the original GAN objective minimizes the Jensen-Shannon divergence between the data and model distributions, which may lead to vanishing gradients. To tackle this problem, the Wasserstein GAN was proposed, using the earth mover distance. The following two papers show the birth of WGAN.
Towards principled methods for training generative adversarial networks, ICLR 2017
- Martin Arjovsky and Leon Bottou
- Discusses the distance between distributions, but uses many hacky methods.
♦︎ Wasserstein GAN
- Martin Arjovsky, Soumith Chintala, Léon Bottou
- The principled methods, born from hacky methods.
InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. NIPS 2016
- Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel. UC Berkeley. OpenAI
- Variational mutual information maximization; unsupervised disentangled representation learning.
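
The vanishing-gradient issue behind the move to Wasserstein GANs can be seen in a toy calculation. For two point masses with disjoint support, the Jensen-Shannon divergence (which the original GAN objective minimizes at the optimal discriminator) stays at log 2 however far apart they are, so it gives no signal about direction, while the earth mover distance grows with the gap. The sketch below is a stdlib-only illustration on a 1-D integer grid; all function names are hypothetical.

```python
import math


def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def wasserstein_1d(p, q):
    """Earth mover distance between histograms on the integer grid 0, 1, 2, ...

    In one dimension with unit spacing it equals the sum of absolute
    differences between the two cumulative distribution functions.
    """
    total, cdf_gap = 0.0, 0.0
    for pi, qi in zip(p, q):
        cdf_gap += pi - qi
        total += abs(cdf_gap)
    return total


def point_mass(position, size=10):
    return [1.0 if i == position else 0.0 for i in range(size)]


p = point_mass(0)
for shift in (1, 3, 6):
    q = point_mass(shift)
    print(shift, round(js_divergence(p, q), 4), wasserstein_1d(p, q))
```

The printed Jensen-Shannon column is constant across shifts, whereas the earth mover column equals the shift itself.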

Flows

♦︎ Flow Based Deep Generative Models, from Lil's log
Variational Inference with Normalizing Flows, ICML 15
- Danilo Jimenez Rezende, Shakir Mohamed
Improved Variational Inference with Inverse Autoregressive Flow
- Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, Max Welling
Density estimation using Real NVP. ICLR 17
- Laurent Dinh, Jascha Sohl-Dickstein, Samy Bengio
Learning About Language with Normalizing Flows
- Graham Neubig, CMU, slides
Latent Normalizing Flows for Discrete Sequences. ICML 2019.
- Zachary M. Ziegler and Alexander M. Rush
Discrete Flows: Invertible Generative Models of Discrete Data. 2019
- Dustin Tran, Keyon Vafa, Kumar Krishna Agrawal, Laurent Dinh, Ben Poole

Advanced-Topics

Gradient Estimation and Optimization

♦︎ Monte Carlo Gradient Estimation in Machine Learning
- Shakir Mohamed, Mihaela Rosca, Michael Figurnov, Andriy Mnih. DeepMind
Variational Inference for Monte Carlo Objectives. ICML 16
- Andriy Mnih, Danilo J. Rezende. DeepMind
REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. NIPS 17
- George Tucker, Andriy Mnih, Chris J. Maddison, Dieterich Lawson, Jascha Sohl-Dickstein. Google Brain, DeepMind, Oxford
♦︎ Backpropagation Through the Void: Optimizing Control Variates for Black-box Gradient Estimation. ICLR 18
- Will Grathwohl, Dami Choi, Yuhuai Wu, Geoffrey Roeder, David Duvenaud. U Toronto and Vector Institute

Continuous Relaxation of Discrete Structures

♦︎♦︎ Gradient Estimation with Stochastic Softmax Tricks. 2020
- Max B. Paulus, Dami Choi, Daniel Tarlow, Andreas Krause, Chris J. Maddison.
Differentiable Dynamic Programming for Structured Prediction and Attention. ICML 18
- Arthur Mensch, Mathieu Blondel. Inria Parietal and NTT Communication Science Laboratories
Stochastic Optimization of Sorting Networks via Continuous Relaxations
- Aditya Grover, Eric Wang, Aaron Zweig, Stefano Ermon
Differentiable Ranks and Sorting using Optimal Transport
- Marco Cuturi, Olivier Teboul, Jean-Philippe Vert
Reparameterizing the Birkhoff Polytope for Variational Permutation Inference. AISTATS 2018
- Scott W. Linderman, Gonzalo E. Mena, Hal Cooper, Liam Paninski, John P. Cunningham.
A Regularized Framework for Sparse and Structured Neural Attention. NeurIPS 2017
SparseMAP: Differentiable Sparse Structured Inference. ICML 2018

Information Theory

♦︎ Elements of Information Theory. Cover and Thomas. 1991
♦︎ On Variational Bounds of Mutual Information. ICML 2019
- Ben Poole, Sherjil Ozair, Aaron van den Oord, Alexander A. Alemi, George Tucker
- A comprehensive discussion of all these MI variational bounds
Learning Deep Representations By Mutual Information Estimation And Maximization. ICLR 2019
- R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, Phil Bachman, Adam Trischler, and Yoshua Bengio
- A detailed comparison between different MI estimators, section 3.2.
MINE: Mutual Information Neural Estimation
- Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, R Devon Hjelm
Deep Variational Information Bottleneck. ICLR 2017
- Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, Kevin Murphy. Google Research

Disentanglement and Interpretability

Identifying Bayesian Mixture Models
- Michael Betancourt
- The source of non-identifiability is symmetry and exchangeability in both the prior and the conditional.
- Two ways of breaking the symmetry:
  - Ordering of the mixture component
  - non-exchangeable prior
Disentangling Disentanglement in Variational Autoencoders. ICML 2019
- Emile Mathieu, Tom Rainforth, N. Siddharth, Yee Whye Teh
Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. ICML 2019
- Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, Olivier Bachem

Invariance

Emergence of Invariance and Disentanglement in Deep Representations
- Alessandro Achille and Stefano Soatto. UCLA. JMLR 2018
Invariant Risk Minimization
- Martin Arjovsky, Leon Bottou, Ishaan Gulrajani, David Lopez-Paz. 2019.

Posterior Regularization

Posterior Regularization for Structured Latent Variable Models
- Kuzman Ganchev, João Graça, Jennifer Gillenwater, Ben Taskar. JMLR 2010.
Posterior Control of Blackbox Generation
- Xiang Lisa Li and Alexander M. Rush. 2019.

Reflections and Critiques

The continuous Bernoulli: fixing a pervasive error in variational autoencoders. NeurIPS 2019
- Gabriel Loaiza-Ganem and John P. Cunningham. Columbia.
- In science, many things are intuitively right yet actually wrong. Discovering such errors is always nontrivial and requires inspiration.
- This paper is an example: applying the Bernoulli likelihood to continuous [0, 1]-valued data is not equivalent to applying it to binary data, and leaves a gap equal to a missing normalizing constant.
Do Deep Generative Models Know What They Don't Know? ICLR 2019
- Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, Balaji Lakshminarayanan
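
The normalizing-constant gap described in the continuous Bernoulli entry above can be checked numerically. The Bernoulli formula λ^x (1-λ)^(1-x), viewed as a function of continuous x in [0, 1], does not integrate to one; multiplying by C(λ) = 2 atanh(1 - 2λ) / (1 - 2λ) (with C(1/2) = 2) restores a proper density. The sketch below is stdlib-only, with hypothetical function names and an arbitrary choice of λ = 0.8.

```python
import math


def cb_log_normalizer(lam):
    """log C(lam), with C(lam) = 2 * atanh(1 - 2 * lam) / (1 - 2 * lam) and C(0.5) = 2."""
    if abs(lam - 0.5) < 1e-6:
        return math.log(2.0)
    return math.log(2.0 * math.atanh(1.0 - 2.0 * lam) / (1.0 - 2.0 * lam))


def bernoulli_log_lik(x, lam):
    """Bernoulli log-likelihood formula, here evaluated at continuous x in [0, 1]."""
    return x * math.log(lam) + (1.0 - x) * math.log(1.0 - lam)


def continuous_bernoulli_log_density(x, lam):
    return cb_log_normalizer(lam) + bernoulli_log_lik(x, lam)


def integrate(f, n=20000):
    """Midpoint rule on [0, 1]."""
    return sum(f((i + 0.5) / n) for i in range(n)) / n


lam = 0.8  # arbitrary parameter for illustration
mass_unnormalized = integrate(lambda x: math.exp(bernoulli_log_lik(x, lam)))
mass_normalized = integrate(lambda x: math.exp(continuous_bernoulli_log_density(x, lam)))
print(mass_unnormalized, mass_normalized)
```

The first number is well below one, which is the gap that is silently dropped when a Bernoulli reconstruction loss is applied to continuous pixel intensities.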

More Applications

TODO: summarization; machine translation; dialog

Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization, NIPS 18
- Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, Bill Dolan
Discovering Discrete Latent Topics with Neural Variational Inference, ICML 17
- Yishu Miao, Edward Grefenstette, Phil Blunsom. Oxford
TopicRNN: A Recurrent Neural Network with Long-Range Semantic Dependency, ICLR 17
- Adji B. Dieng, Chong Wang, Jianfeng Gao, John William Paisley
Topic Aware Neural Response Generation, AAAI 17
- Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, Wei-Ying Ma