Awesome Deep Phenomena

Our understanding of modern neural networks lags behind their practical successes. This growing gap poses a challenge to the pace of progress in machine learning because fewer pillars of knowledge are available to designers of models and algorithms (Hanie Sedghi). Inspired by the ICML 2019 workshop Identifying and Understanding Deep Learning Phenomena, I collect papers that present interesting empirical studies and insights into the nature of deep learning.

Empirical Study
Neural Collapse
Deep Double Descent
Lottery Ticket Hypothesis
Interactions with Neuroscience
Information Bottleneck
Neural Tangent Kernel
Others

Empirical Study

Todo

Neural Tangent Kernel (Lazy Training): A note on lazy training in supervised differentiable programming, Simple and Effective Regularization Methods for Training on Noisily Labeled Data with Generalization Guarantee, On Lazy Training in Differentiable Programming, https://m.thepaper.cn/baijiahao_5526219, https://zhuanlan.zhihu.com/p/105871604. Related to Fisher information? Fails with Adam; the phenomenon appears only with GD and SGD.
Double Descent

Empirical Study: 2022

ModelDiff: A Framework for Comparing Learning Algorithms. [paper] [code]
- Harshay Shah, Sung Min Park, Andrew Ilyas, Aleksander Madry.
- Key Word: Representation-based Comparison; Example-level Comparisons; Comparing Feature Attributions.
- Digest
  We study the problem of (learning) algorithm comparison, where the goal is to find differences between models trained with two different learning algorithms. We begin by formalizing this goal as one of finding distinguishing feature transformations, i.e., input transformations that change the predictions of models trained with one learning algorithm but not the other. We then present ModelDiff, a method that leverages the datamodels framework (Ilyas et al., 2022) to compare learning algorithms based on how they use their training data.
Overfreezing Meets Overparameterization: A Double Descent Perspective on Transfer Learning of Deep Neural Networks. [paper]
- Yehuda Dar, Lorenzo Luzi, Richard G. Baraniuk.
- Key Word: Transfer Learning; Deep Double Descent; Overfreezing.
- Digest
  We study the generalization behavior of transfer learning of deep neural networks (DNNs). We adopt the overparameterization perspective -- featuring interpolation of the training data (i.e., approximately zero train error) and the double descent phenomenon -- to explain the delicate effect of the transfer learning setting on generalization performance. We study how the generalization behavior of transfer learning is affected by the dataset size in the source and target tasks, the number of transferred layers that are kept frozen in the target DNN training, and the similarity between the source and target tasks.
What Images are More Memorable to Machines? [paper] [code]
- Junlin Han, Huangying Zhan, Jie Hong, Pengfei Fang, Hongdong Li, Lars Petersson, Ian Reid.
- Key Word: Self-Supervised Memorization Quantification.
- Digest
  This paper studies the problem of measuring and predicting how memorable an image is to pattern recognition machines, as a path to explore machine intelligence. Firstly, we propose a self-supervised machine memory quantification pipeline, dubbed "MachineMem measurer", to collect machine memorability scores of images. Similar to humans, machines also tend to memorize certain kinds of images, whereas the types of images that machines and humans memorize are different.
Harmonizing the object recognition strategies of deep neural networks with humans. [paper] [code]
- Thomas Fel, Ivan Felipe, Drew Linsley, Thomas Serre.
- Key Word: Interpretation; Neural Harmonizer; Psychophysics.
- Digest
  Across 84 different DNNs trained on ImageNet and three independent datasets measuring the where and the how of human visual strategies for object recognition on those images, we find a systematic trade-off between DNN categorization accuracy and alignment with human visual strategies for object recognition. State-of-the-art DNNs are progressively becoming less aligned with humans as their accuracy improves. We rectify this growing issue with our neural harmonizer: a general-purpose training routine that both aligns DNN and human visual strategies and improves categorization accuracy.
Broken Neural Scaling Laws. [paper] [code]
- Ethan Caballero, Kshitij Gupta, Irina Rish, David Krueger.
- Key Word: Neural Scaling Laws.
- Digest
  We present a smoothly broken power law functional form that accurately models and extrapolates the scaling behaviors of deep neural networks (i.e. how the evaluation metric of interest varies as the amount of compute used for training, number of model parameters, or training dataset size varies) for each task within a large and diverse set of upstream and downstream tasks, in zero-shot, prompted, and fine-tuned settings. This set includes large-scale vision and unsupervised language tasks, diffusion generative modeling of images, arithmetic, and reinforcement learning.
Pruning's Effect on Generalization Through the Lens of Training and Regularization. [paper]
- Tian Jin, Michael Carbin, Daniel M. Roy, Jonathan Frankle, Gintare Karolina Dziugaite.
- Key Word: Pruning; Regularization.
- Digest
  We show that size reduction cannot fully account for the generalization-improving effect of standard pruning algorithms. Instead, we find that pruning leads to better training at specific sparsities, improving the training loss over the dense model. We find that pruning also leads to additional regularization at other sparsities, reducing the accuracy degradation due to noisy examples over the dense model. Pruning extends model training time and reduces model size. These two factors improve training and add regularization respectively. We empirically demonstrate that both factors are essential to fully explaining pruning's impact on generalization.
How Much Data Are Augmentations Worth? An Investigation into Scaling Laws, Invariance, and Implicit Regularization. [paper] [code]
- Jonas Geiping, Micah Goldblum, Gowthami Somepalli, Ravid Shwartz-Ziv, Tom Goldstein, Andrew Gordon Wilson.
- Key Word: Data Augmentation; Neural Scaling Laws; Implicit Regularization.
- Digest
  Despite the clear performance benefits of data augmentations, little is known about why they are so effective. In this paper, we disentangle several key mechanisms through which data augmentations operate. Establishing an exchange rate between augmented and additional real data, we find that in out-of-distribution testing scenarios, augmentations which yield samples that are diverse, but inconsistent with the data distribution can be even more valuable than additional training data.
What does a deep neural network confidently perceive? The effective dimension of high certainty class manifolds and their low confidence boundaries. [paper] [code]
- Stanislav Fort, Ekin Dogus Cubuk, Surya Ganguli, Samuel S. Schoenholz.
- Key Word: Class Manifold; Linear Region; Out-of-Distribution Generalization.
- Digest
  Deep neural network classifiers partition input space into high confidence regions for each class. The geometry of these class manifolds (CMs) is widely studied and intimately related to model performance; for example, the margin depends on CM boundaries. We exploit the notions of Gaussian width and Gordon's escape theorem to tractably estimate the effective dimension of CMs and their boundaries through tomographic intersections with random affine subspaces of varying dimension. We show several connections between the dimension of CMs, generalization, and robustness.
In What Ways Are Deep Neural Networks Invariant and How Should We Measure This? [paper]
- Henry Kvinge, Tegan H. Emerson, Grayson Jorgenson, Scott Vasquez, Timothy Doster, Jesse D. Lew. NeurIPS 2022
- Key Word: Invariance and Equivariance.
- Digest
  We explore the nature of invariance and equivariance of deep learning models with the goal of better understanding the ways in which they actually capture these concepts on a formal level. We introduce a family of invariance and equivariance metrics that allows us to quantify these properties in a way that disentangles them from other metrics such as loss or accuracy.
Omnigrok: Grokking Beyond Algorithmic Data. [paper]
- Ziming Liu, Eric J. Michaud, Max Tegmark.
- Key Word: Grokking Dynamics.
- Digest
  Grokking, the unusual phenomenon for algorithmic datasets where generalization happens long after overfitting the training data, has remained elusive. We aim to understand grokking by analyzing the loss landscapes of neural networks, identifying the mismatch between training and test losses as the cause for grokking. We refer to this as the "LU mechanism" because training and test losses (against model weight norm) typically resemble "L" and "U", respectively. This simple mechanism can nicely explain many aspects of grokking: data size dependence, weight decay dependence, the emergence of representations, etc.
Relative representations enable zero-shot latent space communication. [paper]
- Luca Moschella, Valentino Maiorca, Marco Fumero, Antonio Norelli, Francesco Locatello, Emanuele Rodolà.
- Key Word: Representation Similarity; Model stitching.
- Digest
  Neural networks embed the geometric structure of a data manifold lying in a high-dimensional space into latent representations. Ideally, the distribution of the data points in the latent space should depend only on the task, the data, the loss, and other architecture-specific constraints. However, factors such as the random weights initialization, training hyperparameters, or other sources of randomness in the training phase may induce incoherent latent spaces that hinder any form of reuse. Nevertheless, we empirically observe that, under the same data and modeling choices, distinct latent spaces typically differ by an unknown quasi-isometric transformation: that is, in each space, the distances between the encodings do not change. In this work, we propose to adopt pairwise similarities as an alternative data representation, that can be used to enforce the desired invariance without any additional training.
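The invariance claimed above can be illustrated with a small sketch. It assumes cosine similarity to a fixed set of anchor points as the pairwise similarity and uses a 2-D rotation as a stand-in for the unknown isometric transformation; the function names and toy vectors are hypothetical and not taken from the paper's code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def relative_representation(x, anchors):
    """Describe x by its similarity to each anchor instead of its raw coordinates."""
    return [cosine(x, a) for a in anchors]

def rotate(v, angle):
    """Rotate a 2-D vector; rotations preserve angles and distances."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]

# Toy "latent space A" and a rotated copy standing in for "latent space B".
anchors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
x = [0.3, 0.8]
angle = 1.234  # arbitrary rotation, assumed for illustration

rel_a = relative_representation(x, anchors)
rel_b = relative_representation(rotate(x, angle), [rotate(a, angle) for a in anchors])

# The absolute coordinates differ, but the relative representations agree.
print(rel_a)
print(rel_b)
```

Because the rotation is applied to both the sample and the anchors, every cosine similarity is unchanged, so the two relative representations match up to floating-point error.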
Minimalistic Unsupervised Learning with the Sparse Manifold Transform. [paper]
- Yubei Chen, Zeyu Yun, Yi Ma, Bruno Olshausen, Yann LeCun.
- Key Word: Self-Supervision; Sparse Manifold Transform.
- Digest
  We describe a minimalistic and interpretable method for unsupervised learning, without resorting to data augmentation, hyperparameter tuning, or other engineering designs, that achieves performance close to the SOTA SSL methods. Our approach leverages the sparse manifold transform, which unifies sparse coding, manifold learning, and slow feature analysis. With a one-layer deterministic sparse manifold transform, one can achieve 99.3% KNN top-1 accuracy on MNIST, 81.1% KNN top-1 accuracy on CIFAR-10 and 53.2% on CIFAR-100.
Revisiting Neural Scaling Laws in Language and Vision. [paper]
- Ibrahim Alabdulmohsin, Behnam Neyshabur, Xiaohua Zhai.
- Key Word: Neural Scaling Laws; Multi-modal Learning.
- Digest
  The remarkable progress in deep learning in recent years is largely driven by improvements in scale, where bigger models are trained on larger datasets for longer schedules. To predict the benefit of scale empirically, we argue for a more rigorous methodology based on the extrapolation loss, instead of reporting the best-fitting (interpolating) parameters. We then present a recipe for estimating scaling law parameters reliably from learning curves. We demonstrate that it extrapolates more accurately than previous methods in a wide range of architecture families across several domains, including image classification, neural machine translation (NMT) and language modeling, in addition to tasks from the BIG-Bench evaluation benchmark.
A Review of Sparse Expert Models in Deep Learning. [paper]
- William Fedus, Jeff Dean, Barret Zoph.
- Key Word: Mixture-of-Experts.
- Digest
  Sparse expert models are a thirty-year-old concept re-emerging as a popular architecture in deep learning. This class of architecture encompasses Mixture-of-Experts, Switch Transformers, Routing Networks, BASE layers, and others, all with the unifying idea that each example is acted on by a subset of the parameters. By doing so, the degree of sparsity decouples the parameter count from the compute per example allowing for extremely large, but efficient models. The resulting models have demonstrated significant improvements across diverse domains such as natural language processing, computer vision, and speech recognition. We review the concept of sparse expert models, provide a basic description of the common algorithms, contextualize the advances in the deep learning era, and conclude by highlighting areas for future work.
A Data-Based Perspective on Transfer Learning. [paper] [code]
- Saachi Jain, Hadi Salman, Alaa Khaddaj, Eric Wong, Sung Min Park, Aleksander Madry.
- Key Word: Transfer Learning; Influence Function; Data Leakage.
- Digest
  It is commonly believed that, in transfer learning, including more pre-training data translates into better performance. However, recent evidence suggests that removing data from the source dataset can actually help too. In this work, we take a closer look at the role of the source dataset's composition in transfer learning and present a framework for probing its impact on downstream performance. Our framework gives rise to new capabilities such as pinpointing transfer learning brittleness as well as detecting pathologies such as data-leakage and the presence of misleading examples in the source dataset.
How Much More Data Do I Need? Estimating Requirements for Downstream Tasks. [paper]
- Rafid Mahmood, James Lucas, David Acuna, Daiqing Li, Jonah Philion, Jose M. Alvarez, Zhiding Yu, Sanja Fidler, Marc T. Law. CVPR 2022
- Key Word: Neural Scaling Laws; Active Learning.
- Digest
  Prior work on neural scaling laws suggests that the power-law function can fit the validation performance curve and extrapolate it to larger data set sizes. We find that this does not immediately translate to the more difficult downstream task of estimating the required data set size to meet a target performance. In this work, we consider a broad class of computer vision tasks and systematically investigate a family of functions that generalize the power-law function to allow for better estimation of data requirements.
When Does Re-initialization Work? [paper]
- Sheheryar Zaidi, Tudor Berariu, Hyunjik Kim, Jörg Bornschein, Claudia Clopath, Yee Whye Teh, Razvan Pascanu.
- Key Word: Re-initialization; Regularization.
- Digest
  We conduct an extensive empirical comparison of standard training with a selection of re-initialization methods to answer this question, training over 15,000 models on a variety of image classification benchmarks. We first establish that such methods are consistently beneficial for generalization in the absence of any other regularization. However, when deployed alongside other carefully tuned regularization techniques, re-initialization methods offer little to no added benefit for generalization, although optimal generalization performance becomes less sensitive to the choice of learning rate and weight decay hyperparameters. To investigate the impact of re-initialization methods on noisy data, we also consider learning under label noise. Surprisingly, in this case, re-initialization significantly improves upon standard training, even in the presence of other carefully tuned regularization techniques.
How You Start Matters for Generalization. [paper]
- Sameera Ramasinghe, Lachlan MacDonald, Moshiur Farazi, Hemanth Sartachandran, Simon Lucey.
- Key Word: Implicit regularization; Fourier Spectrum.
- Digest
  We promote a shift of focus towards initialization rather than neural architecture or (stochastic) gradient descent to explain this implicit regularization. Through a Fourier lens, we derive a general result for the spectral bias of neural networks and show that the generalization of neural networks is heavily tied to their initialization. Further, we empirically solidify the developed theoretical insights using practical, deep networks.
Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? [paper] [code]
- Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, Luke Zettlemoyer.
- Key Word: Natural Language Processing; In-Context Learning.
- Digest
  We show that ground truth demonstrations are in fact not required -- randomly replacing labels in the demonstrations barely hurts performance, consistently over 12 different models including GPT-3. Instead, we find that other aspects of the demonstrations are the key drivers of end task performance, including the fact that they provide a few examples of (1) the label space, (2) the distribution of the input text, and (3) the overall format of the sequence.

Empirical Study: 2021

Masked Autoencoders Are Scalable Vision Learners. [paper] [code]
- Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick. CVPR 2022
- Key Word: Self-Supervision; Autoencoders.
- Digest
  This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task.
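The random masking step described above can be sketched as follows. This is a minimal illustration of choosing which patches are hidden, not the authors' implementation; the function name, the seed argument, and the 14 x 14 patch grid are assumptions made for the example.

```python
import random

def random_patch_mask(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into a visible set (encoder input) and a masked set.

    mask_ratio=0.75 follows the proportion quoted in the digest; the seed is
    only there to make this sketch reproducible.
    """
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked

# Assumed setting: an image cut into a 14 x 14 grid, i.e. 196 patches.
visible, masked = random_patch_mask(196)
print(len(visible), "visible patches go to the encoder")
print(len(masked), "masked patches are reconstructed by the decoder")
```

In this sketch the encoder would receive only the patches indexed by `visible`, and the decoder would be asked to reconstruct the patches indexed by `masked`.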
Learning in High Dimension Always Amounts to Extrapolation. [paper]
- Randall Balestriero, Jerome Pesenti, Yann LeCun.
- Key Word: Interpolation and Extrapolation.
- Digest
  The notion of interpolation and extrapolation is fundamental in various fields from deep learning to function approximation. Interpolation occurs for a sample x whenever this sample falls inside or on the boundary of the given dataset's convex hull. Extrapolation occurs when x falls outside of that convex hull. One fundamental (mis)conception is that state-of-the-art algorithms work so well because of their ability to correctly interpolate training data. A second (mis)conception is that interpolation happens throughout tasks and datasets; in fact, many intuitions and theories rely on that assumption. We empirically and theoretically argue against those two points and demonstrate that on any high-dimensional (>100) dataset, interpolation almost surely never happens.
Understanding Dataset Difficulty with V-Usable Information. [paper] [code]
- Kawin Ethayarajh, Yejin Choi, Swabha Swayamdipta. ICML 2022
- Key Word: Dataset Difficulty Measures; Information Theory.
- Digest
  Estimating the difficulty of a dataset typically involves comparing state-of-the-art models to humans; the bigger the performance gap, the harder the dataset is said to be. However, this comparison provides little understanding of how difficult each instance in a given distribution is, or what attributes make the dataset difficult for a given model. To address these questions, we frame dataset difficulty -- w.r.t. a model V -- as the lack of V-usable information (Xu et al., 2019), where a lower value indicates a more difficult dataset for V. We further introduce pointwise V-information (PVI) for measuring the difficulty of individual instances w.r.t. a given distribution.
Exploring the Limits of Large Scale Pre-training. [paper]
- Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, Hanie Sedghi. ICLR 2022
- Key Word: Pre-training.
- Digest
  We investigate more than 4800 experiments on Vision Transformers, MLP-Mixers and ResNets with number of parameters ranging from ten million to ten billion, trained on the largest scale of available image data (JFT, ImageNet21K) and evaluated on more than 20 downstream image recognition tasks. We propose a model for downstream performance that reflects the saturation phenomena and captures the nonlinear relationship in performance of upstream and downstream tasks.
Stochastic Training is Not Necessary for Generalization. [paper] [code]
- Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein. ICLR 2022
- Key Word: Stochastic Gradient Descent; Regularization.
- Digest
  It is widely believed that the implicit regularization of SGD is fundamental to the impressive generalization behavior we observe in neural networks. In this work, we demonstrate that non-stochastic full-batch training can achieve comparably strong performance to SGD on CIFAR-10 using modern architectures. To this end, we show that the implicit regularization of SGD can be completely replaced with explicit regularization even when comparing against a strong and well-researched baseline.
Pointer Value Retrieval: A new benchmark for understanding the limits of neural network generalization. [paper]
- Chiyuan Zhang, Maithra Raghu, Jon Kleinberg, Samy Bengio.
- Key Word: Out-of-Distribution Generalization.
- Digest
  In this paper we introduce a novel benchmark, Pointer Value Retrieval (PVR) tasks, that explore the limits of neural network generalization. We demonstrate that this task structure provides a rich testbed for understanding generalization, with our empirical study showing large variations in neural network performance based on dataset size, task complexity and model architecture.
What can linear interpolation of neural network loss landscapes tell us? [paper]
- Tiffany Vlaar, Jonathan Frankle. ICML 2022
- Key Word: Linear Interpolation; Loss Landscapes.
- Digest
  We put inferences of this kind to the test, systematically evaluating how linear interpolation and final performance vary when altering the data, choice of initialization, and other optimizer and architecture design choices. Further, we use linear interpolation to study the role played by individual layers and substructures of the network. We find that certain layers are more sensitive to the choice of initialization, but that the shape of the linear path is not indicative of the changes in test accuracy of the model.
Can Vision Transformers Learn without Natural Images? [paper] [code]
- Kodai Nakashima, Hirokatsu Kataoka, Asato Matsumoto, Kenji Iwata, Nakamasa Inoue. AAAI 2022
- Key Word: Formula-driven Supervised Learning; Vision Transformer.
- Digest
  We pre-train ViT without any image collections and annotation labor. We experimentally verify that our proposed framework partially outperforms sophisticated Self-Supervised Learning (SSL) methods like SimCLRv2 and MoCov2 without using any natural images in the pre-training phase. Moreover, although the ViT pre-trained without natural images produces some different visualizations from ImageNet pre-trained ViT, it can interpret natural image datasets to a large extent.
Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability. [paper] [code]
- Jeremy M. Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, Ameet Talwalkar. ICLR 2021
- Key Word: Edge of Stability.
- Digest
  We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value 2/(step size), and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to whether these presumptions are relevant to neural network training.
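To see why 2/(step size) is the natural threshold, consider gradient descent on a one-dimensional quadratic, where the Hessian is a single number (the sharpness). The sketch below is a toy illustration of that classical stability bound, not a reproduction of the paper's experiments; the function name and the chosen values are assumptions.

```python
def run_gd(sharpness, step_size, w0=1.0, steps=50):
    """Gradient descent on f(w) = 0.5 * sharpness * w**2; returns the final |w|."""
    w = w0
    for _ in range(steps):
        grad = sharpness * w          # derivative of 0.5 * sharpness * w**2
        w -= step_size * grad         # equivalent to w *= (1 - step_size * sharpness)
    return abs(w)

step_size = 0.1
threshold = 2.0 / step_size           # stability threshold for this step size

below = run_gd(0.9 * threshold, step_size)   # sharpness below 2 / step_size
above = run_gd(1.1 * threshold, step_size)   # sharpness above 2 / step_size

print("sharpness below threshold, final |w|:", below)   # shrinks toward 0
print("sharpness above threshold, final |w|:", above)   # grows without bound
```

On a quadratic, each step multiplies `w` by `1 - step_size * sharpness`, so the iterates contract only when the sharpness is below `2 / step_size`. The digest's observation is that real networks sit just above this value without diverging, which the quadratic picture alone does not predict.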
Pre-training without Natural Images. [paper] [code]
- Hirokatsu Kataoka, Kazushige Okayasu, Asato Matsumoto, Eisuke Yamagata, Ryosuke Yamada, Nakamasa Inoue, Akio Nakamura, Yutaka Satoh. ACCV 2020
- Key Word: Formula-driven Supervised Learning.
- Digest
  The paper proposes a novel concept, Formula-driven Supervised Learning. We automatically generate image patterns and their category labels by assigning fractals, which are based on a natural law existing in the background knowledge of the real world. Theoretically, the use of automatically generated images instead of natural images in the pre-training phase allows us to generate an infinite scale dataset of labeled images. Although the models pre-trained with the proposed Fractal DataBase (FractalDB), a database without natural images, do not necessarily outperform models pre-trained with human annotated datasets at all settings, we are able to partially surpass the accuracy of ImageNet/Places pre-trained models.
On the Role of Optimization in Double Descent: A Least Squares Study. [paper]
- Ilja Kuzborskij, Csaba Szepesvári, Omar Rivasplata, Amal Rannen-Triki, Razvan Pascanu.
- Key Word: Double Descent; Generalization.
- Digest
  Empirically it has been observed that the performance of deep neural networks steadily improves as we increase model size, contradicting the classical view on overfitting and generalization. The double descent phenomenon has been proposed to reconcile this observation with theory, suggesting that the test error has a second descent when the model becomes sufficiently overparametrized, as the model size itself acts as an implicit regularizer. This paper provides a careful study of learning dynamics as a function of model size for the least squares scenario.

Empirical Study: 2020

When Do Curricula Work? [paper] [code]
- Xiaoxia Wu, Ethan Dyer, Behnam Neyshabur. ICLR 2021
- Key Word: Curriculum Learning.
- Digest
  We set out to investigate the relative benefits of ordered learning. We first investigate the implicit curricula resulting from architectural and optimization bias and find that samples are learned in a highly consistent order. Next, to quantify the benefit of explicit curricula, we conduct extensive experiments over thousands of orderings spanning three kinds of learning: curriculum, anti-curriculum, and random-curriculum -- in which the size of the training dataset is dynamically increased over time, but the examples are randomly ordered.
In Search of Robust Measures of Generalization. [paper] [code]
- Gintare Karolina Dziugaite, Alexandre Drouin, Brady Neal, Nitarshan Rajkumar, Ethan Caballero, Linbo Wang, Ioannis Mitliagkas, Daniel M. Roy. NeurIPS 2020
- Key Word: Generalization Measures.
- Digest
  One of the principal scientific challenges in deep learning is explaining generalization, i.e., why the particular way the community now trains networks to achieve small training error also leads to small error on held-out data from the same population. It is widely appreciated that some worst-case theories -- such as those based on the VC dimension of the class of predictors induced by modern neural network architectures -- are unable to explain empirical performance. A large volume of work aims to close this gap, primarily by developing bounds on generalization error, optimization error, and excess risk. When evaluated empirically, however, most of these bounds are numerically vacuous. Focusing on generalization bounds, this work addresses the question of how to evaluate such bounds empirically.
The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers. [paper] [code]
- Preetum Nakkiran, Behnam Neyshabur, Hanie Sedghi. ICLR 2021
- Key Word: Online Learning; Finite-Sample Deviations.
- Digest
  We propose a new framework for reasoning about generalization in deep learning. The core idea is to couple the Real World, where optimizers take stochastic gradient steps on the empirical loss, to an Ideal World, where optimizers take steps on the population loss. This leads to an alternate decomposition of test error into: (1) the Ideal World test error plus (2) the gap between the two worlds. If the gap (2) is universally small, this reduces the problem of generalization in offline learning to the problem of optimization in online learning.
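The decomposition described above can be written out explicitly. The symbols below are notation assumed for this sketch: f_t^real is the model after t stochastic gradient steps on the empirical loss (Real World) and f_t^ideal is the model after t steps on the population loss (Ideal World).

```latex
\mathrm{TestError}\bigl(f_t^{\mathrm{real}}\bigr)
  = \underbrace{\mathrm{TestError}\bigl(f_t^{\mathrm{ideal}}\bigr)}_{\text{(1) Ideal World test error}}
  + \underbrace{\Bigl[\mathrm{TestError}\bigl(f_t^{\mathrm{real}}\bigr)
      - \mathrm{TestError}\bigl(f_t^{\mathrm{ideal}}\bigr)\Bigr]}_{\text{(2) gap between the two worlds}}
```

The identity holds trivially; the content of the framework is the empirical claim that term (2) is small, which leaves term (1), an online optimization quantity, as the main thing to explain.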
Characterising Bias in Compressed Models. [paper]
- Sara Hooker, Nyalleng Moorosi, Gregory Clark, Samy Bengio, Emily Denton.
- Key Word: Pruning; Fairness.
- Digest
  The popularity and widespread use of pruning and quantization is driven by the severe resource constraints of deploying deep neural networks to environments with strict latency, memory and energy requirements. These techniques achieve high levels of compression with negligible impact on top-line metrics (top-1 and top-5 accuracy). However, overall accuracy hides disproportionately high errors on a small subset of examples; we call this subset Compression Identified Exemplars (CIE).
Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics. [paper] [code]
- Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, Yejin Choi. EMNLP 2020
- Key Word: Training Dynamics; Data Map; Curriculum Learning.
- Digest
  Large datasets have become commonplace in NLP research. However, the increased emphasis on data quantity has made it challenging to assess the quality of data. We introduce Data Maps, a model-based tool to characterize and diagnose datasets. We leverage a largely ignored source of information: the behavior of the model on individual instances during training (training dynamics) for building data maps.
What is being transferred in transfer learning? [paper] [code]
- Behnam Neyshabur, Hanie Sedghi, Chiyuan Zhang. NeurIPS 2020
- Key Word: Transfer Learning.
- Digest
  We provide new tools and analyses to address these fundamental questions. Through a series of analyses on transferring to block-shuffled images, we separate the effect of feature reuse from learning low-level statistics of data and show that some benefit of transfer learning comes from the latter. We show that when training from pre-trained weights, the model stays in the same basin in the loss landscape and different instances of such model are similar in feature space and close in parameter space.
Deep Isometric Learning for Visual Recognition. [paper] [code]
- Haozhi Qi, Chong You, Xiaolong Wang, Yi Ma, Jitendra Malik. ICML 2020
- Key Word: Isometric Networks.
- Digest
  This paper shows that deep vanilla ConvNets without normalization or skip connections can also be trained to achieve surprisingly good performance on standard image recognition benchmarks. This is achieved by enforcing the convolution kernels to be near isometric during initialization and training, as well as by using a variant of ReLU that is shifted towards being isometric.
On the Generalization Benefit of Noise in Stochastic Gradient Descent. [paper]
- Samuel L. Smith, Erich Elsen, Soham De. ICML 2020
- Key Word: Stochastic Gradient Descent.
- Digest
  In this paper, we perform carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set. This occurs even when both models are trained for the same number of iterations and large batches achieve smaller training losses.
Do CNNs Encode Data Augmentations? [paper]
- Eddie Yan, Yanping Huang.
- Key Word: Data Augmentations.
- Digest
  Surprisingly, neural network features not only predict data augmentation transformations, but they predict many transformations with high accuracy. After validating that neural networks encode features corresponding to augmentation transformations, we show that these features are primarily encoded in the early layers of modern CNNs.
Do We Need Zero Training Loss After Achieving Zero Training Error? [paper] [code]
- Takashi Ishida, Ikko Yamane, Tomoya Sakai, Gang Niu, Masashi Sugiyama. ICML 2020
- Key Word: Regularization.
- Digest
  Our approach makes the loss float around the flooding level by doing mini-batched gradient descent as usual but gradient ascent if the training loss is below the flooding level. This can be implemented with one line of code, and is compatible with any stochastic optimizer and other regularizers. We experimentally show that flooding improves performance and as a byproduct, induces a double descent curve of the test loss.
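The "one line of code" in the digest above can be sketched as follows. This is a minimal, framework-free illustration: the names `flooded_loss`, `flooded_grad_sign`, and the flood level `b` are illustrative and are not taken from the authors' released code. The flooded objective is |loss - b| + b, which keeps the ordinary gradient while the loss is above `b` and flips its sign once the loss drops below `b`.

```python
def flooded_loss(loss: float, b: float) -> float:
    """Return the flooded objective |loss - b| + b for flood level b."""
    return abs(loss - b) + b


def flooded_grad_sign(loss: float, b: float) -> int:
    """Factor multiplying the ordinary gradient under flooding.

    +1 means plain gradient descent, -1 means gradient ascent.
    """
    if loss > b:
        return 1
    if loss < b:
        return -1
    # Exactly at the flood level |.| is not differentiable; 0 is one valid subgradient.
    return 0
```

In an autodiff framework the same effect would typically come from applying the absolute-value expression to the mini-batch loss before backpropagation; the sign helper here only makes the descent/ascent switch explicit.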
Understanding Why Neural Networks Generalize Well Through GSNR of Parameters. [paper]
- Jinlong Liu, Guoqing Jiang, Yunzhi Bai, Ting Chen, Huayan Wang. ICLR 2020
- Key Word: Generalization Indicators.
- Digest
  In this paper, we provide a novel perspective on these issues using the gradient signal to noise ratio (GSNR) of parameters during training process of DNNs. The GSNR of a parameter is defined as the ratio between its gradient's squared mean and variance, over the data distribution.
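The GSNR definition above can be illustrated with a small empirical estimate. This is a sketch under assumptions: the function name `gsnr` is hypothetical, the per-sample gradients of a single parameter are assumed to be already collected in a list, and the finite sample stands in for the data distribution.

```python
from statistics import fmean, pvariance


def gsnr(per_sample_grads):
    """Empirical GSNR of one parameter: squared mean of its gradient over variance.

    per_sample_grads holds the gradient of the loss with respect to this
    parameter, evaluated on each sample separately.
    """
    mean = fmean(per_sample_grads)
    var = pvariance(per_sample_grads)  # population variance over the given samples
    if var == 0.0:
        # All samples agree: the ratio is unbounded unless the gradient is zero too.
        return float("inf") if mean != 0.0 else float("nan")
    return mean * mean / var
```

For example, gradients that agree in sign across samples give a large ratio, while gradients that cancel out (mean near zero) give a ratio near zero.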

Empirical Study: 2019

Angular Visual Hardness. [paper]
- Beidi Chen, Weiyang Liu, Zhiding Yu, Jan Kautz, Anshumali Shrivastava, Animesh Garg, Anima Anandkumar. ICML 2020
- Key Word: Calibration; Example Hardness Measures.
- Digest
  We propose a novel measure for CNN models known as Angular Visual Hardness. Our comprehensive empirical studies show that AVH can serve as an indicator of generalization abilities of neural networks, and improving SOTA accuracy entails improving accuracy on hard examples.
Fantastic Generalization Measures and Where to Find Them. [paper] [code]
- Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, Samy Bengio. ICLR 2020
- Key Word: Complexity Measures; Spurious Correlations.
- Digest
  We present the first large scale study of generalization in deep networks. We investigate more than 40 complexity measures taken from both theoretical bounds and empirical studies. We train over 10,000 convolutional networks by systematically varying commonly used hyperparameters. Hoping to uncover potentially causal relationships between each measure and generalization, we analyze carefully controlled experiments and show surprising failures of some measures as well as promising measures for further research.
Truth or Backpropaganda? An Empirical Investigation of Deep Learning Theory. [paper] [code]
- Micah Goldblum, Jonas Geiping, Avi Schwarzschild, Michael Moeller, Tom Goldstein. ICLR 2020
- Key Word: Local Minima.
- Digest
  The authors take a closer look at widely held beliefs about neural networks. Using a mix of analysis and experiment, they shed some light on the ways these assumptions break down.
Rapid Learning or Feature Reuse? Towards Understanding the Effectiveness of MAML. [paper] [code]
- Aniruddh Raghu, Maithra Raghu, Samy Bengio, Oriol Vinyals. ICLR 2020
- Key Word: Meta Learning.
- Digest
  Despite MAML's popularity, a fundamental open question remains -- is the effectiveness of MAML due to the meta-initialization being primed for rapid learning (large, efficient changes in the representations) or due to feature reuse, with the meta initialization already containing high quality features? We investigate this question, via ablation studies and analysis of the latent representations, finding that feature reuse is the dominant factor.
Finding the Needle in the Haystack with Convolutions: on the benefits of architectural bias. [paper] [code]
- Stéphane d'Ascoli, Levent Sagun, Joan Bruna, Giulio Biroli. NeurIPS 2019
- Key Word: Architectural Bias.
- Digest
  In particular, Convolutional Neural Networks (CNNs) are known to perform much better than Fully-Connected Networks (FCNs) on spatially structured data: the architectural structure of CNNs benefits from prior knowledge on the features of the data, for instance their translation invariance. The aim of this work is to understand this fact through the lens of dynamics in the loss landscape.
Adversarial Training Can Hurt Generalization. [paper]
- Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John C. Duchi, Percy Liang.
- Key Word: Adversarial Examples.
- Digest
  While adversarial training can improve robust accuracy (against an adversary), it sometimes hurts standard accuracy (when there is no adversary). Previous work has studied this tradeoff between standard and robust accuracy, but only in the setting where no predictor performs well on both objectives in the infinite data limit. In this paper, we show that even when the optimal predictor with infinite data performs well on both objectives, a tradeoff can still manifest itself with finite data.
Bad Global Minima Exist and SGD Can Reach Them. [paper] [code]
- Shengchao Liu, Dimitris Papailiopoulos, Dimitris Achlioptas. NeurIPS 2020
- Key Word: Stochastic Gradient Descent.
- Digest
  Several works have aimed to explain why overparameterized neural networks generalize well when trained by Stochastic Gradient Descent (SGD). The consensus explanation that has emerged credits the randomized nature of SGD for the bias of the training process towards low-complexity models and, thus, for implicit regularization. We take a careful look at this explanation in the context of image classification with common deep neural network architectures. We find that if we do not regularize explicitly, then SGD can be easily made to converge to poorly-generalizing, high-complexity models: all it takes is to first train on a random labeling on the data, before switching to properly training with the correct labels.
Deep ReLU Networks Have Surprisingly Few Activation Patterns. [paper]
- Boris Hanin, David Rolnick. NeurIPS 2019
- Digest
  In this paper, we show that the average number of activation patterns for ReLU networks at initialization is bounded by the total number of neurons raised to the input dimension. We show empirically that this bound, which is independent of the depth, is tight both at initialization and during training, even on memorization tasks that should maximize the number of activation patterns.
Sensitivity of Deep Convolutional Networks to Gabor Noise. [paper] [code]
- Kenneth T. Co, Luis Muñoz-González, Emil C. Lupu.
- Key Word: Robustness.
- Digest
  Deep Convolutional Networks (DCNs) have been shown to be sensitive to Universal Adversarial Perturbations (UAPs): input-agnostic perturbations that fool a model on large portions of a dataset. These UAPs exhibit interesting visual patterns, but this phenomenon is, as yet, poorly understood. Our work shows that visually similar procedural noise patterns also act as UAPs. In particular, we demonstrate that different DCN architectures are sensitive to Gabor noise patterns. This behaviour, its causes, and implications deserve further in-depth study.
Rethinking the Usage of Batch Normalization and Dropout in the Training of Deep Neural Networks. [paper]
- Guangyong Chen, Pengfei Chen, Yujun Shi, Chang-Yu Hsieh, Benben Liao, Shengyu Zhang.
- Key Word: Batch Normalization; Dropout.
- Digest
  Our work is based on an excellent idea that whitening the inputs of neural networks can achieve a fast convergence speed. Given the well-known fact that independent components must be whitened, we introduce a novel Independent-Component (IC) layer before each weight layer, whose inputs would be made more independent.
A critical analysis of self-supervision, or what we can learn from a single image. [paper] [code]
- Yuki M. Asano, Christian Rupprecht, Andrea Vedaldi. ICLR 2020
- Key Word: Self-Supervision.
- Digest
  We show that three different and representative methods, BiGAN, RotNet and DeepCluster, can learn the first few layers of a convolutional network from a single image as well as using millions of images and manual labels, provided that strong data augmentation is used. However, for deeper layers the gap with manual supervision cannot be closed even if millions of unlabelled images are used for training.
Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet. [paper] [code]
- Wieland Brendel, Matthias Bethge. ICLR 2019
- Key Word: Bag-of-Features.
- Digest
  Our model, a simple variant of the ResNet-50 architecture called BagNet, classifies an image based on the occurrences of small local image features without taking into account their spatial ordering. This strategy is closely related to the bag-of-feature (BoF) models popular before the onset of deep learning and reaches a surprisingly high accuracy on ImageNet.
Transfusion: Understanding Transfer Learning for Medical Imaging. [paper] [code]
- Maithra Raghu, Chiyuan Zhang, Jon Kleinberg, Samy Bengio. NeurIPS 2019
- Key Word: Transfer Learning; Medical Imaging.
- Digest
  We explore properties of transfer learning for medical imaging. A performance evaluation on two large scale medical imaging tasks shows that, surprisingly, transfer offers little benefit to performance, and simple, lightweight models can perform comparably to ImageNet architectures.
Identity Crisis: Memorization and Generalization under Extreme Overparameterization. [paper]
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Michael C. Mozer, Yoram Singer. ICLR 2020
- Key Word: Memorization.
- Digest
  We study the interplay between memorization and generalization of overparameterized networks in the extreme case of a single training example and an identity-mapping task.
Are All Layers Created Equal? [paper]
- Chiyuan Zhang, Samy Bengio, Yoram Singer. JMLR
- Key Word: Robustness.
- Digest
  We show that the layers can be categorized as either "ambient" or "critical". Resetting the ambient layers to their initial values has no negative consequence, and in many cases they barely change throughout training. On the contrary, resetting the critical layers completely destroys the predictor and the performance drops to chance.
Reconciling modern machine-learning practice and the classical bias–variance trade-off. [paper]
- Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal.
- Key Word: Double Descent; Generalization.
- Digest

Empirical Study: 2018

Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. [paper] [code]
- Matthias Hein, Maksym Andriushchenko, Julian Bitterwolf. CVPR 2019
- Key Word: ReLU.
- Digest
  Classifiers used in the wild, in particular for safety-critical systems, should not only have good generalization properties but also should know when they don't know, in particular make low confidence predictions far away from the training data. We show that ReLU type neural networks which yield a piecewise linear classifier function fail in this regard as they produce almost always high confidence predictions far away from the training data.
An Empirical Study of Example Forgetting during Deep Neural Network Learning. [paper] [code]
- Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, Geoffrey J. Gordon. ICLR 2019
- Key Word: Curriculum Learning; Sample Weighting; Example Forgetting.
- Digest
  We define a 'forgetting event' to have occurred when an individual training example transitions from being classified correctly to incorrectly over the course of learning. Across several benchmark data sets, we find that: (i) certain examples are forgotten with high frequency, and some not at all; (ii) a data set's (un)forgettable examples generalize across neural architectures; and (iii) based on forgetting dynamics, a significant fraction of examples can be omitted from the training data set while still maintaining state-of-the-art generalization performance.
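The definition of a forgetting event above translates directly into a counter over an example's classification history. This is a minimal sketch: the function name `count_forgetting_events` and the input format (one boolean per evaluation of the example during training) are assumptions made for illustration.

```python
def count_forgetting_events(correct_history):
    """Count transitions from 'classified correctly' to 'classified incorrectly'.

    correct_history is a sequence of booleans, one entry per time the example
    was evaluated over the course of training, in chronological order.
    """
    events = 0
    for prev, curr in zip(correct_history, correct_history[1:]):
        if prev and not curr:
            events += 1
    return events
```

Ranking training examples by this count is one way to separate frequently forgotten examples from those that are never forgotten.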
On Implicit Filter Level Sparsity in Convolutional Neural Networks. [paper]
- Dushyant Mehta, Kwang In Kim, Christian Theobalt. CVPR 2019
- Key Word: Regularization; Sparsification.
- Digest
  We investigate filter level sparsity that emerges in convolutional neural networks (CNNs) which employ Batch Normalization and ReLU activation, and are trained with adaptive gradient descent techniques and L2 regularization or weight decay. We conduct an extensive experimental study casting our initial findings into hypotheses and conclusions about the mechanisms underlying the emergent filter level sparsity. This study allows new insight into the performance gap observed between adaptive and non-adaptive gradient descent methods in practice.
Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. [paper] [code]
- Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, Olivier Bachem. ICML 2019
- Key Word: Disentanglement.
- Digest
  Our results suggest that future work on disentanglement learning should be explicit about the role of inductive biases and (implicit) supervision, investigate concrete benefits of enforcing disentanglement of the learned representations, and consider a reproducible experimental setup covering several data sets.
Insights on representational similarity in neural networks with canonical correlation. [paper] [code]
- Ari S. Morcos, Maithra Raghu, Samy Bengio. NeurIPS 2018
- Key Word: Representational Similarity.
- Digest
  Comparing representations in neural networks is fundamentally difficult as the structure of representations varies greatly, even across groups of networks trained on identical tasks, and over the course of training. Here, we develop projection weighted CCA (Canonical Correlation Analysis) as a tool for understanding neural networks, building off of SVCCA.
Layer rotation: a surprisingly powerful indicator of generalization in deep networks? [paper] [code]
- Simon Carbonnelle, Christophe De Vleeschouwer.
- Key Word: Weight Evolution.
- Digest
  Our work presents extensive empirical evidence that layer rotation, i.e. the evolution across training of the cosine distance between each layer's weight vector and its initialization, constitutes an impressively consistent indicator of generalization performance. In particular, larger cosine distances between final and initial weights of each layer consistently translate into better generalization performance of the final model.
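The layer rotation quantity described above, the cosine distance between a layer's current weights and its initialization, can be sketched as below. The function name `cosine_distance` is illustrative, and each layer's weights are assumed to be flattened into a plain list of floats.

```python
import math


def cosine_distance(w_init, w_now):
    """Return 1 - cos(angle) between a layer's initial and current weight vectors."""
    dot = sum(a * b for a, b in zip(w_init, w_now))
    norm_init = math.sqrt(sum(a * a for a in w_init))
    norm_now = math.sqrt(sum(b * b for b in w_now))
    if norm_init == 0.0 or norm_now == 0.0:
        raise ValueError("cosine distance is undefined for a zero weight vector")
    return 1.0 - dot / (norm_init * norm_now)
```

Tracking this value per layer across training steps gives the rotation curves the digest refers to; a value of 0 means the layer only changed in scale, and 1 means it became orthogonal to its initialization.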
Sensitivity and Generalization in Neural Networks: an Empirical Study. [paper]
- Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein. ICLR 2018
- Key Word: Sensitivity.
- Digest
  In this work, we investigate this tension between complexity and generalization through an extensive empirical exploration of two natural metrics of complexity related to sensitivity to input perturbations. We find that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that it correlates well with generalization.
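As a rough illustration of the sensitivity metric mentioned above, the norm of the input-output Jacobian can be estimated numerically. This sketch uses central finite differences and the Frobenius norm on a generic function `f` mapping a list of floats to a list of floats; the name `jacobian_frobenius_norm` and the step size `eps` are assumptions, and a real study would more likely obtain the Jacobian from an autodiff framework.

```python
import math


def jacobian_frobenius_norm(f, x, eps=1e-6):
    """Estimate the Frobenius norm of df/dx at x with central differences."""
    total = 0.0
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        # Column i of the Jacobian: change of every output with respect to input i.
        for y_p, y_m in zip(f(x_plus), f(x_minus)):
            total += ((y_p - y_m) / (2.0 * eps)) ** 2
    return math.sqrt(total)


def example_linear_map(x):
    """Toy 'network' with Jacobian [[2, 1], [0, 3]], used only for checking."""
    return [2.0 * x[0] + x[1], 3.0 * x[1]]
```

Evaluating the norm at training points and at points away from them is the kind of comparison the digest describes.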

Empirical Study: 2017

Deep Image Prior. [paper] [code]
- Dmitry Ulyanov, Andrea Vedaldi, Victor Lempitsky.
- Key Word: Low-Level Vision.
- Digest
  In this paper, we show that, on the contrary, the structure of a generator network is sufficient to capture a great deal of low-level image statistics prior to any learning. In order to do so, we show that a randomly-initialized neural network can be used as a handcrafted prior with excellent results in standard inverse problems such as denoising, super-resolution, and inpainting.
Critical Learning Periods in Deep Neural Networks. [paper]
- Alessandro Achille, Matteo Rovere, Stefano Soatto. ICLR 2019
- Key Word: Memorization.
- Digest
  Our findings indicate that the early transient is critical in determining the final solution of the optimization associated with training an artificial neural network. In particular, the effects of sensory deficits during a critical period cannot be overcome, no matter how much additional training is performed.
A Closer Look at Memorization in Deep Networks. [paper]
- Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, Simon Lacoste-Julien. ICML 2017
- Key Word: Memorization.
- Digest
  In our experiments, we expose qualitative differences in gradient-based optimization of deep neural networks (DNNs) on noise vs. real data. We also demonstrate that for appropriately tuned explicit regularization (e.g., dropout) we can degrade DNN training performance on noise datasets without compromising generalization on real data.

Empirical Study: 2016

Understanding deep learning requires rethinking generalization. [paper]
- Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, Oriol Vinyals. ICLR 2017
- Key Word: Memorization.
- Digest
  Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data.

Neural Collapse

Neural Collapse 2022

Perturbation Analysis of Neural Collapse. [paper]
- Tom Tirer, Haoxiang Huang, Jonathan Niles-Weed.
- Key Word: Neural Collapse.
- Digest
  We propose a richer model that can capture this phenomenon by forcing the features to stay in the vicinity of a predefined features matrix (e.g., intermediate features). We explore the model in the small vicinity case via perturbation analysis and establish results that cannot be obtained by the previously studied models.
Grokking phase transitions in learning local rules with gradient descent. [paper]
- Bojan Žunkovič, Enej Ilievski.
- Key Word: Tensor Network; Grokking; Many-Body Quantum Mechanics; Neural Collapse.
- Digest
  We discuss two solvable grokking (generalisation beyond overfitting) models in a rule learning scenario. We show that grokking is a phase transition and find exact analytic expressions for the critical exponents, grokking probability, and grokking time distribution. Further, we introduce a tensor-network map that connects the proposed grokking setup with the standard (perceptron) statistical learning theory and show that grokking is a consequence of the locality of the teacher model. As an example, we analyse the cellular automata learning task, numerically determine the critical exponent and the grokking time distributions and compare them with the prediction of the proposed grokking model. Finally, we numerically analyse the connection between structure formation and grokking.
Imbalance Trouble: Revisiting Neural-Collapse Geometry. [paper]
- Christos Thrampoulidis, Ganesh R. Kini, Vala Vakilian, Tina Behnia.
- Key Word: Neural Collapse; Class Imbalance.
- Digest
  Neural Collapse refers to the remarkable structural properties characterizing the geometry of class embeddings and classifier weights, found by deep nets when trained beyond zero training error. However, this characterization only holds for balanced data. Here we thus ask whether it can be made invariant to class imbalances. Towards this end, we adopt the unconstrained-features model (UFM), a recent theoretical model for studying neural collapse, and introduce Simplex-Encoded-Labels Interpolation (SELI) as an invariant characterization of the neural collapse phenomenon.
Neural Collapse: A Review on Modelling Principles and Generalization. [paper]
- Vignesh Kothapalli, Ebrahim Rasromani, Vasudev Awatramani.
- Key Word: Neural Collapse.
- Digest
  We analyse the principles which aid in modelling such a phenomenon from the ground up and show how they can build a common understanding of the recently proposed models that try to explain NC. We hope that our analysis presents a multifaceted perspective on modelling NC and aids in forming connections with the generalization capabilities of neural networks. Finally, we conclude by discussing the avenues for further research and propose potential research problems.
Do We Really Need a Learnable Classifier at the End of Deep Neural Network? [paper]
- Yibo Yang, Liang Xie, Shixiang Chen, Xiangtai Li, Zhouchen Lin, Dacheng Tao.
- Key Word: Neural Collapse.
- Digest
  We study the potential of training a network with the last-layer linear classifier randomly initialized as a simplex ETF and fixed during training. This practice enjoys theoretical merits under the layer-peeled analytical framework. We further develop a simple loss function specifically for the ETF classifier. Its advantage gets verified by both theoretical and experimental results.
Limitations of Neural Collapse for Understanding Generalization in Deep Learning. [paper]
- Like Hui, Mikhail Belkin, Preetum Nakkiran.
- Key Word: Neural Collapse.
- Digest
  We point out that Neural Collapse is primarily an optimization phenomenon, not a generalization one, by investigating the train collapse and test collapse on various dataset and architecture combinations. We propose more precise definitions — "strong" and "weak" Neural Collapse for both the train set and the test set — and discuss their theoretical feasibility.

Neural Collapse 2021

On the Role of Neural Collapse in Transfer Learning. [paper]
- Tomer Galanti, András György, Marcus Hutter. ICLR 2022
- Key Word: Neural Collapse; Transfer Learning.
- Digest
  We provide an explanation for this behavior based on the recently observed phenomenon that the features learned by overparameterized classification networks show an interesting clustering property, called neural collapse.
An Unconstrained Layer-Peeled Perspective on Neural Collapse. [paper]
- Wenlong Ji, Yiping Lu, Yiliang Zhang, Zhun Deng, Weijie J. Su. ICLR 2022
- Key Word: Neural Collapse; Unconstrained Model; Implicit Regularization.
- Digest
  We introduce a surrogate model called the unconstrained layer-peeled model (ULPM). We prove that gradient flow on this model converges to critical points of a minimum-norm separation problem exhibiting neural collapse in its global minimizer. Moreover, we show that the ULPM with the cross-entropy loss has a benign global landscape for its loss function, which allows us to prove that all the critical points are strict saddle points except the global minimizers that exhibit the neural collapse phenomenon.
Neural Collapse Under MSE Loss: Proximity to and Dynamics on the Central Path. [paper]
- X.Y. Han, Vardan Papyan, David L. Donoho. ICLR 2022
- Key Word: Neural Collapse; Gradient Flow.
- Digest
  The analytically-tractable MSE loss offers more mathematical opportunities than the hard-to-analyze CE loss, inspiring us to leverage MSE loss towards the theoretical investigation of NC. We develop three main contributions: (I) We show a new decomposition of the MSE loss into (A) terms directly interpretable through the lens of NC and which assume the last-layer classifier is exactly the least-squares classifier; and (B) a term capturing the deviation from this least-squares classifier. (II) We exhibit experiments on canonical datasets and networks demonstrating that term-(B) is negligible during training. This motivates us to introduce a new theoretical construct: the central path, where the linear classifier stays MSE-optimal for feature activations throughout the dynamics. (III) By studying renormalized gradient flow along the central path, we derive exact dynamics that predict NC.
A Geometric Analysis of Neural Collapse with Unconstrained Features. [paper] [code]
- Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, Qing Qu. NeurIPS 2021
- Key Word: Neural Collapse; Nonconvex Optimization.
- Digest
  We provide the first global optimization landscape analysis of Neural Collapse -- an intriguing empirical phenomenon that arises in the last-layer classifiers and features of neural networks during the terminal phase of training. As recently reported by Papyan et al., this phenomenon implies that (i) the class means and the last-layer classifiers all collapse to the vertices of a Simplex Equiangular Tight Frame (ETF) up to scaling, and (ii) cross-example within-class variability of last-layer activations collapses to zero. We study the problem based on a simplified unconstrained feature model, which isolates the topmost layers from the classifier of the neural network.
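To make the Simplex Equiangular Tight Frame in the digest above concrete, the sketch below builds one standard K x K representative, sqrt(K/(K-1)) * (I - (1/K) * 1 1^T), and computes its Gram matrix. The function names are illustrative; the general definition also allows a rotation and a rescaling, which this sketch omits.

```python
import math


def simplex_etf(k):
    """Return a k x k matrix whose columns are the vertices of a simplex ETF."""
    if k < 2:
        raise ValueError("a simplex ETF needs at least two classes")
    scale = math.sqrt(k / (k - 1))
    return [
        [scale * ((1.0 if i == j else 0.0) - 1.0 / k) for j in range(k)]
        for i in range(k)
    ]


def gram(m):
    """Return the matrix of inner products between the columns of m."""
    k = len(m)
    return [
        [sum(m[r][i] * m[r][j] for r in range(k)) for j in range(k)]
        for i in range(k)
    ]
```

Every column has unit norm and every pair of distinct columns has inner product -1/(K-1), which is the equal-norm, maximally separated arrangement that class means and classifier rows are reported to approach.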
Exploring Deep Neural Networks via Layer-Peeled Model: Minority Collapse in Imbalanced Training. [paper] [code]
- Cong Fang, Hangfeng He, Qi Long, Weijie J. Su. PNAS
- Key Word: Neural Collapse; Imbalanced Training.
- Digest
  In this paper, we introduce the Layer-Peeled Model, a nonconvex yet analytically tractable optimization program, in a quest to better understand deep neural networks that are trained for a sufficiently long time. As the name suggests, this new model is derived by isolating the topmost layer from the remainder of the neural network, followed by imposing certain constraints separately on the two parts of the network. When moving to the imbalanced case, our analysis of the Layer-Peeled Model reveals a hitherto unknown phenomenon that we term Minority Collapse, which fundamentally limits the performance of deep learning models on the minority classes.

Neural Collapse 2020

Prevalence of Neural Collapse during the terminal phase of deep learning training. [paper] [code]
- Vardan Papyan, X.Y. Han, David L. Donoho. PNAS
- Key Word: Neural Collapse.
- Digest
  This paper studied the terminal phase of training (TPT) of today’s canonical deepnet training protocol. It documented that during TPT a process called Neural Collapse takes place, involving four fundamental and interconnected phenomena: (NC1)-(NC4).

Deep Double Descent

Deep Double Descent 2022

Sparse Double Descent: Where Network Pruning Aggravates Overfitting. [paper] [code]
- Zheng He, Zeke Xie, Quanzhi Zhu, Zengchang Qin. ICML 2022
- Key Word: Deep Double Descent; Lottery Ticket Hypothesis.
- Digest
  While recent studies focused on the deep double descent with respect to model overparameterization, they failed to recognize that sparsity may also cause double descent. In this paper, we have three main contributions. First, we report the novel sparse double descent phenomenon through extensive experiments. Second, for this phenomenon, we propose a novel learning distance interpretation that the curve of ℓ2 learning distance of sparse models (from initialized parameters to final parameters) may correlate with the sparse double descent curve well and reflect generalization better than minima flatness. Third, in the context of sparse double descent, a winning ticket in the lottery ticket hypothesis surprisingly may not always win.
Can Neural Nets Learn the Same Model Twice? Investigating Reproducibility and Double Descent from the Decision Boundary Perspective. [paper] [code]
- Gowthami Somepalli, Liam Fowl, Arpit Bansal, Ping-yeh Chiang, Yehuda Dar, Richard Baraniuk, Micah Goldblum, Tom Goldstein. CVPR 2022
- Key Word: Deep Double Descent; Manifold.
- Digest
  We discuss methods for visualizing neural network decision boundaries and decision regions. We use these visualizations to investigate issues related to reproducibility and generalization in neural network training. We observe that changes in model architecture (and its associated inductive bias) cause visible changes in decision boundaries, while multiple runs with the same architecture yield results with strong similarities, especially in the case of wide architectures. We also use decision boundary methods to visualize double descent phenomena.
Phenomenology of Double Descent in Finite-Width Neural Networks. [paper] [code]
- Sidak Pal Singh, Aurelien Lucchi, Thomas Hofmann, Bernhard Schölkopf. ICLR 2022
- Key Word: Deep Double Descent.
- Digest
  'Double descent' delineates the generalization behaviour of models depending on the regime they belong to: under- or over-parameterized. The current theoretical understanding behind the occurrence of this phenomenon is primarily based on linear and kernel regression models -- with informal parallels to neural networks via the Neural Tangent Kernel. Such analyses therefore do not adequately capture the mechanisms behind double descent in finite-width neural networks, and they disregard crucial components such as the choice of the loss function. We address these shortcomings by leveraging influence functions in order to derive suitable expressions of the population loss and its lower bound, while imposing minimal assumptions on the form of the parametric model.

Deep Double Descent 2021

Multi-scale Feature Learning Dynamics: Insights for Double Descent. [paper] [code]
- Mohammad Pezeshki, Amartya Mitra, Yoshua Bengio, Guillaume Lajoie.
- Key Word: Deep Double Descent.
- Digest
  We investigate the origins of the less studied epoch-wise double descent in which the test error undergoes two non-monotonic transitions, or descents, as the training time increases. By leveraging tools from statistical physics, we study a linear teacher-student setup exhibiting epoch-wise double descent similar to that in deep neural networks. In this setting, we derive closed-form analytical expressions for the evolution of generalization error over training. We find that double descent can be attributed to distinct features being learned at different scales: as fast-learning features overfit, slower-learning features start to fit, resulting in a second descent in test error.
Asymptotic Risk of Overparameterized Likelihood Models: Double Descent Theory for Deep Neural Networks. [paper]
- Ryumei Nakada, Masaaki Imaizumi.
- Key Word: Deep Double Descent.
- Digest
  We consider a likelihood maximization problem without the model constraints and analyze the upper bound of an asymptotic risk of an estimator with penalization. Technically, we combine a property of the Fisher information matrix with an extended Marchenko-Pastur law and associate the combination with empirical process techniques. The derived bound is general, as it describes both the double descent and the regularized risk curves, depending on the penalization.
Distilling Double Descent. [paper]
- Andrew Cotter, Aditya Krishna Menon, Harikrishna Narasimhan, Ankit Singh Rawat, Sashank J. Reddi, Yichen Zhou.
- Key Word: Deep Double Descent; Distillation.
- Digest
  Distillation is the technique of training a "student" model based on examples that are labeled by a separate "teacher" model, which itself is trained on a labeled dataset. The most common explanations for why distillation "works" are predicated on the assumption that the student is provided with soft labels, e.g. probabilities or confidences, from the teacher model. In this work, we show that, even when the teacher model is highly overparameterized and provides hard labels, using a very large held-out unlabeled dataset to train the student model can result in a model that outperforms more "traditional" approaches.

Deep Double Descent 2020

Understanding Double Descent Requires a Fine-Grained Bias-Variance Decomposition. [paper]
- Ben Adlam, Jeffrey Pennington. NeurIPS 2020
- Key Word: Deep Double Descent; Bias-Variance.
- Digest
  Classical learning theory suggests that the optimal generalization performance of a machine learning model should occur at an intermediate model complexity, with simpler models exhibiting high bias and more complex models exhibiting high variance of the predictive function. However, such a simple trade-off does not adequately describe deep learning models that simultaneously attain low bias and variance in the heavily overparameterized regime. A primary obstacle in explaining this behavior is that deep learning algorithms typically involve multiple sources of randomness whose individual contributions are not visible in the total variance. To enable fine-grained analysis, we describe an interpretable, symmetric decomposition of the variance into terms associated with the randomness from sampling, initialization, and the labels.
Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win. [paper] [code]
- Utku Evci, Yani A. Ioannou, Cem Keskin, Yann Dauphin. AAAI 2020
- Key Word: Lottery Ticket Hypothesis.
- Digest
  Sparse Neural Networks (NNs) can match the generalization of dense NNs using a fraction of the compute/storage for inference, and also have the potential to enable efficient training. However, naively training unstructured sparse NNs from random initialization results in significantly worse generalization, with the notable exceptions of Lottery Tickets (LTs) and Dynamic Sparse Training (DST). Through our analysis of gradient flow during training we attempt to answer: (1) why training unstructured sparse networks from random initialization performs poorly and; (2) what makes LTs and DST the exceptions?
Multiple Descent: Design Your Own Generalization Curve. [paper]
- Lin Chen, Yifei Min, Mikhail Belkin, Amin Karbasi. NeurIPS 2021
- Key Word: Deep Double Descent.
- Digest
  This paper explores the generalization loss of linear regression in variably parameterized families of models, both under-parameterized and over-parameterized. We show that the generalization curve can have an arbitrary number of peaks, and moreover, locations of those peaks can be explicitly controlled. Our results highlight the fact that both classical U-shaped generalization curve and the recently observed double descent curve are not intrinsic properties of the model family. Instead, their emergence is due to the interaction between the properties of the data and the inductive biases of learning algorithms.
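The double descent curve mentioned in this digest can be reproduced in a few lines of NumPy. The sketch below is not the paper's construction for placing peaks. It assumes isotropic Gaussian features, a minimum-norm least-squares fit on the first `p` of `d` features, and hypothetical sizes (`n = 40`, `d = 200`, noise level `0.5`) chosen only for illustration.

```python
import numpy as np

def min_norm_risk(p, n=40, d=200, noise=0.5, trials=20, seed=0):
    """Average test risk of min-norm least squares using the first p of d features."""
    rng = np.random.default_rng(seed)
    risks = []
    for _ in range(trials):
        beta = rng.normal(size=d)
        beta /= np.linalg.norm(beta)                 # true coefficients, unit norm
        X = rng.normal(size=(n, d))                  # isotropic Gaussian features
        y = X @ beta + noise * rng.normal(size=n)
        beta_hat = np.zeros(d)
        # pinv gives ordinary least squares for p < n and the min-norm interpolant for p > n.
        beta_hat[:p] = np.linalg.pinv(X[:, :p]) @ y
        # For x ~ N(0, I), E[(x @ (beta_hat - beta))**2] = ||beta_hat - beta||^2.
        risks.append(np.sum((beta_hat - beta) ** 2))
    return float(np.mean(risks))

risk_under = min_norm_risk(p=10)     # under-parameterized (p < n)
risk_peak = min_norm_risk(p=40)      # interpolation threshold (p = n)
risk_over = min_norm_risk(p=200)     # over-parameterized (p > n)
print(risk_under, risk_peak, risk_over)
```

Under these assumptions the risk is largest near `p = n` and falls again as `p` grows, which is the single-peak curve that the paper shows to be only one of many possible shapes.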
Early Stopping in Deep Networks: Double Descent and How to Eliminate it. [paper] [code]
- Reinhard Heckel, Fatih Furkan Yilmaz. ICLR 2021
- Key Word: Deep Double Descent; Early Stopping.
- Digest
  We show that such epoch-wise double descent arises for a different reason: It is caused by a superposition of two or more bias-variance tradeoffs that arise because different parts of the network are learned at different epochs, and eliminating this by proper scaling of stepsizes can significantly improve the early stopping performance. We show this analytically for i) linear regression, where differently scaled features give rise to a superposition of bias-variance tradeoffs, and for ii) a two-layer neural network, where the first and second layer each govern a bias-variance tradeoff. Inspired by this theory, we study two standard convolutional networks empirically and show that eliminating epoch-wise double descent through adjusting stepsizes of different layers improves the early stopping performance significantly.
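The linear-regression case in this digest, where differently scaled features are learned at different speeds, can be illustrated with plain gradient descent. This is a minimal sketch rather than the paper's experiment: the data are noiseless, the design is built to have orthogonal columns, and the feature scales and step sizes are hypothetical values picked for readability.

```python
import numpy as np

def gd_least_squares(X, y, step_sizes, n_steps):
    """Gradient descent on mean squared error with per-coordinate step sizes."""
    n = X.shape[0]
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        grad = X.T @ (X @ w - y) / n
        w = w - step_sizes * grad
    return w

rng = np.random.default_rng(0)
n = 50
scales = np.array([1.0, 0.1])                    # hypothetical feature scales
Q, _ = np.linalg.qr(rng.normal(size=(n, 2)))     # orthonormal columns
X = np.sqrt(n) * Q * scales                      # X.T @ X / n equals diag(scales**2)
w_true = np.array([1.0, 1.0])
y = X @ w_true                                   # noiseless targets for clarity

# Same step size for both coordinates: the small-scale feature is learned much later.
w_uniform = gd_least_squares(X, y, np.full(2, 0.5), n_steps=10)
# Step sizes rescaled by 1 / scale**2: both coordinates converge at the same rate.
w_scaled = gd_least_squares(X, y, 0.5 / scales**2, n_steps=10)
print(w_uniform, w_scaled)
```

With a shared step size the two coordinates reach their targets at very different epochs, so no single stopping time suits both. Rescaling the step sizes aligns the two learning curves, which is the intuition behind adjusting per-layer step sizes in the digest.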
Triple descent and the two kinds of overfitting: Where & why do they appear? [paper] [code]
- Stéphane d'Ascoli, Levent Sagun, Giulio Biroli.
- Key Word: Deep Double Descent.
- Digest
  In this paper, we show that despite their apparent similarity, these two scenarios are inherently different. In fact, both peaks can co-exist when neural networks are applied to noisy regression tasks. The relative size of the peaks is governed by the degree of nonlinearity of the activation function. Building on recent developments in the analysis of random feature models, we provide a theoretical ground for this sample-wise triple descent.
A Brief Prehistory of Double Descent. [paper]
- Marco Loog, Tom Viering, Alexander Mey, Jesse H. Krijthe, David M.J. Tax.
- Key Word: Deep Double Descent.
- Digest
  This letter draws attention to some original, earlier findings, of interest to double descent.
Double Trouble in Double Descent: Bias and Variance(s) in the Lazy Regime. [paper] [code]
- Stéphane d'Ascoli, Maria Refinetti, Giulio Biroli, Florent Krzakala. ICML 2020
- Key Word: Deep Double Descent; Bias-Variance.
- Digest
  Deep neural networks can achieve remarkable generalization performances while interpolating the training data perfectly. Rather than the U-curve emblematic of the bias-variance trade-off, their test error often follows a "double descent" - a mark of the beneficial role of overparametrization. In this work, we develop a quantitative theory for this phenomenon in the so-called lazy learning regime of neural networks, by considering the problem of learning a high-dimensional function with random features regression. We obtain a precise asymptotic expression for the bias-variance decomposition of the test error, and show that the bias displays a phase transition at the interpolation threshold, beyond which it remains constant.
Rethinking Bias-Variance Trade-off for Generalization of Neural Networks. [paper] [code]
- Zitong Yang, Yaodong Yu, Chong You, Jacob Steinhardt, Yi Ma. ICML 2020
- Key Word: Deep Double Descent; Bias-Variance.
- Digest
  The classical bias-variance trade-off predicts that bias decreases and variance increases with model complexity, leading to a U-shaped risk curve. Recent work calls this into question for neural networks and other over-parameterized models, for which it is often observed that larger models generalize better. We provide a simple explanation for this by measuring the bias and variance of neural networks: while the bias is monotonically decreasing as in the classical theory, the variance is unimodal or bell-shaped: it increases then decreases with the width of the network.
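Measuring bias and variance, as this digest describes, amounts to training several models on independent training sets and comparing their predictions on a fixed test set. The sketch below is an assumption-laden toy: it uses polynomial least squares on noisy samples of `sin(x)` as a stand-in for neural networks of varying width, and the helper names (`bias_variance`, `fit_predict`) are hypothetical.

```python
import numpy as np

def bias_variance(preds, targets):
    """Estimate squared bias and variance.

    preds:   (n_models, n_test) predictions of models trained on independent data.
    targets: (n_test,) noiseless targets.
    """
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - targets) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

rng = np.random.default_rng(0)
x_test = np.linspace(-3, 3, 100)
targets = np.sin(x_test)

def fit_predict(degree):
    """Fit a polynomial to a fresh noisy training set and predict on x_test."""
    x = rng.uniform(-3, 3, size=30)
    y = np.sin(x) + 0.3 * rng.normal(size=30)
    coeffs = np.polyfit(x, y, degree)
    return np.polyval(coeffs, x_test)

preds = np.stack([fit_predict(degree=3) for _ in range(50)])
bias_sq, variance = bias_variance(preds, targets)
risk = np.mean((preds - targets) ** 2)
print(bias_sq, variance, risk)   # risk equals bias_sq + variance
```

Repeating this for a range of model sizes gives the bias and variance curves whose shapes the paper studies.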
The Curious Case of Adversarially Robust Models: More Data Can Help, Double Descend, or Hurt Generalization. [paper]
- Yifei Min, Lin Chen, Amin Karbasi. UAI 2021
- Key Word: Deep Double Descent.
- Digest
  We challenge this conventional belief and show that more training data can hurt the generalization of adversarially robust models in classification problems. We first investigate the Gaussian mixture classification with a linear loss and identify three regimes based on the strength of the adversary. In the weak adversary regime, more data improves the generalization of adversarially robust models. In the medium adversary regime, with more training data, the generalization loss exhibits a double descent curve, which implies the existence of an intermediate stage where more training data hurts the generalization. In the strong adversary regime, more data almost immediately causes the generalization error to increase.

Deep Double Descent: 2019

Deep Double Descent: Where Bigger Models and More Data Hurt. [paper]
- Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever. ICLR 2020
- Key Word: Deep Double Descent.
- Digest
  We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better.

Deep Double Descent: 2018

Reconciling modern machine learning practice and the bias-variance trade-off. [paper]
- Mikhail Belkin, Daniel Hsu, Siyuan Ma, Soumik Mandal. PNAS
- Key Word: Bias-Variance; Over-Parameterization.
- Digest
  In this paper, we reconcile the classical understanding and the modern practice within a unified performance curve. This "double descent" curve subsumes the textbook U-shaped bias-variance trade-off curve by showing how increasing model capacity beyond the point of interpolation results in improved performance.
A Modern Take on the Bias-Variance Tradeoff in Neural Networks. [paper]
- Brady Neal, Sarthak Mittal, Aristide Baratin, Vinayak Tantia, Matthew Scicluna, Simon Lacoste-Julien, Ioannis Mitliagkas.
- Key Word: Bias-Variance; Over-Parameterization.
- Digest
  The bias-variance tradeoff tells us that as model complexity increases, bias falls and variance increases, leading to a U-shaped test error curve. However, recent empirical results with over-parameterized neural networks are marked by a striking absence of the classic U-shaped test error curve: test error keeps decreasing in wider networks. Motivated by the shaky evidence used to support this claim in neural networks, we measure bias and variance in the modern setting. We find that both bias and variance can decrease as the number of parameters grows. To better understand this, we introduce a new decomposition of the variance to disentangle the effects of optimization and data sampling.
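One natural way to separate variance due to optimization from variance due to data sampling is the law of total variance. The sketch below assumes that split; the paper's exact estimator may differ. The predictions are synthetic (a per-dataset effect plus a per-seed effect), and all names are hypothetical.

```python
import numpy as np

def variance_decomposition(preds):
    """Split prediction variance by the law of total variance.

    preds: (n_datasets, n_seeds, n_test), one model per (training set, seed) pair.
    Returns (total, from_optimization, from_sampling), averaged over test points.
    """
    n_test = preds.shape[-1]
    total = preds.reshape(-1, n_test).var(axis=0).mean()
    # Expected variance over seeds for a fixed training set.
    from_optimization = preds.var(axis=1).mean(axis=0).mean()
    # Variance over training sets of the seed-averaged prediction.
    from_sampling = preds.mean(axis=1).var(axis=0).mean()
    return total, from_optimization, from_sampling

rng = np.random.default_rng(0)
data_effect = rng.normal(scale=1.0, size=(8, 1, 20))   # varies with the training set
seed_effect = rng.normal(scale=0.5, size=(8, 5, 20))   # varies with the random seed
preds = data_effect + seed_effect

total, var_opt, var_data = variance_decomposition(preds)
print(total, var_opt, var_data)   # total equals var_opt + var_data
```

With equal numbers of seeds per training set, the two terms sum exactly to the total variance.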

Lottery Ticket Hypothesis

Lottery Ticket Hypothesis: 2022

Revisit Kernel Pruning with Lottery Regulated Grouped Convolutions. [paper] [code]
- Shaochen Zhong, Guanqun Zhang, Ningjia Huang, Shuai Xu. ICLR 2022
- Key Word: Lottery Ticket Hypothesis.
- Digest
  We revisit the idea of kernel pruning, a heavily overlooked approach under the context of structured pruning. This is because kernel pruning will naturally introduce sparsity to filters within the same convolutional layer — thus, making the remaining network no longer dense. We address this problem by proposing a versatile grouped pruning framework where we first cluster filters from each convolutional layer into equal-sized groups, prune the grouped kernels we deem unimportant from each filter group, then permute the remaining filters to form a densely grouped convolutional architecture (which also enables the parallel computing capability) for fine-tuning.
Proving the Lottery Ticket Hypothesis for Convolutional Neural Networks. [paper]
- Arthur da Cunha, Emanuele Natale, Laurent Viennot. ICLR 2022
- Key Word: Lottery Ticket Hypothesis.
- Digest
  Recent theoretical works proved an even stronger version: every sufficiently overparameterized (dense) neural network contains a subnetwork that, even without training, achieves accuracy comparable to that of the trained large network. These works left as an open problem to extend the result to convolutional neural networks (CNNs). In this work we provide such generalization by showing that, with high probability, it is possible to approximate any CNN by pruning a random CNN whose size is larger by a logarithmic factor.
Audio Lottery: Speech Recognition Made Ultra-Lightweight, Noise-Robust, and Transferable. [paper] [code]
- Shaojin Ding, Tianlong Chen, Zhangyang Wang. ICLR 2022
- Key Word: Lottery Ticket Hypothesis; Speech Recognition.
- Digest
  We investigate the tantalizing possibility of using the lottery ticket hypothesis to discover lightweight speech recognition models that are (1) robust to various noise existing in speech; (2) transferable to fit the open-world personalization; and (3) compatible with structured sparsity.
Strong Lottery Ticket Hypothesis with ε-perturbation. [paper]
- Zheyang Xiong, Fangshuo Liao, Anastasios Kyrillidis.
- Key Word: Lottery Ticket Hypothesis.
- Digest
  The strong Lottery Ticket Hypothesis (LTH) claims the existence of a subnetwork in a sufficiently large, randomly initialized neural network that approximates some target neural network without the need of training. We extend the theoretical guarantee of the strong LTH literature to a scenario more similar to the original LTH, by generalizing the weight change in the pre-training step to some perturbation around initialization.
Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers. [paper]
- Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv Kumar.
- Key Word: Sparse Activation; Large Models; Transformers.
- Digest
  This paper studies the curious phenomenon for machine learning models with Transformer architectures that their activation maps are sparse. By activation map we refer to the intermediate output of the multi-layer perceptrons (MLPs) after a ReLU activation function, and by "sparse" we mean that on average very few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each input to MLP.
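The sparsity measure defined in this digest, the fraction of nonzero entries after the ReLU of an MLP block, is straightforward to compute. The sketch below applies it to a randomly initialized layer with hypothetical sizes. An untrained layer like this one is not expected to show the low percentages reported for trained Transformers; the code only illustrates how the quantity is measured.

```python
import numpy as np

def activation_nonzero_fraction(x, w_in, b_in):
    """Fraction of nonzero entries in ReLU(x @ w_in + b_in)."""
    hidden = np.maximum(x @ w_in + b_in, 0.0)
    return float(np.mean(hidden > 0))

rng = np.random.default_rng(0)
x = rng.normal(size=(512, 64))                    # 512 inputs of width 64
w_in = rng.normal(size=(64, 256)) / np.sqrt(64)   # first MLP layer, random weights
b_in = np.zeros(256)

frac = activation_nonzero_fraction(x, w_in, b_in)
print(f"nonzero fraction: {frac:.3f}")
```

For symmetric random weights roughly half of the entries are nonzero, which gives a baseline against which the sparsity of a trained model can be compared.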
Unmasking the Lottery Ticket Hypothesis: What's Encoded in a Winning Ticket's Mask? [paper]
- Mansheej Paul, Feng Chen, Brett W. Larsen, Jonathan Frankle, Surya Ganguli, Gintare Karolina Dziugaite.
- Key Word: Lottery Ticket Hypothesis; Mode Connectivity.
- Digest
  First, we find that—at higher sparsities—pairs of pruned networks at successive pruning iterations are connected by a linear path with zero error barrier if and only if they are matching. This indicates that masks found at the end of training convey the identity of an axial subspace that intersects a desired linearly connected mode of a matching sublevel set. Second, we show SGD can exploit this information due to a strong form of robustness: it can return to this mode despite strong perturbations early in training. Third, we show how the flatness of the error landscape at the end of training determines a limit on the fraction of weights that can be pruned at each iteration of IMP. Finally, we show that the role of retraining in IMP is to find a network with new small weights to prune.
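The "error barrier" along a linear path can be made concrete with a small helper. The sketch assumes one common definition (the maximum loss on the path minus the mean loss of the two endpoints) and uses two toy loss functions in place of a network's test error; `bowl` and `ring` are hypothetical names.

```python
import numpy as np

def error_barrier(loss_fn, w_a, w_b, num_points=11):
    """Maximum loss on the segment from w_a to w_b minus the mean endpoint loss."""
    alphas = np.linspace(0.0, 1.0, num_points)
    path = [loss_fn((1 - a) * w_a + a * w_b) for a in alphas]
    return max(path) - 0.5 * (path[0] + path[-1])

def bowl(w):
    """Convex loss with a single basin."""
    return float(np.sum(w ** 2))

def ring(w):
    """Non-convex loss whose minima lie on the unit circle."""
    return float((np.sum(w ** 2) - 1.0) ** 2)

w_a = np.array([1.0, 0.0])
w_b = np.array([-1.0, 0.0])
barrier_bowl = error_barrier(bowl, w_a, w_b)
barrier_ring = error_barrier(ring, w_a, w_b)
print(barrier_bowl, barrier_ring)
```

The convex loss gives a zero barrier between these two points, while the ring-shaped loss gives a barrier of one because the straight path crosses the high-loss region at the origin. Two solutions are called linearly mode connected when this barrier is (close to) zero.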
How Erdös and Rényi Win the Lottery. [paper]
- Advait Gadhikar, Sohum Mukherjee, Rebekka Burkholz.
- Key Word: Lottery Ticket Hypothesis; Erdös-Rényi Random Graphs.
- Digest
  Random masks define surprisingly effective sparse neural network models, as has been shown empirically. The resulting Erdös-Rényi (ER) random graphs can often compete with dense architectures and state-of-the-art lottery ticket pruning algorithms struggle to outperform them, even though the random baselines do not rely on computationally expensive pruning-training iterations but can be drawn initially without significant computational overhead. We offer a theoretical explanation of how such ER masks can approximate arbitrary target networks if they are wider by a logarithmic factor in the inverse sparsity 1/log(1/sparsity).
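A random Erdös-Rényi mask keeps each weight independently with a fixed probability. The sketch below assumes the same density for every layer and hypothetical layer shapes; layer-wise density choices, which the paper may use, are not modeled here.

```python
import numpy as np

def er_mask(shape, density, rng):
    """Keep each entry independently with probability `density`."""
    return rng.random(shape) < density

rng = np.random.default_rng(0)
layer_shapes = [(784, 300), (300, 100), (100, 10)]   # hypothetical MLP
density = 0.1

masks = [er_mask(shape, density, rng) for shape in layer_shapes]
# Sparse random initialization: masked entries are exactly zero.
weights = [rng.normal(size=shape) * mask for shape, mask in zip(layer_shapes, masks)]
densities = [float(mask.mean()) for mask in masks]
print(densities)
```

Drawing such a mask costs a single pass over the parameters, which is why the digest contrasts it with the repeated pruning-training iterations of lottery ticket algorithms.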
SparCL: Sparse Continual Learning on the Edge. [paper]
- Zifeng Wang, Zheng Zhan, Yifan Gong, Geng Yuan, Wei Niu, Tong Jian, Bin Ren, Stratis Ioannidis, Yanzhi Wang, Jennifer Dy. NeurIPS 2022
- Key Word: Continual Learning; Sparse Training.
- Digest
  We propose a novel framework called Sparse Continual Learning (SparCL), which is the first study that leverages sparsity to enable cost-effective continual learning on edge devices. SparCL achieves both training acceleration and accuracy preservation through the synergy of three aspects: weight sparsity, data efficiency, and gradient sparsity. Specifically, we propose task-aware dynamic masking (TDM) to learn a sparse network throughout the entire CL process, dynamic data removal (DDR) to remove less informative training data, and dynamic gradient masking (DGM) to sparsify the gradient updates.
One-shot Network Pruning at Initialization with Discriminative Image Patches. [paper]
- Yinan Yang, Ying Ji, Yu Wang, Heng Qi, Jien Kato.
- Key Word: One-Shot Network Pruning.
- Digest
  We propose two novel methods, Discriminative One-shot Network Pruning (DOP) and Super Stitching, to prune the network by high-level visual discriminative image patches. Our contributions are as follows. (1) Extensive experiments reveal that One-shot Network Pruning at Initialization (OPaI) is data-dependent. (2) Super Stitching performs significantly better than the original OPaI method on benchmark ImageNet, especially in a highly compressed model.
SuperTickets: Drawing Task-Agnostic Lottery Tickets from Supernets via Jointly Architecture Searching and Parameter Pruning. [paper] [code]
- Haoran You, Baopu Li, Zhanyi Sun, Xu Ouyang, Yingyan Lin. ECCV 2022
- Key Word: Lottery Ticket Hypothesis; Neural Architecture Search.
- Digest
  We discover for the first time that both efficient DNNs and their lottery subnetworks (i.e., lottery tickets) can be directly identified from a supernet, which we term as SuperTickets, via a two-in-one training scheme with jointly architecture searching and parameter pruning. Moreover, we develop a progressive and unified SuperTickets identification strategy that allows the connectivity of subnetworks to change during supernet training, achieving better accuracy and efficiency trade-offs than conventional sparse training.
Lottery Ticket Hypothesis for Spiking Neural Networks. [paper]
- Youngeun Kim, Yuhang Li, Hyoungseob Park, Yeshwanth Venkatesha, Ruokai Yin, Priyadarshini Panda. ECCV 2022
- Key Word: Lottery Ticket Hypothesis; Spiking Neural Networks.
- Digest
  Spiking Neural Networks (SNNs) have recently emerged as a new generation of low-power deep neural networks where binary spikes convey information across multiple timesteps. Pruning for SNNs is highly important as they become deployed on resource-constrained mobile/edge devices. Previous SNN pruning works focus on shallow SNNs (2~6 layers); however, state-of-the-art SNN works propose deeper SNNs (>16 layers), which are difficult to handle with current pruning techniques. To scale up a pruning technique toward deep SNNs, we investigate the Lottery Ticket Hypothesis (LTH), which states that dense networks contain smaller subnetworks (i.e., winning tickets) that achieve comparable performance to the dense networks. Our studies on LTH reveal that the winning tickets consistently exist in deep SNNs across various datasets and architectures, providing up to 97% sparsity without huge performance degradation.
Winning the Lottery Ahead of Time: Efficient Early Network Pruning. [paper]
- John Rachwan, Daniel Zügner, Bertrand Charpentier, Simon Geisler, Morgane Ayle, Stephan Günnemann. ICML 2022
- Key Word: Lottery Ticket Hypothesis; Neural Tangent Kernel.
- Digest
  Although state-of-the-art pruning methods extract highly sparse models, they neglect two main challenges: (1) the process of finding these sparse models is often very expensive; (2) unstructured pruning does not provide benefits in terms of GPU memory, training time, or carbon emissions. We propose Early Compression via Gradient Flow Preservation (EarlyCroP), which efficiently extracts state-of-the-art sparse models before or early in training addressing challenge (1), and can be applied in a structured manner addressing challenge (2). This enables us to train sparse networks on commodity GPUs whose dense versions would be too large, thereby saving costs and reducing hardware requirements.
"Understanding Robustness Lottery": A Comparative Visual Analysis of Neural Network Pruning Approaches. [paper]
- Zhimin Li, Shusen Liu, Xin Yu, Bhavya Kailkhura, Jie Cao, James Daniel Diffenderfer, Peer-Timo Bremer, Valerio Pascucci.
- Key Word: Lottery Ticket Hypothesis; Out-of-Distribution Generalization; Visualization.
- Digest
  This work aims to shed light on how different pruning methods alter the network's internal feature representation, and the corresponding impact on model performance. To provide a meaningful comparison and characterization of model feature space, we use three geometric metrics that are decomposed from the commonly adopted classification loss. With these metrics, we design a visualization system to highlight the impact of pruning on model prediction as well as the latent feature embedding.
Data-Efficient Double-Win Lottery Tickets from Robust Pre-training. [paper] [code]
- Tianlong Chen, Zhenyu Zhang, Sijia Liu, Yang Zhang, Shiyu Chang, Zhangyang Wang. ICML 2022
- Key Word: Lottery Ticket Hypothesis; Adversarial Training; Robust Pre-training.
- Digest
  We formulate a more rigorous concept, Double-Win Lottery Tickets, in which a located subnetwork from a pre-trained model can be independently transferred on diverse downstream tasks, to reach BOTH the same standard and robust generalization, under BOTH standard and adversarial training regimes, as the full pre-trained model can do. We comprehensively examine various pre-training mechanisms and find that robust pre-training tends to craft sparser double-win lottery tickets with superior performance over the standard counterparts.
HideNseek: Federated Lottery Ticket via Server-side Pruning and Sign Supermask. [paper]
- Anish K. Vallapuram, Pengyuan Zhou, Young D. Kwon, Lik Hang Lee, Hengwei Xu, Pan Hui.
- Key Word: Lottery Ticket Hypothesis; Federated Learning.
- Digest
  We propose HideNseek which employs one-shot data-agnostic pruning at initialization to get a subnetwork based on weights' synaptic saliency. Each client then optimizes a sign supermask multiplied by the unpruned weights to allow faster convergence with the same compression rates as state-of-the-art.
Lottery Tickets on a Data Diet: Finding Initializations with Sparse Trainable Networks. [paper] [code]
- Mansheej Paul, Brett W. Larsen, Surya Ganguli, Jonathan Frankle, Gintare Karolina Dziugaite. NeurIPS 2022
- Key Word: Lottery Ticket Hypothesis; Pre-training.
- Digest
  We seek to understand how this early phase of pre-training leads to a good initialization for IMP both through the lens of the data distribution and the loss landscape geometry. Empirically we observe that, holding the number of pre-training iterations constant, training on a small fraction of (randomly chosen) data suffices to obtain an equally good initialization for IMP. We additionally observe that by pre-training only on "easy" training data, we can decrease the number of steps necessary to find a good initialization for IMP compared to training on the full dataset or a randomly chosen subset. Finally, we identify novel properties of the loss landscape of dense networks that are predictive of IMP performance, showing in particular that more examples being linearly mode connected in the dense network correlates well with good initializations for IMP.
Analyzing Lottery Ticket Hypothesis from PAC-Bayesian Theory Perspective. [paper]
- Keitaro Sakamoto, Issei Sato. NeurIPS 2022
- Key Word: Lottery Ticket Hypothesis; PAC-Bayes.
- Digest
  We confirm this hypothesis and show that the PAC-Bayesian theory can provide an explicit understanding of the relationship between LTH and generalization behavior. On the basis of our experimental findings that flatness is useful for improving accuracy and robustness to label noise and that the distance from the initial weights is deeply involved in winning tickets, we offer the PAC-Bayes bound using a spike-and-slab distribution to analyze winning tickets.
Dual Lottery Ticket Hypothesis. [paper] [code]
- Yue Bai, Huan Wang, Zhiqiang Tao, Kunpeng Li, Yun Fu. ICLR 2022
- Key Word: Lottery Ticket Hypothesis.
- Digest
  This paper articulates a Dual Lottery Ticket Hypothesis (DLTH) as a dual format of the original Lottery Ticket Hypothesis (LTH). Correspondingly, a simple regularization-based sparse network training strategy, Random Sparse Network Transformation (RST), is proposed to validate DLTH and enhance sparse network training.
Rare Gems: Finding Lottery Tickets at Initialization. [paper]
- Kartik Sreenivasan, Jy-yong Sohn, Liu Yang, Matthew Grinde, Alliot Nagle, Hongyi Wang, Eric Xing, Kangwook Lee, Dimitris Papailiopoulos. NeurIPS 2022
- Key Word: Lottery Ticket Hypothesis; Sanity Checks; Pruning at Initialization.
- Digest
  Finding lottery tickets that train to better accuracy compared to simple baselines remains an open problem. In this work, we resolve this open problem by proposing Gem-Miner which finds lottery tickets at initialization that beat current baselines. Gem-Miner finds lottery tickets trainable to accuracy competitive or better than Iterative Magnitude Pruning (IMP), and does so up to 19× faster.
Reconstruction Task Finds Universal Winning Tickets. [paper]
- Ruichen Li, Binghui Li, Qi Qian, Liwei Wang.
- Key Word: Lottery Ticket Hypothesis; Self-Supervision.
- Digest
  We show that the image-level pretrain task is not capable of pruning models for diverse downstream tasks. To mitigate this problem, we introduce image reconstruction, a pixel-level task, into the traditional pruning framework. Concretely, an autoencoder is trained based on the original model, and then the pruning process is optimized with both autoencoder and classification losses.
Finding Dynamics Preserving Adversarial Winning Tickets. [paper] [code]
- Xupeng Shi, Pengfei Zheng, A. Adam Ding, Yuan Gao, Weizhong Zhang. AISTATS 2022
- Key Word: Lottery Ticket Hypothesis; Neural Tangent Kernel.
- Digest
  Based on recent works on the Neural Tangent Kernel (NTK), we systematically study the dynamics of adversarial training and prove the existence of a trainable sparse sub-network at initialization which can be trained to be adversarially robust from scratch. This theoretically verifies the lottery ticket hypothesis in the adversarial context, and we refer to such a sub-network structure as an Adversarial Winning Ticket (AWT). We also show empirical evidence that AWT preserves the dynamics of adversarial training and achieves performance equal to that of dense adversarial training.

Lottery Ticket Hypothesis: 2021

Plant 'n' Seek: Can You Find the Winning Ticket? [paper] [code]
- Jonas Fischer, Rebekka Burkholz. ICLR 2022
- Key Word: Lottery Ticket Hypothesis.
- Digest
  Currently, such algorithms are primarily evaluated on imaging data, for which we lack ground truth information and thus the understanding of how sparse lottery tickets could be. To fill this gap, we develop a framework that allows us to plant and hide winning tickets with desirable properties in randomly initialized neural networks. To analyze the ability of state-of-the-art pruning to identify tickets of extreme sparsity, we design and hide such tickets solving four challenging tasks.
On the Existence of Universal Lottery Tickets. [paper] [code]
- Rebekka Burkholz, Nilanjana Laha, Rajarshi Mukherjee, Alkis Gotovos. ICLR 2022
- Key Word: Lottery Ticket Hypothesis.
- Digest
  The lottery ticket hypothesis conjectures the existence of sparse subnetworks of large randomly initialized deep neural networks that can be successfully trained in isolation. Recent work has experimentally observed that some of these tickets can be practically reused across a variety of tasks, hinting at some form of universality. We formalize this concept and theoretically prove that not only do such universal tickets exist but they also do not require further training.
Universality of Winning Tickets: A Renormalization Group Perspective. [paper]
- William T. Redman, Tianlong Chen, Zhangyang Wang, Akshunna S. Dogra. ICML 2022
- Key Word: Lottery Ticket Hypothesis; Renormalization Group Theory.
- Digest
  Foundational work on the Lottery Ticket Hypothesis has suggested an exciting corollary: winning tickets found in the context of one task can be transferred to similar tasks, possibly even across different architectures. This has generated broad interest, but methods to study this universality are lacking. We make use of renormalization group theory, a powerful tool from theoretical physics, to address this need. We find that iterative magnitude pruning, the principal algorithm used for discovering winning tickets, is a renormalization group scheme, and can be viewed as inducing a flow in parameter space.
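Iterative magnitude pruning, which this digest names as the principal algorithm for discovering winning tickets, alternates between training and removing the smallest-magnitude surviving weights, restarting each round from the initial values. The sketch below shows only that loop structure on a single weight array. `train_fn` is a hypothetical placeholder for real training, and the identity function used in the demo is a stand-in, so the resulting mask simply keeps the largest initial weights.

```python
import numpy as np

def magnitude_prune(weights, mask, fraction):
    """Remove the given fraction of still-active weights with the smallest magnitude."""
    active = np.flatnonzero(mask)
    n_prune = int(round(fraction * active.size))
    order = np.argsort(np.abs(weights.ravel()[active]))
    new_mask = mask.ravel().copy()
    new_mask[active[order[:n_prune]]] = False
    return new_mask.reshape(mask.shape)

def iterative_magnitude_pruning(init_weights, train_fn, rounds, fraction):
    """Train, prune by magnitude, rewind to the initial weights, and repeat."""
    mask = np.ones_like(init_weights, dtype=bool)
    for _ in range(rounds):
        trained = train_fn(init_weights * mask, mask)   # always restart from init
        mask = magnitude_prune(trained, mask, fraction)
    return mask

def identity_train(weights, mask):
    """Stand-in for training: returns the weights unchanged (assumption for the demo)."""
    return weights

rng = np.random.default_rng(0)
init = rng.normal(size=(10, 10))
mask = iterative_magnitude_pruning(init, identity_train, rounds=2, fraction=0.2)
print(mask.sum(), "of", mask.size, "weights remain")
```

Pruning 20% of the surviving weights per round leaves 80 and then 64 of the 100 weights. The paper's renormalization group view treats each such round as one step of a flow in parameter space.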
How many degrees of freedom do we need to train deep networks: a loss landscape perspective. [paper] [code]
- Brett W. Larsen, Stanislav Fort, Nic Becker, Surya Ganguli. ICLR 2022
- Key Word: Loss Landscape; Lottery Ticket Hypothesis.
- Digest
  A variety of recent works, spanning pruning, lottery tickets, and training within random subspaces, have shown that deep neural networks can be trained using far fewer degrees of freedom than the total number of parameters. We analyze this phenomenon for random subspaces by first examining the success probability of hitting a training loss sublevel set when training within a random subspace of a given training dimensionality.
A Winning Hand: Compressing Deep Networks Can Improve Out-Of-Distribution Robustness. [paper]
- James Diffenderfer, Brian R. Bartoldson, Shreya Chaganti, Jize Zhang, Bhavya Kailkhura. NeurIPS 2021
- Key Word: Lottery Ticket Hypothesis; Out-of-Distribution Generalization.
- Digest
  We perform a large-scale analysis of popular model compression techniques which uncovers several intriguing patterns. Notably, in contrast to traditional pruning approaches (e.g., fine tuning and gradual magnitude pruning), we find that "lottery ticket-style" approaches can surprisingly be used to produce CARDs, including binary-weight CARDs. Specifically, we are able to create extremely compact CARDs that, compared to their larger counterparts, have similar test accuracy and matching (or better) robustness -- simply by pruning and (optionally) quantizing.
Efficient Lottery Ticket Finding: Less Data is More. [paper] [code]
- Zhenyu Zhang, Xuxi Chen, Tianlong Chen, Zhangyang Wang. ICML 2021
- Key Word: Lottery Ticket Hypothesis.
- Digest
  This paper explores a new perspective on finding lottery tickets more efficiently, by doing so only with a specially selected subset of data, called Pruning-Aware Critical set (PrAC set), rather than using the full training set. The concept of PrAC set was inspired by the recent observation, that deep networks have samples that are either hard to memorize during training, or easy to forget during pruning.
A Probabilistic Approach to Neural Network Pruning. [paper]
- Xin Qian, Diego Klabjan. ICML 2021
- Key Word: Lottery Ticket Hypothesis.
- Digest
  We theoretically study the performance of two pruning techniques (random and magnitude-based) on FCNs and CNNs. Given a target network whose weights are independently sampled from appropriate distributions, we provide a universal approach to bound the gap between a pruned and the target network in a probabilistic sense. The results establish that there exist pruned networks with expressive power within any specified bound from the target network.
On Lottery Tickets and Minimal Task Representations in Deep Reinforcement Learning. [paper]
- Marc Aurel Vischer, Robert Tjarko Lange, Henning Sprekeler. ICLR 2022
- Key Word: Reinforcement Learning; Lottery Ticket Hypothesis.
- Digest
  The lottery ticket hypothesis questions the role of overparameterization in supervised deep learning. But how is the performance of winning lottery tickets affected by the distributional shift inherent to reinforcement learning problems? In this work, we address this question by comparing sparse agents who have to address the non-stationarity of the exploration-exploitation problem with supervised agents trained to imitate an expert. We show that feed-forward networks trained with behavioural cloning compared to reinforcement learning can be pruned to higher levels of sparsity without performance degradation.
Multi-Prize Lottery Ticket Hypothesis: Finding Accurate Binary Neural Networks by Pruning A Randomly Weighted Network. [paper] [code]
- James Diffenderfer, Bhavya Kailkhura. ICLR 2021
- Key Word: Lottery Ticket Hypothesis; Binary Neural Networks.
- Digest
  This provides a new paradigm for learning compact yet highly accurate binary neural networks simply by pruning and quantizing randomly weighted full precision neural networks. We also propose an algorithm for finding multi-prize tickets (MPTs) and test it by performing a series of experiments on CIFAR-10 and ImageNet datasets. Empirical results indicate that as models grow deeper and wider, multi-prize tickets start to reach similar (and sometimes even higher) test accuracy compared to their significantly larger and full-precision counterparts that have been weight-trained.
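The "pruning and quantizing" step described in this digest can be sketched on one random weight matrix. This is an illustration under loud assumptions, not the paper's algorithm for finding multi-prize tickets: the mask here is chosen by weight magnitude as a stand-in for a learned mask, and the per-layer scale `alpha` (mean magnitude of the kept weights) is likewise an assumption.

```python
import numpy as np

def prune_and_binarize(weights, keep_fraction):
    """Keep the largest-magnitude weights and replace them by alpha * sign(w)."""
    flat = np.abs(weights).ravel()
    k = max(1, int(round(keep_fraction * flat.size)))
    threshold = np.sort(flat)[-k]                 # magnitude of the k-th largest weight
    mask = np.abs(weights) >= threshold           # stand-in for a learned mask
    alpha = np.abs(weights[mask]).mean()          # assumed per-layer scaling factor
    binary = alpha * np.sign(weights) * mask
    return binary, mask, alpha

rng = np.random.default_rng(0)
weights = rng.normal(size=(20, 20))               # random, untrained weights
binary, mask, alpha = prune_and_binarize(weights, keep_fraction=0.5)
print(mask.sum(), np.unique(np.abs(binary)))
```

Every surviving weight takes one of two values, `+alpha` or `-alpha`, so the layer can be stored as a mask, a sign bit per kept weight, and a single scale.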
Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training. [paper] [code]
- Shiwei Liu, Lu Yin, Decebal Constantin Mocanu, Mykola Pechenizkiy. ICML 2021
- Key Word: Lottery Ticket Hypothesis.
- Digest
  In this paper, we introduce a new perspective on training deep neural networks capable of state-of-the-art performance without the need for the expensive over-parameterization by proposing the concept of In-Time Over-Parameterization (ITOP) in sparse training. By starting from a random sparse network and continuously exploring sparse connectivities during training, we can perform an Over-Parameterization in the space-time manifold, closing the gap in the expressibility between sparse training and dense training.
Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks. [paper]
- Torsten Hoefler, Dan Alistarh, Tal Ben-Nun, Nikoli Dryden, Alexandra Peste.
- Key Word: Sparsity; Survey.
- Digest
  We survey prior work on sparsity in deep learning and provide an extensive tutorial of sparsification for both inference and training. We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice. Our work distills ideas from more than 300 research papers and provides guidance to practitioners who wish to utilize sparsity today, as well as to researchers whose goal is to push the frontier forward.
A Unified Paths Perspective for Pruning at Initialization. [paper]
- Thomas Gebhart, Udit Saxena, Paul Schrater.
- Key Word: Lottery Ticket Hypothesis; Neural Tangent Kernel.
- Digest
  Leveraging recent theoretical approximations provided by the Neural Tangent Kernel, we unify a number of popular approaches for pruning at initialization under a single path-centric framework. We introduce the Path Kernel as the data-independent factor in a decomposition of the Neural Tangent Kernel and show the global structure of the Path Kernel can be computed efficiently. This Path Kernel decomposition separates the architectural effects from the data-dependent effects within the Neural Tangent Kernel, providing a means to predict the convergence dynamics of a network from its architecture alone.

Lottery Ticket Hypothesis: 2020

PHEW: Constructing Sparse Networks that Learn Fast and Generalize Well without Training Data. [paper] [code]
- Shreyas Malakarjun Patil, Constantine Dovrolis. ICLR 2021
- Key Word: Lottery Ticket Hypothesis; Neural Tangent Kernel.
- Digest
  Our work is based on a recently proposed decomposition of the Neural Tangent Kernel (NTK) that has decoupled the dynamics of the training process into a data-dependent component and an architecture-dependent kernel - the latter referred to as Path Kernel. That work has shown how to design sparse neural networks for faster convergence, without any training data, using the Synflow-L2 algorithm. We first show that even though Synflow-L2 is optimal in terms of convergence, for a given network density, it results in sub-networks with "bottleneck" (narrow) layers - leading to poor performance as compared to other data-agnostic methods that use the same number of parameters.
A Gradient Flow Framework For Analyzing Network Pruning. [paper] [code]
- Ekdeep Singh Lubana, Robert P. Dick. ICLR 2021
- Key Word: Lottery Ticket Hypothesis.
- Digest
  Recent network pruning methods focus on pruning models early-on in training. To estimate the impact of removing a parameter, these methods use importance measures that were originally designed to prune trained models. Despite lacking justification for their use early-on in training, such measures result in surprisingly low accuracy loss. To better explain this behavior, we develop a general framework that uses gradient flow to unify state-of-the-art importance measures through the norm of model parameters.
Sanity-Checking Pruning Methods: Random Tickets can Win the Jackpot. [paper] [code]
- Jingtong Su, Yihang Chen, Tianle Cai, Tianhao Wu, Ruiqi Gao, Liwei Wang, Jason D. Lee. NeurIPS 2020
- Key Word: Lottery Ticket Hypothesis.
- Digest
  We conduct sanity checks for the above beliefs on several recent unstructured pruning methods and surprisingly find that: (1) A set of methods which aims to find good subnetworks of the randomly-initialized network (which we call "initial tickets"), hardly exploits any information from the training data; (2) For the pruned networks obtained by these methods, randomly changing the preserved weights in each layer, while keeping the total number of preserved weights unchanged per layer, does not affect the final performance.
Pruning Neural Networks at Initialization: Why are We Missing the Mark? [paper]
- Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin. ICLR 2021
- Key Word: Lottery Ticket Hypothesis.
- Digest
  Recent work has explored the possibility of pruning neural networks at initialization. We assess proposals for doing so: SNIP (Lee et al., 2019), GraSP (Wang et al., 2020), SynFlow (Tanaka et al., 2020), and magnitude pruning. Although these methods surpass the trivial baseline of random pruning, they remain below the accuracy of magnitude pruning after training, and we endeavor to understand why. We show that, unlike pruning after training, randomly shuffling the weights these methods prune within each layer or sampling new initial values preserves or improves accuracy. As such, the per-weight pruning decisions made by these methods can be replaced by a per-layer choice of the fraction of weights to prune. This property suggests broader challenges with the underlying pruning heuristics, the desire to prune at initialization, or both.
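
  The layer-wise shuffling ablation described in this digest can be sketched in a few lines. This is a minimal illustration, not the authors' code: the mask layout (a dict of flat 0/1 lists), the layer names, and the function name `shuffle_mask_within_layers` are all assumptions made for the example.

```python
import random

def shuffle_mask_within_layers(masks, seed=0):
    """Return new masks where each layer's 0/1 entries are randomly permuted.

    `masks` maps a layer name to a flat list of 0/1 values (1 = keep).
    The per-layer count of kept weights is preserved; only positions change.
    """
    rng = random.Random(seed)
    shuffled = {}
    for name, mask in masks.items():
        permuted = list(mask)  # copy so the input is left untouched
        rng.shuffle(permuted)
        shuffled[name] = permuted
    return shuffled

# Hypothetical masks for a two-layer network.
masks = {"conv1": [1, 0, 0, 1, 1, 0], "fc": [0, 0, 1, 0]}
shuffled = shuffle_mask_within_layers(masks)

# The fraction of weights kept in each layer is unchanged by the shuffle.
layer_density = {name: sum(m) / len(m) for name, m in shuffled.items()}
```

  If accuracy after training is unaffected by this shuffle, the only information the pruning method contributed is the per-layer density, which is the point the digest makes.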
ESPN: Extremely Sparse Pruned Networks. [paper] [code]
- Minsu Cho, Ameya Joshi, Chinmay Hegde.
- Key Word: Lottery Ticket Hypothesis.
- Digest
  Deep neural networks are often highly overparameterized, prohibiting their use in compute-limited systems. However, a line of recent works has shown that the size of deep networks can be considerably reduced by identifying a subset of neuron indicators (or mask) that correspond to significant weights prior to training. We demonstrate that a simple iterative mask discovery method can achieve state-of-the-art compression of very deep networks. Our algorithm represents a hybrid of single-shot network pruning methods (such as SNIP) and Lottery-Ticket type approaches. We validate our approach on several datasets and outperform several existing pruning approaches in both test accuracy and compression ratio.
Logarithmic Pruning is All You Need. [paper]
- Laurent Orseau, Marcus Hutter, Omar Rivasplata. NeurIPS 2020
- Key Word: Lottery Ticket Hypothesis.
- Digest
  The Lottery Ticket Hypothesis is a conjecture that every large neural network contains a subnetwork that, when trained in isolation, achieves comparable performance to the large network. An even stronger conjecture has been proven recently: Every sufficiently overparameterized network contains a subnetwork that, at random initialization, but without training, achieves comparable accuracy to the trained large network. This latter result, however, relies on a number of strong assumptions and guarantees a polynomial factor on the size of the large network compared to the target function. In this work, we remove the most limiting assumptions of this previous work while providing significantly tighter bounds: the overparameterized network only needs a logarithmic factor (in all variables but depth) number of neurons per weight of the target subnetwork.
Exploring Weight Importance and Hessian Bias in Model Pruning. [paper]
- Mingchen Li, Yahya Sattar, Christos Thrampoulidis, Samet Oymak.
- Key Word: Lottery Ticket Hypothesis.
- Digest
  Model pruning is an essential procedure for building compact and computationally-efficient machine learning models. A key feature of a good pruning algorithm is that it accurately quantifies the relative importance of the model weights. While model pruning has a rich history, we still don't have a full grasp of the pruning mechanics even for relatively simple problems involving linear models or shallow neural nets. In this work, we provide a principled exploration of pruning by building on a natural notion of importance.
Progressive Skeletonization: Trimming more fat from a network at initialization. [paper] [code]
- Pau de Jorge, Amartya Sanyal, Harkirat S. Behl, Philip H.S. Torr, Gregory Rogez, Puneet K. Dokania. ICLR 2021
- Key Word: Lottery Ticket Hypothesis.
- Digest
  Recent studies have shown that skeletonization (pruning parameters) of networks at initialization provides all the practical benefits of sparsity both at inference and training time, while only marginally degrading their performance. However, we observe that beyond a certain level of sparsity (approx 95%), these approaches fail to preserve the network performance, and to our surprise, in many cases perform even worse than trivial random pruning. To this end, we propose an objective to find a skeletonized network with maximum foresight connection sensitivity (FORCE) whereby the trainability, in terms of connection sensitivity, of a pruned network is taken into consideration.
Pruning neural networks without any data by iteratively conserving synaptic flow. [paper] [code]
- Hidenori Tanaka, Daniel Kunin, Daniel L. K. Yamins, Surya Ganguli.
- Key Word: Lottery Ticket Hypothesis.
- Digest
  Recent works have identified, through an expensive sequence of training and pruning cycles, the existence of winning lottery tickets or sparse trainable subnetworks at initialization. This raises a foundational question: can we identify highly sparse trainable subnetworks at initialization, without ever training, or indeed without ever looking at the data? We provide an affirmative answer to this question through theory driven algorithm design.
Finding trainable sparse networks through Neural Tangent Transfer. [paper] [code]
- Tianlin Liu, Friedemann Zenke. ICML 2020
- Key Word: Lottery Ticket Hypothesis; Neural Tangent Kernel.
- Digest
  We introduce Neural Tangent Transfer, a method that instead finds trainable sparse networks in a label-free manner. Specifically, we find sparse networks whose training dynamics, as characterized by the neural tangent kernel, mimic those of dense networks in function space. Finally, we evaluate our label-agnostic approach on several standard classification tasks and show that the resulting sparse networks achieve higher classification performance while converging faster.
What is the State of Neural Network Pruning? [paper] [code]
- Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, John Guttag. MLSys 2020
- Key Word: Lottery Ticket Hypothesis; Survey.
- Digest
  Neural network pruning---the task of reducing the size of a network by removing parameters---has been the subject of a great deal of work in recent years. We provide a meta-analysis of the literature, including an overview of approaches to pruning and consistent findings in the literature. After aggregating results across 81 papers and pruning hundreds of models in controlled conditions, our clearest finding is that the community suffers from a lack of standardized benchmarks and metrics. This deficiency is substantial enough that it is hard to compare pruning techniques to one another or determine how much progress the field has made over the past three decades. To address this situation, we identify issues with current practices, suggest concrete remedies, and introduce ShrinkBench, an open-source framework to facilitate standardized evaluations of pruning methods.
Comparing Rewinding and Fine-tuning in Neural Network Pruning. [paper] [code]
- Alex Renda, Jonathan Frankle, Michael Carbin. ICLR 2020
- Key Word: Lottery Ticket Hypothesis.
- Digest
  We compare fine-tuning to alternative retraining techniques. Weight rewinding (as proposed by Frankle et al. (2019)) rewinds unpruned weights to their values from earlier in training and retrains them from there using the original training schedule. Learning rate rewinding (which we propose) trains the unpruned weights from their final values using the same learning rate schedule as weight rewinding. Both rewinding techniques outperform fine-tuning, forming the basis of a network-agnostic pruning algorithm that matches the accuracy and compression ratios of several more network-specific state-of-the-art techniques.
Good Subnetworks Provably Exist: Pruning via Greedy Forward Selection. [paper] [code]
- Mao Ye, Chengyue Gong, Lizhen Nie, Denny Zhou, Adam Klivans, Qiang Liu. ICML 2020
- Key Word: Lottery Ticket Hypothesis.
- Digest
  Recent empirical works show that large deep neural networks are often highly redundant and one can find much smaller subnetworks without a significant drop of accuracy. However, most existing methods of network pruning are empirical and heuristic, leaving it open whether good subnetworks provably exist, how to find them efficiently, and if network pruning can be provably better than direct training using gradient descent. We answer these problems positively by proposing a simple greedy selection approach for finding good subnetworks, which starts from an empty network and greedily adds important neurons from the large network.
The Early Phase of Neural Network Training. [paper] [code]
- Jonathan Frankle, David J. Schwab, Ari S. Morcos. ICLR 2020
- Key Word: Lottery Ticket Hypothesis.
- Digest
  We find that, within this framework, deep networks are not robust to reinitializing with random weights while maintaining signs, and that weight distributions are highly non-independent even after only a few hundred iterations.
Robust Pruning at Initialization. [paper]
- Soufiane Hayou, Jean-Francois Ton, Arnaud Doucet, Yee Whye Teh.
- Key Word: Lottery Ticket Hypothesis.
- Digest
  We provide a comprehensive theoretical analysis of Magnitude and Gradient based pruning at initialization and training of sparse architectures. This allows us to propose novel principled approaches which we validate experimentally on a variety of NN architectures.
Picking Winning Tickets Before Training by Preserving Gradient Flow. [paper] [code]
- Chaoqi Wang, Guodong Zhang, Roger Grosse. ICLR 2020
- Key Word: Lottery Ticket Hypothesis.
- Digest
  We aim to prune networks at initialization, thereby saving resources at training time as well. Specifically, we argue that efficient training requires preserving the gradient flow through the network. This leads to a simple but effective pruning criterion we term Gradient Signal Preservation (GraSP).
Lookahead: A Far-Sighted Alternative of Magnitude-based Pruning. [paper] [code]
- Sejun Park, Jaeho Lee, Sangwoo Mo, Jinwoo Shin. ICLR 2020
- Key Word: Lottery Ticket Hypothesis.
- Digest
  Magnitude-based pruning is one of the simplest methods for pruning neural networks. Despite its simplicity, magnitude-based pruning and its variants demonstrated remarkable performances for pruning modern architectures. Based on the observation that magnitude-based pruning indeed minimizes the Frobenius distortion of a linear operator corresponding to a single layer, we develop a simple pruning method, coined lookahead pruning, by extending the single layer optimization to a multi-layer optimization.
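
  The observation that magnitude-based pruning minimizes the single-layer Frobenius distortion can be checked by brute force on a tiny layer. The sketch below is only an illustration under assumed values: the weights, `num_keep`, and the helper names are hypothetical, and it does not implement lookahead pruning itself.

```python
from itertools import combinations
import math

def frobenius_distortion(weights, kept):
    """Frobenius norm of W - mask * W, where `kept` holds the retained indices."""
    return math.sqrt(sum(w * w for i, w in enumerate(weights) if i not in kept))

def magnitude_prune(weights, num_keep):
    """Keep the `num_keep` entries with the largest absolute value."""
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return set(order[:num_keep])

# Flattened weights of a hypothetical 2x3 layer.
weights = [0.9, -0.1, 0.4, -1.2, 0.05, 0.3]
num_keep = 3

kept_by_magnitude = magnitude_prune(weights, num_keep)
magnitude_distortion = frobenius_distortion(weights, kept_by_magnitude)

# Exhaustively search every mask that keeps the same number of weights.
best_distortion = min(
    frobenius_distortion(weights, set(candidate))
    for candidate in combinations(range(len(weights)), num_keep)
)
```

  Here `magnitude_distortion` equals `best_distortion`: no other mask of the same size distorts this single layer less. The digest's method starts from this single-layer view and extends the optimization across neighboring layers.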

Lottery Ticket Hypothesis: 2019

Linear Mode Connectivity and the Lottery Ticket Hypothesis. [paper]
- Jonathan Frankle, Gintare Karolina Dziugaite, Daniel M. Roy, Michael Carbin. ICML 2020
- Key Word: Lottery Ticket Hypothesis.
- Digest
  We study whether a neural network optimizes to the same, linearly connected minimum under different samples of SGD noise (e.g., random data order and augmentation). We find that standard vision models become stable to SGD noise in this way early in training. From then on, the outcome of optimization is determined to a linearly connected region. We use this technique to study iterative magnitude pruning (IMP), the procedure used by work on the lottery ticket hypothesis to identify subnetworks that could have trained in isolation to full accuracy.
What's Hidden in a Randomly Weighted Neural Network? [paper] [code]
- Vivek Ramanujan, Mitchell Wortsman, Aniruddha Kembhavi, Ali Farhadi, Mohammad Rastegari. CVPR 2020
- Key Word: Lottery Ticket Hypothesis; Neural Architecture Search; Weight Agnostic Neural Networks.
- Digest
  Hidden in a randomly weighted Wide ResNet-50 we show that there is a subnetwork (with random weights) that is smaller than, but matches the performance of a ResNet-34 trained on ImageNet. Not only do these "untrained subnetworks" exist, but we provide an algorithm to effectively find them.
Drawing Early-Bird Tickets: Towards More Efficient Training of Deep Networks. [paper] [code]
- Haoran You, Chaojian Li, Pengfei Xu, Yonggan Fu, Yue Wang, Xiaohan Chen, Richard G. Baraniuk, Zhangyang Wang, Yingyan Lin. ICLR 2020
- Key Word: Lottery Ticket Hypothesis.
- Digest
  We discover for the first time that the winning tickets can be identified at the very early training stage, which we term as early-bird (EB) tickets, via low-cost training schemes (e.g., early stopping and low-precision training) at large learning rates. Our finding of EB tickets is consistent with recently reported observations that the key connectivity patterns of neural networks emerge early.
Rigging the Lottery: Making All Tickets Winners. [paper] [code]
- Utku Evci, Trevor Gale, Jacob Menick, Pablo Samuel Castro, Erich Elsen. ICML 2020
- Key Word: Lottery Ticket Hypothesis.
- Digest
  We introduce a method to train sparse neural networks with a fixed parameter count and a fixed computational cost throughout training, without sacrificing accuracy relative to existing dense-to-sparse training methods. Our method updates the topology of the sparse network during training by using parameter magnitudes and infrequent gradient calculations. We show that this approach requires fewer floating-point operations (FLOPs) to achieve a given level of accuracy compared to prior techniques.
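
  A topology update driven by parameter magnitudes and gradients can be sketched as a drop-and-grow step on one flat layer. This is a simplified illustration rather than the authors' implementation: it assumes connections are dropped by smallest weight magnitude, grown by largest gradient magnitude among connections that were inactive before the drop, and initialized to zero when grown. The function name and all numbers are hypothetical.

```python
def update_topology(weights, mask, dense_grads, num_swap):
    """One drop-and-grow step on a flat layer; returns (new_weights, new_mask)."""
    active = [i for i, m in enumerate(mask) if m == 1]
    inactive = [i for i, m in enumerate(mask) if m == 0]

    # Drop: active connections with the smallest weight magnitude.
    drop = sorted(active, key=lambda i: abs(weights[i]))[:num_swap]
    # Grow: previously inactive connections with the largest gradient magnitude.
    grow = sorted(inactive, key=lambda i: abs(dense_grads[i]), reverse=True)[:num_swap]

    new_mask = list(mask)
    new_weights = list(weights)
    for i in drop:
        new_mask[i] = 0
        new_weights[i] = 0.0
    for i in grow:
        new_mask[i] = 1
        new_weights[i] = 0.0  # assumption: grown connections start at zero
    return new_weights, new_mask

# Hypothetical layer state; dense_grads would only be computed occasionally.
weights = [0.5, 0.0, -0.02, 0.0, 0.8, 0.0]
mask = [1, 0, 1, 0, 1, 0]
dense_grads = [0.1, -0.9, 0.3, 0.05, -0.2, 0.4]
new_weights, new_mask = update_topology(weights, mask, dense_grads, num_swap=1)
```

  The number of active connections is the same before and after the step, which is what keeps the parameter count fixed throughout training.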
The Difficulty of Training Sparse Neural Networks. [paper]
- Utku Evci, Fabian Pedregosa, Aidan Gomez, Erich Elsen.
- Key Word: Pruning.
- Digest
  We investigate the difficulties of training sparse neural networks and make new observations about optimization dynamics and the energy landscape within the sparse regime. Recent work has shown that sparse ResNet-50 architectures trained on ImageNet-2012 dataset converge to solutions that are significantly worse than those found by pruning. We show that, despite the failure of optimizers, there is a linear path with a monotonically decreasing objective from the initialization to the "good" solution.
A Signal Propagation Perspective for Pruning Neural Networks at Initialization. [paper] [code]
- Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, Philip H. S. Torr. ICLR 2020
- Key Word: Lottery Ticket Hypothesis; Mean Field Theory.
- Digest
  In this work, by noting connection sensitivity as a form of gradient, we formally characterize initialization conditions to ensure reliable connection sensitivity measurements, which in turn yields effective pruning results. Moreover, we analyze the signal propagation properties of the resulting pruned networks and introduce a simple, data-free method to improve their trainability.
One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers. [paper]
- Ari S. Morcos, Haonan Yu, Michela Paganini, Yuandong Tian. NeurIPS 2019
- Key Word: Lottery Ticket Hypothesis.
- Digest
  Perhaps surprisingly, we found that, within the natural images domain, winning ticket initializations generalized across a variety of datasets, including Fashion MNIST, SVHN, CIFAR-10/100, ImageNet, and Places365, often achieving performance close to that of winning tickets generated on the same dataset.
Deconstructing Lottery Tickets: Zeros, Signs, and the Supermask. [paper] [code]
- Hattie Zhou, Janice Lan, Rosanne Liu, Jason Yosinski. NeurIPS 2019
- Key Word: Lottery Ticket Hypothesis.
- Digest
  In this paper, we have studied how three components to LT-style network pruning—mask criterion, treatment of kept weights during retraining (mask-1 action), and treatment of pruned weights during retraining (mask-0 action)—come together to produce sparse and performant subnetworks.
The State of Sparsity in Deep Neural Networks. [paper] [code]
- Trevor Gale, Erich Elsen, Sara Hooker.
- Key Word: Lottery Ticket Hypothesis.
- Digest
  We rigorously evaluate three state-of-the-art techniques for inducing sparsity in deep neural networks on two large-scale learning tasks: Transformer trained on WMT 2014 English-to-German, and ResNet-50 trained on ImageNet. Across thousands of experiments, we demonstrate that complex techniques (Molchanov et al., 2017; Louizos et al., 2017b) shown to yield high compression rates on smaller datasets perform inconsistently, and that simple magnitude pruning approaches achieve comparable or better results.

Lottery Ticket Hypothesis: 2018

SNIP: Single-shot Network Pruning based on Connection Sensitivity. [paper] [code]
- Namhoon Lee, Thalaiyasingam Ajanthan, Philip H. S. Torr. ICLR 2019
- Key Word: Lottery Ticket Hypothesis.
- Digest
  In this work, we present a new approach that prunes a given network once at initialization prior to training. To achieve this, we introduce a saliency criterion based on connection sensitivity that identifies structurally important connections in the network for the given task.
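
  A saliency criterion based on connection sensitivity can be illustrated on a linear model with squared loss, where the gradient is easy to write by hand. In this sketch the sensitivity of connection j is taken to be |w_j * dL/dw_j|, normalized over all connections; the toy model, data, and function names are assumptions for illustration and are not taken from the paper's code.

```python
def connection_sensitivity(weights, inputs, targets):
    """Normalized |w_j * dL/dw_j| for a linear model with mean squared loss."""
    n = len(inputs)
    grads = [0.0] * len(weights)
    for x, y in zip(inputs, targets):
        error = sum(w * xi for w, xi in zip(weights, x)) - y
        for j, xj in enumerate(x):
            grads[j] += error * xj / n
    raw = [abs(w * g) for w, g in zip(weights, grads)]
    total = sum(raw)
    return [r / total for r in raw]

def single_shot_mask(scores, num_keep):
    """Keep the `num_keep` connections with the highest sensitivity."""
    top = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:num_keep]
    return [1 if j in top else 0 for j in range(len(scores))]

# Hypothetical initial weights and one mini-batch of data.
weights = [0.5, -0.3, 0.8, 0.1]
inputs = [[1.0, 0.0, 2.0, 0.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
targets = [1.0, 0.0, 1.0]

scores = connection_sensitivity(weights, inputs, targets)
mask = single_shot_mask(scores, num_keep=2)
```

  The mask is computed once, before any training step, from a single batch. The fourth input feature is always zero in this batch, so its connection gets zero sensitivity and is pruned regardless of its weight.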
The Lottery Ticket Hypothesis: Finding Sparse, Trainable Neural Networks. [paper] [code]
- Jonathan Frankle, Michael Carbin. ICLR 2019
- Key Word: Lottery Ticket Hypothesis.
- Digest
  We find that a standard pruning technique naturally uncovers subnetworks whose initializations made them capable of training effectively. Based on these results, we articulate the "lottery ticket hypothesis:" dense, randomly-initialized, feed-forward networks contain subnetworks ("winning tickets") that - when trained in isolation - reach test accuracy comparable to the original network in a similar number of iterations.
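
  The "standard pruning technique" is iterative magnitude pruning with a reset to the original initialization. The sketch below shows that loop on a toy masked linear regression. The model, data, hyperparameters, and function names are assumptions chosen so the example runs with the standard library only; the paper's experiments use deep networks.

```python
import random

def train(weights, mask, data, steps=200, lr=0.05):
    """Gradient descent on mean squared loss for a masked linear model."""
    w = [wi * m for wi, m in zip(weights, mask)]
    for _ in range(steps):
        grads = [0.0] * len(w)
        for x, y in data:
            error = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j, xj in enumerate(x):
                grads[j] += error * xj / len(data)
        w = [(wi - lr * g) * m for wi, g, m in zip(w, grads, mask)]
    return w

def prune_smallest(trained, mask, fraction):
    """Remove `fraction` of the surviving weights with the smallest magnitude."""
    alive = [j for j, m in enumerate(mask) if m == 1]
    num_prune = int(len(alive) * fraction)
    to_prune = sorted(alive, key=lambda j: abs(trained[j]))[:num_prune]
    return [0 if j in to_prune else m for j, m in enumerate(mask)]

def find_ticket(init_weights, data, rounds, fraction):
    """Train, prune, and reset to the original initialization each round."""
    mask = [1] * len(init_weights)
    for _ in range(rounds):
        trained = train(init_weights, mask, data)  # always restart from init
        mask = prune_smallest(trained, mask, fraction)
    return mask

rng = random.Random(0)
true_w = [2.0, 0.0, 0.0, 0.0, -3.0, 0.0, 0.0, 0.0]  # only two useful inputs
data = []
for _ in range(64):
    x = [rng.gauss(0, 1) for _ in true_w]
    data.append((x, sum(t * xi for t, xi in zip(true_w, x))))
init_weights = [rng.gauss(0, 0.1) for _ in true_w]

mask = find_ticket(init_weights, data, rounds=2, fraction=0.5)
ticket = train(init_weights, mask, data)  # train the sparse subnetwork in isolation
```

  The key detail is that every round retrains from `init_weights` rather than from the pruned, trained weights. The surviving mask together with those initial values is what the hypothesis calls a winning ticket.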

Interactions with Neuroscience

Interactions with Neuroscience: 2022

Multilevel development of cognitive abilities in an artificial neural network. [paper]
- Konstantin Volzhenin, Jean-Pierre Changeux, Guillaume Dumas. PNAS
- Key Word: Global Neuronal Workspace.
- Digest
  We introduce a three-level computational model of information processing and acquisition of cognitive abilities. We propose minimal architectural requirements to build these levels, and how the parameters affect their performance and relationships. The first sensorimotor level handles local nonconscious processing, here during a visual classification task. The second level or cognitive level globally integrates the information from multiple local processors via long-ranged connections and synthesizes it in a global, but still nonconscious, manner. The third and cognitively highest level handles the information globally and consciously. It is based on the global neuronal workspace (GNW) theory and is referred to as the conscious level.
Deep Problems with Neural Network Models of Human Vision. [paper]
- Jeffrey S Bowers, Gaurav Malhotra, Marin Dujmović, Milton Llera Montero, Christian Tsvetkov, Valerio Biscione, Guillermo Puebla, Federico G Adolfi, John Hummel, Rachel Flood Heaton, Benjamin Evans, Jeff Mitchell, Ryan Blything.
- Key Word: Brain-Score; Computational Neuroscience; Convolutional Neural Networks; Representational Similarity Analysis.
- Digest
  We show that the good prediction on these datasets may be mediated by DNNs that share little overlap with biological vision. More problematically, we show that DNNs account for almost no results from psychological research. This contradicts the common claim that DNNs are good, let alone the best, models of human object recognition.
Reassessing hierarchical correspondences between brain and deep networks through direct interface. [paper]
- Nicholas J Sexton, Bradley C Love. Science Advances
- Key Word: Neural Interfacing Analysis; Shared Neural Variance.
- Digest
  Functional correspondences between deep convolutional neural networks (DCNNs) and the mammalian visual system support a hierarchical account in which successive stages of processing contain ever higher-level information. However, these correspondences between brain and model activity involve shared, not task-relevant, variance. We propose a stricter account of correspondence: If a DCNN layer corresponds to a brain region, then replacing model activity with brain activity should successfully drive the DCNN’s object recognition decision. Using this approach on three datasets, we found that all regions along the ventral visual stream best corresponded with later model layers, indicating that all stages of processing contained higher-level information about object category.
Wiring Up Vision: Minimizing Supervised Synaptic Updates Needed to Produce a Primate Ventral Stream. [paper]
- Franziska Geiger, Martin Schrimpf, Tiago Marques, James J. DiCarlo. ICLR 2022
- Key Word: Computational Neuroscience; Primate Visual Ventral Stream.
- Digest
  We develop biologically-motivated initialization and training procedures to train models with 200x fewer synaptic updates (epochs x labeled images x weights) while maintaining 80% of brain predictivity on a set of neural and behavioral benchmarks.
Curriculum learning as a tool to uncover learning principles in the brain. [paper]
- Daniel R. Kepple, Rainer Engelken, Kanaka Rajan. ICLR 2022
- Key Word: Curriculum Learning; Neuroscience.
- Digest
  We present a novel approach to use curricula to identify principles by which a system learns. Previous work in curriculum learning has focused on how curricula can be designed to improve learning of a model on particular tasks. We consider the inverse problem: what can a curriculum tell us about how a learning system acquired a task? Using recurrent neural networks (RNNs) and models of common experimental neuroscience tasks, we demonstrate that curricula can be used to differentiate learning principles using target-based and representation-based loss functions as use cases.
Building Transformers from Neurons and Astrocytes. [paper]
- Leo Kozachkov, Ksenia V. Kastanenka, Dmitry Krotov.
- Key Word: Transformers; Glia; Astrocytes.
- Digest
  In this work we hypothesize that neuron-astrocyte networks can naturally implement the core computation performed by the Transformer block in AI. The omnipresence of astrocytes in almost any brain area may explain the success of Transformers across a diverse set of information domains and computational tasks.
High-performing neural network models of visual cortex benefit from high latent dimensionality. [paper]
- Eric Elmoznino, Michael F. Bonner.
- Key Word: Dimensionality and Alignment in Computational Brain Models.
- Digest
  The prevailing view holds that optimal DNNs compress their representations onto low-dimensional manifolds to achieve invariance and robustness, which suggests that better models of visual cortex should have low-dimensional geometries. Surprisingly, we found a strong trend in the opposite direction—neural networks with high-dimensional image manifolds tend to have better generalization performance when predicting cortical responses to held-out stimuli in both monkey electrophysiology and human fMRI data.
On the Principles of Parsimony and Self-Consistency for the Emergence of Intelligence. [paper]
- Yi Ma, Doris Tsao, Heung-Yeung Shum.
- Key Word: Intelligence; Parsimony; Self-Consistency; Rate Reduction.
- Digest
  Ten years into the revival of deep networks and artificial intelligence, we propose a theoretical framework that sheds light on understanding deep networks within a bigger picture of Intelligence in general. We introduce two fundamental principles, Parsimony and Self-consistency, that we believe to be cornerstones for the emergence of Intelligence, artificial or natural. While these two principles have rich classical roots, we argue that they can be stated anew in entirely measurable and computable ways.
Painful intelligence: What AI can tell us about human suffering. [paper]
- Aapo Hyvärinen.
- Key Word: Neuroscience.
- Digest
  This book uses the modern theory of artificial intelligence (AI) to understand human suffering or mental pain. Both humans and sophisticated AI agents process information about the world in order to achieve goals and obtain rewards, which is why AI can be used as a model of the human brain and mind. This book intends to make the theory accessible to a relatively general audience, requiring only some relevant scientific background. The book starts with the assumption that suffering is mainly caused by frustration. Frustration means the failure of an agent (whether AI or human) to achieve a goal or a reward it wanted or expected.
The developmental trajectory of object recognition robustness: children are like small adults but unlike big deep neural networks. [paper] [code]
- Lukas S. Huber, Robert Geirhos, Felix A. Wichmann.
- Key Word: Object Recognition; Out-of-Distribution Generalization; Children.
- Digest
  We find, first, that already 4–6 year-olds showed remarkable robustness to image distortions and outperform DNNs trained on ImageNet. Second, we estimated the number of “images” children have been exposed to during their lifetime. Compared to various DNNs, children's high robustness requires relatively little data. Third, when recognizing objects children—like adults but unlike DNNs—rely heavily on shape but not on texture cues. Together our results suggest that the remarkable robustness to distortions emerges early in the developmental trajectory of human object recognition and is unlikely the result of a mere accumulation of experience with distorted visual input.
Finding Biological Plausibility for Adversarially Robust Features via Metameric Tasks. [paper] [code]
- Anne Harrington, Arturo Deza. ICLR 2022
- Key Word: Adversarial Robustness; Peripheral Computation; Psychophysics.
- Digest
  To understand how adversarially robust optimizations/representations compare to human vision, we performed a psychophysics experiment using a metamer task where we evaluated how well human observers could distinguish between images synthesized to match adversarially robust representations compared to non-robust representations and a texture synthesis model of peripheral vision. We found that the discriminability of robust representation and texture model images decreased to near chance performance as stimuli were presented farther in the periphery.

Interactions with Neuroscience: 2021

Relating transformers to models and neural representations of the hippocampal formation. [paper]
- James C.R. Whittington, Joseph Warren, Timothy E.J. Behrens. ICLR 2022
- Key Word: Transformers; Hippocampus; Cortex.
- Digest
  We show that transformers, when equipped with recurrent position encodings, replicate the precisely tuned spatial representations of the hippocampal formation; most notably place and grid cells. Furthermore, we show that this result is no surprise since it is closely related to current hippocampal models from neuroscience.
Partial success in closing the gap between human and machine vision. [paper] [code]
- Robert Geirhos, Kantharaju Narayanappa, Benjamin Mitzkus, Tizian Thieringer, Matthias Bethge, Felix A. Wichmann, Wieland Brendel. NeurIPS 2021
- Key Word: Out-of-Distribution Generalization; Psychophysical Experiments.
- Digest
  A few years ago, the first CNN surpassed human performance on ImageNet. However, it soon became clear that machines lack robustness on more challenging test cases, a major obstacle towards deploying machines "in the wild" and towards obtaining better computational models of human visual perception. Here we ask: Are we making progress in closing the gap between human and machine vision? To answer this question, we tested human observers on a broad range of out-of-distribution (OOD) datasets, recording 85,120 psychophysical trials across 90 participants.
Does enhanced shape bias improve neural network robustness to common corruptions? [paper]
- Chaithanya Kumar Mummadi, Ranjitha Subramaniam, Robin Hutmacher, Julien Vitay, Volker Fischer, Jan Hendrik Metzen. ICLR 2021
- Key Word: Shape-Texture; Robustness.
- Digest
  We perform a systematic study of different ways of composing inputs based on natural images, explicit edge information, and stylization. While stylization is essential for achieving high corruption robustness, we do not find a clear correlation between shape bias and robustness. We conclude that the data augmentation caused by style-variation accounts for the improved corruption robustness and increased shape bias is only a byproduct.

Interactions with Neuroscience: 2020

Simulating a Primary Visual Cortex at the Front of CNNs Improves Robustness to Image Perturbations. [paper]
- Joel Dapello, Tiago Marques, Martin Schrimpf, Franziska Geiger, David Cox, James J. DiCarlo. NeurIPS 2020
- Key Word: Robustness; V1 Model.
- Digest
  Current state-of-the-art object recognition models are largely based on convolutional neural network (CNN) architectures, which are loosely inspired by the primate visual system. However, these CNNs can be fooled by imperceptibly small, explicitly crafted perturbations, and struggle to recognize objects in corrupted images that are easily recognized by humans. Here, by making comparisons with primate neural data, we first observed that CNN models with a neural hidden layer that better matches primate primary visual cortex (V1) are also more robust to adversarial attacks. Inspired by this observation, we developed VOneNets, a new class of hybrid CNN vision models. Each VOneNet contains a fixed weight neural network front-end that simulates primate V1, called the VOneBlock, followed by a neural network back-end adapted from current CNN vision models.
On 1/n neural representation and robustness. [paper] [code]
- Josue Nassar, Piotr Aleksander Sokol, SueYeon Chung, Kenneth D. Harris, Il Memming Park. NeurIPS 2020
- Key Word: Adversarial Robustness; 1/n Power Law.
- Digest
  We investigate how the structure of a neural representation affects robustness to perturbations by juxtaposing experimental results regarding the covariance spectrum of neural representations in the mouse V1 (Stringer et al.) with artificial neural networks. We use adversarial robustness to probe Stringer et al.'s theory regarding the causal role of a 1/n covariance spectrum. We empirically investigate the benefits such a neural code confers in neural networks, and illuminate its role in multi-layer architectures. Our results show that imposing the experimentally observed structure on artificial neural networks makes them more robust to adversarial attacks. Moreover, our findings complement the existing theory relating wide neural networks to kernel methods, by showing the role of intermediate representations.
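The 1/n covariance spectrum mentioned in this digest can be checked on any matrix of activations. The sketch below is a minimal illustration, not the authors' code: it assumes a `(samples, features)` array of representations, and the function names `covariance_spectrum` and `power_law_exponent` are hypothetical. It fits a straight line to the eigenvalue spectrum in log-log space, so an exponent near 1 corresponds to a 1/n decay.

```python
import numpy as np


def covariance_spectrum(representations):
    """Eigenvalues of the feature covariance matrix, sorted in descending order."""
    x = np.asarray(representations, dtype=float)
    x = x - x.mean(axis=0, keepdims=True)        # center each feature
    cov = x.T @ x / (x.shape[0] - 1)             # (features, features)
    eigvals = np.linalg.eigvalsh(cov)[::-1]      # eigvalsh returns ascending order
    return np.clip(eigvals, 0.0, None)           # remove tiny negative round-off


def power_law_exponent(eigvals):
    """Least-squares estimate of alpha in eigval_n ~ n^(-alpha)."""
    eigvals = np.asarray(eigvals, dtype=float)
    ranks = np.arange(1, len(eigvals) + 1)
    keep = eigvals > 0                           # log is undefined at zero
    slope, _ = np.polyfit(np.log(ranks[keep]), np.log(eigvals[keep]), 1)
    return -slope


if __name__ == "__main__":
    # Synthetic activations whose feature variances follow 1/n (an assumption
    # made only for this demo).
    rng = np.random.default_rng(0)
    target = 1.0 / np.arange(1, 51)
    samples = rng.normal(size=(20000, 50)) * np.sqrt(target)
    alpha = power_law_exponent(covariance_spectrum(samples))
    print(f"estimated exponent: {alpha:.2f}")
```

A plain log-log fit over all ranks is a simplification; in practice the fitted range of ranks is usually restricted, and that choice affects the estimate.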
Shape-Texture Debiased Neural Network Training. [paper] [code]
- Yingwei Li, Qihang Yu, Mingxing Tan, Jieru Mei, Peng Tang, Wei Shen, Alan Yuille, Cihang Xie. ICLR 2021
- Key Word: Shape-Texture; Robustness.
- Digest
  Shape and texture are two prominent and complementary cues for recognizing objects. Nonetheless, Convolutional Neural Networks are often biased towards either texture or shape, depending on the training dataset. Our ablation shows that such bias degrades model performance. Motivated by this observation, we develop a simple algorithm for shape-texture debiased learning. To prevent models from exclusively attending to a single cue in representation learning, we augment training data with images with conflicting shape and texture information (e.g., an image of chimpanzee shape but with lemon texture) and, most importantly, provide the corresponding supervisions from shape and texture simultaneously.
Beyond accuracy: quantifying trial-by-trial behaviour of CNNs and humans by measuring error consistency. [paper] [code]
- Robert Geirhos, Kristof Meding, Felix A. Wichmann.
- Key Word: Error Consistency.
- Digest
  Here we introduce trial-by-trial error consistency, a quantitative analysis for measuring whether two decision making systems systematically make errors on the same inputs. Making consistent errors on a trial-by-trial basis is a necessary condition if we want to ascertain similar processing strategies between decision makers.
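Error consistency can be computed as Cohen's kappa over per-trial correctness: the observed fraction of trials on which both systems are right or both are wrong, corrected for the agreement expected from their accuracies alone. The sketch below is a minimal illustration under that reading; the function name and the toy data are hypothetical and not taken from the authors' code.

```python
import numpy as np


def error_consistency(correct_a, correct_b):
    """Cohen's kappa on trial-by-trial correctness of two decision makers.

    Each argument is a sequence of booleans (or 0/1), one entry per trial,
    with both sequences referring to the same trials in the same order.
    """
    a = np.asarray(correct_a, dtype=bool)
    b = np.asarray(correct_b, dtype=bool)
    if a.shape != b.shape:
        raise ValueError("both observers must be evaluated on the same trials")
    p_a, p_b = a.mean(), b.mean()                      # accuracies
    c_obs = np.mean(a == b)                            # observed agreement
    c_exp = p_a * p_b + (1 - p_a) * (1 - p_b)          # agreement by chance
    if np.isclose(c_exp, 1.0):
        return float("nan")                            # kappa is undefined here
    return float((c_obs - c_exp) / (1.0 - c_exp))


if __name__ == "__main__":
    human = [1, 1, 0, 0, 1, 0, 1, 1]   # toy data, 1 = correct response
    model = [1, 1, 0, 1, 1, 0, 0, 1]
    print(error_consistency(human, model))
```

A value of 0 means the two systems agree no more than their accuracies alone would predict; values near 1 mean they err on the same inputs.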
Biologically Inspired Mechanisms for Adversarial Robustness. [paper]
- Manish V. Reddy, Andrzej Banburski, Nishka Pant, Tomaso Poggio. NeurIPS 2020
- Key Word: Robustness; Retinal Fixations.
- Digest
  A convolutional neural network strongly robust to adversarial perturbations at reasonable computational and performance cost has not yet been demonstrated. The primate visual ventral stream seems to be robust to small perturbations in visual stimuli but the underlying mechanisms that give rise to this robust perception are not understood. In this work, we investigate the role of two biologically plausible mechanisms in adversarial robustness. We demonstrate that the non-uniform sampling performed by the primate retina and the presence of multiple receptive fields with a range of receptive field sizes at each eccentricity improve the robustness of neural networks to small adversarial perturbations.
Five Points to Check when Comparing Visual Perception in Humans and Machines. [paper] [code]
- Christina M. Funke, Judy Borowski, Karolina Stosio, Wieland Brendel, Thomas S. A. Wallis, Matthias Bethge. JOV
- Key Word: Model Comparison.
- Digest
  With the rise of machines to human-level performance in complex recognition tasks, a growing amount of work is directed towards comparing information processing in humans and machines. These studies are an exciting chance to learn about one system by studying the other. Here, we propose ideas on how to design, conduct and interpret experiments such that they adequately support the investigation of mechanisms when comparing human and machine perception. We demonstrate and apply these ideas through three case studies.
Shortcut Learning in Deep Neural Networks. [paper] [code]
- Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, Felix A. Wichmann. Nature Machine Intelligence
- Key Word: Out-of-Distribution Generalization; Survey.
- Digest
  Deep learning has triggered the current rise of artificial intelligence and is the workhorse of today's machine intelligence. Numerous success stories have rapidly spread all over science, industry and society, but its limitations have only recently come into focus. In this perspective we seek to distil how many of deep learning's problems can be seen as different symptoms of the same underlying problem: shortcut learning. Shortcuts are decision rules that perform well on standard benchmarks but fail to transfer to more challenging testing conditions, such as real-world scenarios. Related issues are known in Comparative Psychology, Education and Linguistics, suggesting that shortcut learning may be a common characteristic of learning systems, biological and artificial alike. Based on these observations, we develop a set of recommendations for model interpretation and benchmarking, highlighting recent advances in machine learning to improve robustness and transferability from the lab to real-world applications.

Interactions with Neuroscience: 2019

White Noise Analysis of Neural Networks. [paper] [code]
- Ali Borji, Sikun Lin. ICLR 2020
- Key Word: Spike-Triggered Analysis.
- Digest
  A white noise analysis of modern deep neural networks is presented to unveil their biases at the whole network level or the single neuron level. Our analysis is based on two popular and related methods in psychophysics and neurophysiology, namely classification images and spike-triggered analysis.
The Origins and Prevalence of Texture Bias in Convolutional Neural Networks. [paper]
- Katherine L. Hermann, Ting Chen, Simon Kornblith. NeurIPS 2020
- Key Word: Shape-Texture; Robustness.
- Digest
  Recent work has indicated that, unlike humans, ImageNet-trained CNNs tend to classify images by texture rather than by shape. How pervasive is this bias, and where does it come from? We find that, when trained on datasets of images with conflicting shape and texture, CNNs learn to classify by shape at least as easily as by texture. What factors, then, produce the texture bias in CNNs trained on ImageNet? Different unsupervised training objectives and different architectures have small but significant and largely independent effects on the level of texture bias. However, all objectives and architectures still lead to models that make texture-based classification decisions a majority of the time, even if shape information is decodable from their hidden representations. The effect of data augmentation is much larger.
Learning From Brains How to Regularize Machines. [paper]
- Zhe Li, Wieland Brendel, Edgar Y. Walker, Erick Cobos, Taliah Muhammad, Jacob Reimer, Matthias Bethge, Fabian H. Sinz, Xaq Pitkow, Andreas S. Tolias. NeurIPS 2019
- Key Word: Neural Representation Similarity.
- Digest
  Despite impressive performance on numerous visual tasks, Convolutional Neural Networks (CNNs) --- unlike brains --- are often highly sensitive to small perturbations of their input, e.g. adversarial noise leading to erroneous decisions. We propose to regularize CNNs using large-scale neuroscience data to learn more robust neural features in terms of representational similarity. We presented natural images to mice and measured the responses of thousands of neurons from cortical visual areas.
A Unified Theory of Early Visual Representations from Retina to Cortex through Anatomically Constrained Deep CNNs. [paper] [code]
- Jack Lindsey, Samuel A. Ocko, Surya Ganguli, Stephane Deny. ICLR 2019
- Key Word: Visual System; Convolutional Neural Networks; Efficient Coding; Retina.
- Digest
  Neural representations differ sharply across the first stages of visual processing: retinal ganglion cells have center-surround receptive fields, whereas receptive fields in primary visual cortex (V1) are sharply oriented. There is currently no unified theory explaining these differences in representations across layers. Here, using a deep convolutional neural network trained on image recognition as a model of the visual system, we show that such differences in representation can emerge as a direct consequence of different neural resource constraints on the retinal and cortical networks, and we find a single model from which both geometries spontaneously emerge at the appropriate stages of visual processing. The key constraint is a reduced number of neurons at the retinal output, consistent with the anatomy of the optic nerve as a stringent bottleneck.

Interactions with Neuroscience: 2018

ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. [paper] [code]
- Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, Wieland Brendel. ICLR 2019
- Key Word: Shape-Texture; Psychophysical Experiments.
- Digest
  Convolutional Neural Networks (CNNs) are commonly thought to recognise objects by learning increasingly complex representations of object shapes. Some recent studies suggest a more important role of image textures. We here put these conflicting hypotheses to a quantitative test by evaluating CNNs and human observers on images with a texture-shape cue conflict. We show that ImageNet-trained CNNs are strongly biased towards recognising textures rather than shapes, which is in stark contrast to human behavioural evidence and reveals fundamentally different classification strategies.
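On cue-conflict images, the bias is commonly summarized as a shape bias score: among predictions that match either the shape category or the texture category, the fraction that match the shape. The sketch below assumes that definition; the function name and the toy labels are hypothetical.

```python
def shape_bias(predictions, shape_labels, texture_labels):
    """Fraction of shape decisions among all shape-or-texture decisions.

    Trials where shape and texture share a category are skipped because they
    carry no cue conflict; predictions matching neither cue are ignored.
    """
    shape_hits = 0
    texture_hits = 0
    for pred, shape, texture in zip(predictions, shape_labels, texture_labels):
        if shape == texture:
            continue                       # not a cue-conflict image
        if pred == shape:
            shape_hits += 1
        elif pred == texture:
            texture_hits += 1
    total = shape_hits + texture_hits
    if total == 0:
        return float("nan")                # no decision matched either cue
    return shape_hits / total


if __name__ == "__main__":
    # Toy example: two shape decisions and two texture decisions.
    preds = ["cat", "elephant", "clock", "dog"]
    shapes = ["cat", "cat", "clock", "bear"]
    textures = ["elephant", "elephant", "bottle", "dog"]
    print(shape_bias(preds, shapes, textures))
```

A score of 1.0 means every counted decision followed shape, and 0.0 means every counted decision followed texture.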
Generalisation in humans and deep neural networks. [paper] [code]
- Robert Geirhos, Carlos R. Medina Temme, Jonas Rauber, Heiko H. Schütt, Matthias Bethge, Felix A. Wichmann. NeurIPS 2018
- Key Word: Robustness.
- Digest
  We compare the robustness of humans and current convolutional deep neural networks (DNNs) on object recognition under twelve different types of image degradations. First, using three well known DNNs (ResNet-152, VGG-19, GoogLeNet) we find the human visual system to be more robust to nearly all of the tested image manipulations, and we observe progressively diverging classification error-patterns between humans and DNNs when the signal gets weaker. Secondly, we show that DNNs trained directly on distorted images consistently surpass human performance on the exact distortion types they were trained on, yet they display extremely poor generalisation abilities when tested on other distortion types.

Interactions with Neuroscience: 2017

Comparing deep neural networks against humans: object recognition when the signal gets weaker. [paper] [code]
- Robert Geirhos, David H. J. Janssen, Heiko H. Schütt, Jonas Rauber, Matthias Bethge, Felix A. Wichmann. NeurIPS 2018
- Key Word: Model Comparison; Robustness.
- Digest
  Human visual object recognition is typically rapid and seemingly effortless, as well as largely independent of viewpoint and object orientation. Until very recently, animate visual systems were the only ones capable of this remarkable computational feat. This has changed with the rise of a class of computer vision algorithms called deep neural networks (DNNs) that achieve human-level classification performance on object recognition tasks. Furthermore, a growing number of studies report similarities in the way DNNs and the human visual system process objects, suggesting that current DNNs may be good models of human visual object recognition. Yet there clearly exist important architectural and processing differences between state-of-the-art DNNs and the primate visual system. The potential behavioural consequences of these differences are not well understood. We aim to address this issue by comparing human and DNN generalisation abilities towards image degradations.

Information Bottleneck

Information Bottleneck: 2022

Sparsity-Inducing Categorical Prior Improves Robustness of the Information Bottleneck. [paper]
- Anirban Samaddar, Sandeep Madireddy, Prasanna Balaprakash
- Key Word: Information Bottleneck; Robustness.
- Digest
  We present a novel sparsity-inducing spike-slab prior that uses sparsity as a mechanism to provide flexibility that allows each data point to learn its own dimension distribution. In addition, it provides a mechanism to learn a joint distribution of the latent variable and the sparsity. Thus, unlike other approaches, it can account for the full uncertainty in the latent space.

Information Bottleneck: 2021

Information Bottleneck Disentanglement for Identity Swapping. [paper]
- Gege Gao, Huaibo Huang, Chaoyou Fu, Zhaoyang Li, Ran He. CVPR 2021
- Key Word: Information Bottleneck; Identity Swapping.
- Digest
  We propose a novel information disentangling and swapping network, called InfoSwap, to extract the most expressive information for identity representation from a pre-trained face recognition model. The key insight of our method is to formulate the learning of disentangled representations as optimizing an information bottleneck trade-off, in terms of finding an optimal compression of the pre-trained latent features.
PAC-Bayes Information Bottleneck. [paper] [code]
- Zifeng Wang, Shao-Lun Huang, Ercan E. Kuruoglu, Jimeng Sun, Xi Chen, Yefeng Zheng. ICLR 2022
- Key Word: Information Bottleneck; PAC-Bayes.
- Digest
  There have been a series of theoretical works trying to derive non-vacuous bounds for NNs. Recently, the compression of information stored in weights (IIW) is proved to play a key role in NNs generalization based on the PAC-Bayes theorem. However, no solution of IIW has ever been provided, which builds a barrier for further investigation of the IIW's property and its potential in practical deep learning. In this paper, we propose an algorithm for the efficient approximation of IIW. Then, we build an IIW-based information bottleneck on the trade-off between accuracy and information complexity of NNs, namely PIB.
Information Bottleneck: Exact Analysis of (Quantized) Neural Networks. [paper] [code]
- Stephan Sloth Lorenzen, Christian Igel, Mads Nielsen. ICLR 2022
- Key Word: Information Bottleneck; Quantization.
- Digest
  We study the information bottleneck (IB) principle in settings where mutual information (MI) is non-trivial and can be computed exactly. We monitor the dynamics of quantized neural networks, that is, we discretize the whole deep learning system so that no approximation is required when computing the MI. This allows us to quantify the information flow without measurement errors.
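When every variable takes finitely many discrete values, as in a fully quantized network, the mutual information of the empirical joint distribution can be computed by counting, with no binning or density estimation. The sketch below illustrates that idea only; it is not the authors' implementation, and the function name is hypothetical. Inputs can be any hashable values, for example tuples of quantized activations.

```python
import math
from collections import Counter


def mutual_information_bits(xs, ys):
    """Mutual information (in bits) of the empirical joint distribution of xs and ys."""
    xs, ys = list(xs), list(ys)
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))           # counts of (x, y) pairs
    px = Counter(xs)                       # marginal counts of x
    py = Counter(ys)                       # marginal counts of y
    mi = 0.0
    for (x, y), count in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written in terms of counts
        mi += (count / n) * math.log2(count * n / (px[x] * py[y]))
    return mi


if __name__ == "__main__":
    # Hypothetical quantized hidden states (as tuples) and class labels.
    hidden = [(0, 1), (0, 1), (1, 0), (1, 0)]
    labels = [0, 0, 1, 1]
    print(mutual_information_bits(hidden, labels))
```

The result is exact for the empirical distribution of the given dataset; it says nothing about inputs outside that dataset.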
Compressive Visual Representations. [paper] [code]
- Kuang-Huei Lee, Anurag Arnab, Sergio Guadarrama, John Canny, Ian Fischer. NeurIPS
- Key Word: Self-Supervision; Contrastive Learning; Conditional Entropy Bottleneck; Out-of-Distribution Generalization.
- Digest
  We hypothesize that adding explicit information compression to these algorithms yields better and more robust representations. We verify this by developing SimCLR and BYOL formulations compatible with the Conditional Entropy Bottleneck (CEB) objective, allowing us to both measure and control the amount of compression in the learned representation, and observe their impact on downstream tasks. Furthermore, we explore the relationship between Lipschitz continuity and compression, showing a tractable lower bound on the Lipschitz constant of the encoders we learn.
Invariance Principle Meets Information Bottleneck for Out-of-Distribution Generalization. [paper] [code]
- Kartik Ahuja, Ethan Caballero, Dinghuai Zhang, Yoshua Bengio, Ioannis Mitliagkas, Irina Rish. NeurIPS 2021
- Key Word: Information Bottleneck; Out-of-Distribution Generalization; Invariant Risk Minimization.
- Digest
  We revisit the fundamental assumptions in linear regression tasks, where invariance-based approaches were shown to provably generalize OOD. In contrast to the linear regression tasks, we show that for linear classification tasks we need much stronger restrictions on the distribution shifts, or otherwise OOD generalization is impossible. Furthermore, even with appropriate restrictions on distribution shifts in place, we show that the invariance principle alone is insufficient. We prove that a form of the information bottleneck constraint along with invariance helps address the key failures when invariant features capture all the information about the label and also retains the existing success when they do not.
Perturbation Theory for the Information Bottleneck. [paper]
- Vudtiwat Ngampruetikorn, David J. Schwab. NeurIPS 2021
- Key Word: Information Bottleneck; Perturbation Theory.
- Digest
  Extracting relevant information from data is crucial for all forms of learning. The information bottleneck (IB) method formalizes this, offering a mathematically precise and conceptually appealing framework for understanding learning phenomena. However, the nonlinearity of the IB problem makes it computationally expensive and analytically intractable in general. Here we derive a perturbation theory for the IB method and report the first complete characterization of the learning onset, the limit of maximum relevant information per bit extracted from data.
A Critical Review of Information Bottleneck Theory and its Applications to Deep Learning. [paper]
- Mohammad Ali Alomrani.
- Key Word: Information Bottleneck; Survey.
- Digest
  A known information-theoretic method called the information bottleneck theory has emerged as a promising approach to better understand the learning dynamics of neural networks. In principle, IB theory models learning as a trade-off between the compression of the data and the retention of information. The goal of this survey is to provide a comprehensive review of IB theory covering its information-theoretic roots and the recently proposed applications to understand deep learning models.

Information Bottleneck: 2020

Graph Information Bottleneck. [paper] [code]
- Tailin Wu, Hongyu Ren, Pan Li, Jure Leskovec. NeurIPS 2020
- Key Word: Information Bottleneck; Graph Neural Networks.
- Digest
  We introduce Graph Information Bottleneck (GIB), an information-theoretic principle that optimally balances expressiveness and robustness of the learned representation of graph-structured data. Inheriting from the general Information Bottleneck (IB), GIB aims to learn the minimal sufficient representation for a given task by maximizing the mutual information between the representation and the target, and simultaneously constraining the mutual information between the representation and the input data.
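The trade-off described in this digest is usually written as a single Lagrangian. The form below is the generic IB objective rather than the paper's graph-specific one; the symbols are assumptions for illustration: $X$ is the input, $Y$ the target, $Z$ the learned representation, and $\beta$ a trade-off coefficient.

```latex
\max_{p(z \mid x)} \; I(Z; Y) \;-\; \beta \, I(Z; X), \qquad \beta > 0
```

The first term rewards representations that are informative about the target, and the second penalizes information retained about the input, so a larger $\beta$ favors stronger compression.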
Learning Optimal Representations with the Decodable Information Bottleneck. [paper] [code]
- Yann Dubois, Douwe Kiela, David J. Schwab, Ramakrishna Vedantam. NeurIPS 2020
- Key Word: Information Bottleneck.
- Digest
  We propose the Decodable Information Bottleneck (DIB) that considers information retention and compression from the perspective of the desired predictive family. As a result, DIB gives rise to representations that are optimal in terms of expected test performance and can be estimated with guarantees. Empirically, we show that the framework can be used to enforce a small generalization gap on downstream classifiers and to predict the generalization ability of neural networks.
Concept Bottleneck Models. [paper] [code]
- Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, Percy Liang. ICML 2020
- Key Word: Information Bottleneck.
- Digest
  We seek to learn models that we can interact with using high-level concepts: if the model did not think there was a bone spur in the x-ray, would it still predict severe arthritis? State-of-the-art models today do not typically support the manipulation of concepts like "the existence of bone spurs", as they are trained end-to-end to go directly from raw input (e.g., pixels) to output (e.g., arthritis severity). We revisit the classic idea of first predicting concepts that are provided at training time, and then using these concepts to predict the label. By construction, we can intervene on these concept bottleneck models by editing their predicted concept values and propagating these changes to the final prediction.
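The two-stage structure and the intervention step described above can be sketched with a toy linear model. Everything in the block is hypothetical (weights, concept count, function names); it only illustrates how editing a predicted concept propagates to the final prediction.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def predict_concepts(x, w_c, b_c):
    """Stage 1: map raw input features to concept probabilities."""
    return sigmoid(w_c @ x + b_c)


def predict_label(concepts, w_y, b_y):
    """Stage 2: map concept values to the final output (a score here)."""
    return float(w_y @ concepts + b_y)


def intervene(concepts, edits):
    """Return a copy of the concepts with selected entries overwritten.

    `edits` maps a concept index to the value supplied by a human expert.
    """
    edited = np.array(concepts, dtype=float, copy=True)
    for index, value in edits.items():
        edited[index] = value
    return edited


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=4)                       # toy input features
    w_c, b_c = rng.normal(size=(3, 4)), np.zeros(3)
    w_y, b_y = np.array([2.0, -1.0, 0.5]), 0.0

    concepts = predict_concepts(x, w_c, b_c)
    print("original prediction:", predict_label(concepts, w_y, b_y))

    # Counterfactual: what if concept 0 (say, "bone spur present") were absent?
    edited = intervene(concepts, {0: 0.0})
    print("after intervention: ", predict_label(edited, w_y, b_y))
```

Because the label depends on the input only through the concept vector, the edit changes the prediction without touching the first stage.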
On Information Plane Analyses of Neural Network Classifiers -- A Review. [paper]
- Bernhard C. Geiger. TNNLS
- Key Word: Information Bottleneck; Survey.
- Digest
  We review the current literature concerned with information plane analyses of neural network classifiers. While the underlying information bottleneck theory and the claim that information-theoretic compression is causally linked to generalization are plausible, empirical evidence was found to be both supporting and conflicting. We review this evidence together with a detailed analysis of how the respective information quantities were estimated.
On the Information Bottleneck Problems: Models, Connections, Applications and Information Theoretic Views. [paper]
- Abdellatif Zaidi, Inaki Estella Aguerri, Shlomo Shamai. Entropy
- Key Word: Information Bottleneck; Survey.
- Digest
  This tutorial paper focuses on the variants of the bottleneck problem taking an information theoretic perspective and discusses practical methods to solve it, as well as its connection to coding and learning aspects. The intimate connections of this setting to remote source-coding under logarithmic loss distortion measure, information combining, common reconstruction, the Wyner-Ahlswede-Korner problem, the efficiency of investment information, as well as generalization, variational inference, representation learning, autoencoders, and others are highlighted.
Phase Transitions for the Information Bottleneck in Representation Learning. [paper]
- Tailin Wu, Ian Fischer. ICLR 2020
- Key Word: Information Bottleneck.
- Digest
  Our work provides the first theoretical formula to address IB phase transitions in the most general setting. In addition, we present an algorithm for iteratively finding the IB phase transition points.
Restricting the Flow: Information Bottlenecks for Attribution. [paper] [code]
- Karl Schulz, Leon Sixt, Federico Tombari, Tim Landgraf. ICLR 2020
- Key Word: Information Bottleneck; Attribution.
- Digest
  We adapt the information bottleneck concept for attribution. By adding noise to intermediate feature maps we restrict the flow of information and can quantify (in bits) how much information image regions provide.

Information Bottleneck: 2019

Learnability for the Information Bottleneck. [paper]
- Tailin Wu, Ian Fischer, Isaac L. Chuang, Max Tegmark. UAI 2019
- Key Word: Information Bottleneck.
- Digest
  We presented theoretical results for predicting the onset of learning, and have shown that it is determined by the conspicuous subset of the training examples. We gave a practical algorithm for predicting the transition as well as discovering this subset, and showed that those predictions are accurate, even in cases of extreme label noise.

Information Bottleneck: 2018

On the Information Bottleneck Theory of Deep Learning. [paper] [code]
- Andrew Michael Saxe, Yamini Bansal, Joel Dapello, Madhu Advani, Artemy Kolchinsky, Brendan Daniel Tracey, David Daniel Cox. ICLR 2018
- Key Word: Information Bottleneck.
- Digest
  This submission explores [recent theoretical work](https://arxiv.org/abs/1703.00810) by Shwartz-Ziv and Tishby on explaining the generalization ability of deep networks. The paper gives counter-examples that suggest aspects of the theory might not be relevant for all neural networks.

Information Bottleneck: 2017

Emergence of Invariance and Disentanglement in Deep Representations. [paper]
- Alessandro Achille, Stefano Soatto. JMLR
- Key Word: PAC-Bayes; Information Bottleneck.
- Digest
  Using established principles from Statistics and Information Theory, we show that invariance to nuisance factors in a deep neural network is equivalent to information minimality of the learned representation, and that stacking layers and injecting noise during training naturally bias the network towards learning invariant representations. We then decompose the cross-entropy loss used during training and highlight the presence of an inherent overfitting term. We propose regularizing the loss by bounding such a term in two equivalent ways: One with a Kullback-Leibler term, which relates to a PAC-Bayes perspective; the other using the information in the weights as a measure of complexity of a learned model, yielding a novel Information Bottleneck for the weights.
Information-theoretic analysis of generalization capability of learning algorithms. [paper]
- Aolin Xu, Maxim Raginsky. NeurIPS 2017
- Key Word: Information Bottleneck.
- Digest
  We derive upper bounds on the generalization error of a learning algorithm in terms of the mutual information between its input and output. The bounds provide an information-theoretic understanding of generalization in learning problems, and give theoretical guidelines for striking the right balance between data fit and generalization by controlling the input-output mutual information. We propose a number of methods for this purpose, among which are algorithms that regularize the ERM algorithm with relative entropy or with random noise.
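For reference, the central bound of this line of work has the following shape. The notation is an assumption for illustration: $S = (Z_1, \dots, Z_n)$ is a training set of $n$ i.i.d. samples from $\mu$, $W$ is the hypothesis returned by the learning algorithm $P_{W \mid S}$, and the loss $\ell(w, Z)$ is assumed to be $\sigma$-subgaussian under $\mu$ for every $w$.

```latex
\left| \mathrm{gen}(\mu, P_{W \mid S}) \right|
\;\le\; \sqrt{\frac{2\sigma^{2}}{n} \, I(S; W)}
```

The bound tightens as the output hypothesis reveals less about the training set, which is the motivation for regularizers that limit $I(S; W)$, such as injecting noise into the output of the learning algorithm.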
Opening the Black Box of Deep Neural Networks via Information. [paper]
- Ravid Shwartz-Ziv, Naftali Tishby.
- Key Word: Information Bottleneck.
- Digest
  [Previous work](https://arxiv.org/abs/1503.02406) proposed to analyze DNNs in the *Information Plane*; i.e., the plane of the Mutual Information values that each layer preserves on the input and output variables. They suggested that the goal of the network is to optimize the Information Bottleneck (IB) tradeoff between compression and prediction, successively, for each layer. In this work we follow up on this idea and demonstrate the effectiveness of the Information-Plane visualization of DNNs.

Neural Tangent Kernel

Neural Tangent Kernel: 2022

Transfer Learning with Kernel Methods. [paper]
- Adityanarayanan Radhakrishnan, Max Ruiz Luyten, Neha Prasad, Caroline Uhler.
- Key Word: Transfer Learning; Neural Tangent Kernel.
- Digest
  We propose a transfer learning framework for kernel methods by projecting and translating the source model to the target task. We demonstrate the effectiveness of our framework in applications to image classification and virtual drug screening. In particular, we show that transferring modern kernels trained on large-scale image datasets can result in substantial performance increase as compared to using the same kernel trained directly on the target task.
Neural Tangent Kernel: A Survey. [paper]
- Eugene Golikov, Eduard Pokonechnyy, Vladimir Korviakov.
- Key Word: Neural Tangent Kernel; Survey.
- Digest
  A seminal work [Jacot et al., 2018] demonstrated that training a neural network under specific parameterization is equivalent to performing a particular kernel method as width goes to infinity. This equivalence opened a promising direction for applying the results of the rich literature on kernel methods to neural nets which were much harder to tackle. The present survey covers key results on kernel convergence as width goes to infinity, finite-width corrections, applications, and a discussion of the limitations of the corresponding method.
Limitations of the NTK for Understanding Generalization in Deep Learning. [paper]
- Nikhil Vyas, Yamini Bansal, Preetum Nakkiran.
- Key Word: Neural Tangent Kernel.
- Digest
  In this work, we study NTKs through the lens of scaling laws, and demonstrate that they fall short of explaining important aspects of neural network generalization. In particular, we demonstrate realistic settings where finite-width neural networks have significantly better data scaling exponents as compared to their corresponding empirical and infinite NTKs at initialization. This reveals a more fundamental difference between the real networks and NTKs, beyond just a few percentage points of test accuracy. Further, we show that even if the empirical NTK is allowed to be pre-trained on a constant number of samples, the kernel scaling does not catch up to the neural network scaling. Finally, we show that the empirical NTK continues to evolve throughout most of the training, in contrast with prior work which suggests that it stabilizes after a few epochs of training. Altogether, our work establishes concrete limitations of the NTK approach in understanding generalization of real networks on natural datasets.
Fast Finite Width Neural Tangent Kernel. [paper] [code]
- Roman Novak, Jascha Sohl-Dickstein, Samuel S. Schoenholz. ICML 2022
- Key Word: Neural Tangent Kernel.
- Digest
  In the infinite width limit, the NTK can sometimes be computed analytically and is useful for understanding training and generalization of NN architectures. At finite widths, the NTK is also used to better initialize NNs, compare the conditioning across models, perform architecture search, and do meta-learning. Unfortunately, the finite width NTK is notoriously expensive to compute, which severely limits its practical utility. We perform the first in-depth analysis of the compute and memory requirements for NTK computation in finite width networks. Leveraging the structure of neural networks, we further propose two novel algorithms that change the exponent of the compute and memory requirements of the finite width NTK, dramatically improving efficiency.
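To make the object concrete, the finite-width (empirical) NTK between inputs x and x' is the inner product of the parameter gradients of the network output at those inputs. The sketch below computes it naively for a toy two-layer tanh network by forming the full Jacobian, which is exactly the expensive route the paper aims to avoid. The architecture, scaling, and function names are assumptions for illustration, not the paper's algorithms.

```python
import numpy as np


def forward(X, W, a):
    """f(x) = a . tanh(W x) / sqrt(m) for each row x of X."""
    m = W.shape[0]
    return np.tanh(X @ W.T) @ a / np.sqrt(m)


def jacobian(X, W, a):
    """Gradient of f(x) with respect to all parameters, one row per input.

    Columns are ordered as [a (m entries), W flattened row-major (m*d entries)].
    """
    m = W.shape[0]
    act = np.tanh(X @ W.T)                          # (n, m)
    grad_a = act / np.sqrt(m)                       # df/da_j
    back = a * (1.0 - act ** 2) / np.sqrt(m)        # (n, m), chain rule through tanh
    grad_W = back[:, :, None] * X[:, None, :]       # (n, m, d), df/dW_jk
    return np.concatenate([grad_a, grad_W.reshape(len(X), -1)], axis=1)


def empirical_ntk(X1, X2, W, a):
    """Theta[i, j] = <grad_theta f(X1[i]), grad_theta f(X2[j])>."""
    return jacobian(X1, W, a) @ jacobian(X2, W, a).T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(5, 3))                     # 5 inputs of dimension 3
    W = rng.normal(size=(64, 3))                    # hidden width 64
    a = rng.normal(size=64)
    K = empirical_ntk(X, X, W, a)
    print(K.shape, np.allclose(K, K.T))
```

The memory cost of this version grows with the number of inputs times the number of parameters, which is why it does not scale to realistic networks.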
On the Generalization Power of the Overfitted Three-Layer Neural Tangent Kernel Model. [paper]
- Peizhong Ju, Xiaojun Lin, Ness B. Shroff.
- Key Word: Neural Tangent Kernel.
- Digest
  We study the generalization performance of overparameterized 3-layer NTK models. We show that, for a specific set of ground-truth functions (which we refer to as the "learnable set"), the test error of the overfitted 3-layer NTK is upper bounded by an expression that decreases with the number of neurons of the two hidden layers. Different from 2-layer NTK where there exists only one hidden-layer, the 3-layer NTK involves interactions between two hidden-layers. Our upper bound reveals that, between the two hidden-layers, the test error descends faster with respect to the number of neurons in the second hidden-layer (the one closer to the output) than with respect to that in the first hidden-layer (the one closer to the input).
Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks. [paper]
- Blake Bordelon, Cengiz Pehlevan.
- Key Word: Neural Tangent Kernel; Mean Field Theory.
- Digest
  We analyze feature learning in infinite width neural networks trained with gradient flow through a self-consistent dynamical field theory. We construct a collection of deterministic dynamical order parameters which are inner-product kernels for hidden unit activations and gradients in each layer at pairs of time points, providing a reduced description of network activity through training.

Neural Tangent Kernel: 2021

Neural Tangent Generalization Attacks. [paper] [code]
- Chia-Hung Yuan, Shan-Hung Wu. ICML 2021
- Key Word: Neural Tangent Kernel; Poisoning Attacks.
- Digest
  We study the generalization attacks against DNNs, where an attacker aims to slightly modify training data in order to spoil the training process such that a trained network lacks generalizability. These attacks can be performed by data owners and protect data from unexpected use. However, there is currently no efficient generalization attack against DNNs due to the complexity of a bilevel optimization involved. We propose the Neural Tangent Generalization Attack (NTGA) that, to the best of our knowledge, is the first work enabling clean-label, black-box generalization attack against DNNs.
On the Equivalence between Neural Network and Support Vector Machine. [paper] [code]
- Yilan Chen, Wei Huang, Lam M. Nguyen, Tsui-Wei Weng. NeurIPS 2021
- Key Word: Neural Tangent Kernel; Support Vector Machine.
- Digest
  We prove the equivalence between neural network (NN) and support vector machine (SVM), specifically, the infinitely wide NN trained by soft margin loss and the standard soft margin SVM with NTK trained by subgradient descent. Our main theoretical results include establishing the equivalence between NN and a broad family of L2 regularized kernel machines (KMs) with finite-width bounds, which cannot be handled by prior work, and showing that every finite-width NN trained by such regularized loss functions is approximately a KM.
An Empirical Study of Neural Kernel Bandits. [paper] [code]
- Michal Lisicki, Arash Afkanpour, Graham W. Taylor.
- Key Word: Neural Tangent Kernel.
- Digest
  We propose to directly apply neural kernel (NK)-induced distributions to guide an upper confidence bound or Thompson sampling-based policy. We show that NK bandits achieve state-of-the-art performance on highly non-linear structured data. Furthermore, we analyze practical considerations such as training frequency and model partitioning.
A Neural Tangent Kernel Perspective of GANs. [paper] [code]
- Jean-Yves Franceschi, Emmanuel de Bézenac, Ibrahim Ayed, Mickaël Chen, Sylvain Lamprier, Patrick Gallinari. ICML 2021
- Key Word: Neural Tangent Kernel; Generative Adversarial Networks.
- Digest
  We propose a novel theoretical framework of analysis for Generative Adversarial Networks (GANs). We start by pointing out a fundamental flaw in previous theoretical analyses that leads to ill-defined gradients for the discriminator. We overcome this issue which impedes a principled study of GAN training, solving it within our framework by taking into account the discriminator's architecture. To this end, we leverage the theory of infinite-width neural networks for the discriminator via its Neural Tangent Kernel. We characterize the trained discriminator for a wide range of losses and establish general differentiability properties of the network.
Reverse Engineering the Neural Tangent Kernel. [paper] [code]
- James B. Simon, Sajant Anand, Michael R. DeWeese.
- Key Word: Neural Tangent Kernel.
- Digest
  The development of methods to guide the design of neural networks is an important open challenge for deep learning theory. As a paradigm for principled neural architecture design, we propose the translation of high-performing kernels, which are better-understood and amenable to first-principles design, into equivalent network architectures, which have superior efficiency, flexibility, and feature learning. To this end, we constructively prove that, with just an appropriate choice of activation function, any positive-semidefinite dot-product kernel can be realized as either the conjugate or neural tangent kernel of a fully-connected neural network with only one hidden layer.
Out-of-Distribution Generalization in Kernel Regression. [paper] [code]
- Abdulkadir Canatar, Blake Bordelon, Cengiz Pehlevan. NeurIPS 2021
- Key Word: Out-of-Distribution Generalization; Neural Tangent Kernel.
- Digest
  We study generalization in kernel regression when the training and test distributions are different using methods from statistical physics. Using the replica method, we derive an analytical formula for the out-of-distribution generalization error applicable to any kernel and real datasets. We identify an overlap matrix that quantifies the mismatch between distributions for a given kernel as a key determinant of generalization performance under distribution shift.
FL-NTK: A Neural Tangent Kernel-based Framework for Federated Learning Convergence Analysis. [paper]
- Baihe Huang, Xiaoxiao Li, Zhao Song, Xin Yang. ICML 2021
- Key Word: Federated Learning; Neural Tangent Kernel.
- Digest
  This paper presents a new class of convergence analysis for federated learning (FL), the Federated Learning Neural Tangent Kernel (FL-NTK), which corresponds to overparameterized ReLU neural networks trained by gradient descent in FL and is inspired by the analysis in Neural Tangent Kernel (NTK). Theoretically, FL-NTK converges to a global-optimal solution at a linear rate with properly tuned learning parameters. Furthermore, with proper distributional assumptions, FL-NTK can also achieve good generalization.
Random Features for the Neural Tangent Kernel. [paper]
- Insu Han, Haim Avron, Neta Shoham, Chaewon Kim, Jinwoo Shin.
- Key Word: Neural Tangent Kernel; Random Features.
- Digest
  We propose an efficient feature map construction of the NTK of a fully-connected ReLU network, which enables us to apply it to large-scale datasets. We combine random features of the arc-cosine kernels with a sketching-based algorithm that can run in time linear in both the number of data points and the input dimension. We show that the dimension of the resulting features is much smaller than that of other baseline feature map constructions achieving comparable error bounds, both in theory and in practice.
Explaining Neural Scaling Laws. [paper] [code]
- Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, Utkarsh Sharma. ICLR 2022
- Key Word: Scaling Laws; Neural Tangent Kernel.
- Digest
  We propose a theory that explains and connects the power-law scaling of test loss with training set size and with model size. We identify variance-limited and resolution-limited scaling behavior for both dataset and model size, for a total of four scaling regimes. The variance-limited scaling follows simply from the existence of a well-behaved infinite data or infinite width limit, while the resolution-limited regime can be explained by positing that models are effectively resolving a smooth data manifold.

Neural Tangent Kernel: 2020

Mathematical Models of Overparameterized Neural Networks. [paper] [code]
- Cong Fang, Hanze Dong, Tong Zhang. Proceedings of the IEEE
- Key Word: Neural Tangent Kernel; Mean-Field Theory.
- Digest
  Practitioners know that overparameterized neural networks are easy to train, and the past few years have seen important theoretical developments in their analysis. In particular, it was shown that such systems behave like convex systems under various restricted settings, such as for two-layer NNs, and when learning is restricted locally in the so-called neural tangent kernel space around specialized initializations. This paper discusses some of this recent progress, which has led to a significantly better understanding of neural networks. We focus on the analysis of two-layer neural networks, and explain the key mathematical models, with their algorithmic implications.
Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel. [paper]
- Stanislav Fort, Gintare Karolina Dziugaite, Mansheej Paul, Sepideh Kharaghani, Daniel M. Roy, Surya Ganguli. NeurIPS 2020
- Key Word: Neural Tangent Kernel.
- Digest
  In suitably initialized wide networks, small learning rates transform deep neural networks (DNNs) into neural tangent kernel (NTK) machines, whose training dynamics is well-approximated by a linear weight expansion of the network at initialization. Standard training, however, diverges from its linearization in ways that are poorly understood. We study the relationship between the training dynamics of nonlinear deep networks, the geometry of the loss landscape, and the time evolution of a data-dependent NTK.
Increasing Depth Leads to U-Shaped Test Risk in Over-parameterized Convolutional Networks. [paper]
- Eshaan Nichani, Adityanarayanan Radhakrishnan, Caroline Uhler.
- Key Word: Convolutional Neural Tangent Kernel.
- Digest
  We demonstrate that the test risk of over-parameterized convolutional networks is a U-shaped curve (i.e. monotonically decreasing, then increasing) with increasing depth. We first provide empirical evidence for this phenomenon via image classification experiments using both ResNets and the convolutional neural tangent kernel (CNTK). We then present a novel linear regression framework for characterizing the impact of depth on test risk, and show that increasing depth leads to a U-shaped test risk for the linear CNTK.
Finite Versus Infinite Neural Networks: an Empirical Study. [paper] [code]
- Jaehoon Lee, Samuel S. Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, Jascha Sohl-Dickstein. NeurIPS 2020
- Key Word: Neural Tangent Kernel.
- Digest
  We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully-connected finite-width networks, but underperform convolutional finite width networks; neural network Gaussian process (NNGP) kernels frequently outperform neural tangent (NT) kernels; centered and ensembled finite networks have reduced posterior variance and behave more similarly to infinite networks; weight decay and the use of a large learning rate break the correspondence between finite and infinite networks; the NTK parameterization outperforms the standard parameterization for finite width networks; diagonal regularization of kernels acts similarly to early stopping; floating point precision limits kernel performance beyond a critical dataset size; regularized ZCA whitening improves accuracy; finite network performance depends non-monotonically on width in ways not captured by double descent phenomena; equivariance of CNNs is only beneficial for narrow networks far from the kernel regime.
Bayesian Deep Ensembles via the Neural Tangent Kernel. [paper] [code]
- Bobby He, Balaji Lakshminarayanan, Yee Whye Teh.
- Key Word: Neural Tangent Kernel.
- Digest
  We explore the link between deep ensembles and Gaussian processes (GPs) through the lens of the Neural Tangent Kernel (NTK): a recent development in understanding the training dynamics of wide neural networks (NNs). Previous work has shown that even in the infinite width limit, when NNs become GPs, there is no GP posterior interpretation to a deep ensemble trained with squared error loss. We introduce a simple modification to standard deep ensembles training, through addition of a computationally-tractable, randomised and untrainable function to each ensemble member, that enables a posterior interpretation in the infinite width limit.
The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks. [paper]
- Wei Hu, Lechao Xiao, Ben Adlam, Jeffrey Pennington. NeurIPS 2020
- Key Word: Neural Tangent Kernel.
- Digest
  We show that the common perception of neural networks as complex black-box functions that are difficult to understand can be completely false in the early phase of learning. In particular, we formally prove that, for a class of well-behaved input distributions, the early-time learning dynamics of a two-layer fully-connected neural network can be mimicked by training a simple linear model on the inputs.
When Do Neural Networks Outperform Kernel Methods? [paper] [code]
- Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, Andrea Montanari. NeurIPS 2020
- Key Word: Neural Tangent Kernel.
- Digest
  For which tasks do neural networks (NNs) outperform reproducing kernel Hilbert space (RKHS) methods? If covariates are nearly isotropic, RKHS methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation. Here we show that this curse of dimensionality becomes milder if the covariates display the same low-dimensional structure as the target function, and we precisely characterize this tradeoff. Building on these results, we present the spiked covariates model that can capture in a unified framework both behaviors observed in earlier work.
A Generalized Neural Tangent Kernel Analysis for Two-layer Neural Networks. [paper]
- Zixiang Chen, Yuan Cao, Quanquan Gu, Tong Zhang. NeurIPS 2020
- Key Word: Neural Tangent Kernel; Mean Field Theory.
- Digest
  We provide a generalized neural tangent kernel analysis and show that noisy gradient descent with weight decay can still exhibit a "kernel-like" behavior. This implies that the training loss converges linearly up to a certain accuracy. We also establish a novel generalization error bound for two-layer neural networks trained by noisy gradient descent with weight decay.

Neural Tangent Kernel: 2019

Disentangling Trainability and Generalization in Deep Neural Networks. [paper]
- Lechao Xiao, Jeffrey Pennington, Samuel S. Schoenholz. ICML 2020
- Key Word: Neural Tangent Kernel.
- Digest
  We characterize when a network is trainable and how well it generalizes in the limit of very wide and very deep networks, for which the analysis simplifies considerably. For wide networks, the trajectory under gradient descent is governed by the Neural Tangent Kernel (NTK), and for deep networks the NTK itself maintains only weak data dependence.
Simple and Effective Regularization Methods for Training on Noisily Labeled Data with Generalization Guarantee. [paper]
- Wei Hu, Zhiyuan Li, Dingli Yu. ICLR 2020
- Key Word: Neural Tangent Kernel; Regularization.
- Digest
  This paper proposes and analyzes two simple and intuitive regularization methods: (i) regularization by the distance between the network parameters to initialization, and (ii) adding a trainable auxiliary variable to the network output for each training example. Theoretically, we prove that gradient descent training with either of these two methods leads to a generalization guarantee on the clean data distribution despite being trained using noisy labels.
On Exact Computation with an Infinitely Wide Neural Net. [paper] [code]
- Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, Ruosong Wang. NeurIPS 2019
- Key Word: Neural Tangent Kernel.
- Digest
  The current paper gives the first efficient exact algorithm for computing the extension of NTK to convolutional neural nets, which we call Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm.
Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation. [paper]
- Greg Yang.
- Key Word: Neural Tangent Kernel.
- Digest
  Several recent trends in machine learning theory and practice, from the design of state-of-the-art Gaussian Process to the convergence analysis of deep neural nets (DNNs) under stochastic gradient descent (SGD), have found it fruitful to study wide random neural networks. Central to these approaches are certain scaling limits of such networks. We unify these results by introducing a notion of a straightline tensor program that can express most neural network computations, and we characterize its scaling limit when its tensors are large and randomized.

Neural Tangent Kernel: 2018

A Convergence Theory for Deep Learning via Over-Parameterization. [paper]
- Zeyuan Allen-Zhu, Yuanzhi Li, Zhao Song. ICML 2019
- Key Word: Stochastic Gradient Descent; Neural Tangent Kernel.
- Digest
  We prove why stochastic gradient descent (SGD) can find global minima on the training objective of DNNs in polynomial time. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: polynomial in L, the number of layers and in n, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting.
Neural Tangent Kernel: Convergence and Generalization in Neural Networks. [paper]
- Arthur Jacot, Franck Gabriel, Clément Hongler. NeurIPS 2018
- Key Word: Neural Tangent Kernel.
- Digest
  We prove that the evolution of an ANN during training can also be described by a kernel: during gradient descent on the parameters of an ANN, the network function (which maps input vectors to output vectors) follows the kernel gradient of the functional cost (which is convex, in contrast to the parameter cost) w.r.t. a new kernel: the Neural Tangent Kernel (NTK).
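The kernel description above can be checked numerically on a toy model. The following is a minimal sketch, not the authors' code: it assumes a two-layer tanh network with 1/sqrt(m) output scaling, and the names `forward`, `grad`, and `empirical_ntk` are hypothetical helpers introduced only for illustration. It builds the empirical NTK as the Gram matrix of parameter gradients and compares the actual change in the outputs after one small gradient descent step with the first-order kernel prediction, `-lr * K @ residual`.

```python
import math
import random


def forward(theta, x, m):
    """Two-layer net f(x) = (1/sqrt(m)) * sum_j a_j * tanh(w_j . x)."""
    d = len(x)
    total = 0.0
    for j in range(m):
        pre = sum(theta[j * d + k] * x[k] for k in range(d))
        total += theta[m * d + j] * math.tanh(pre)
    return total / math.sqrt(m)


def grad(theta, x, m):
    """Gradient of f(x) with respect to the flat parameter vector theta."""
    d = len(x)
    scale = 1.0 / math.sqrt(m)
    g = [0.0] * len(theta)
    for j in range(m):
        pre = sum(theta[j * d + k] * x[k] for k in range(d))
        t = math.tanh(pre)
        a_j = theta[m * d + j]
        for k in range(d):
            g[j * d + k] = scale * a_j * (1.0 - t * t) * x[k]  # df/dw_jk
        g[m * d + j] = scale * t  # df/da_j
    return g


def empirical_ntk(theta, xs, m):
    """Gram matrix K[i][j] = <grad f(x_i), grad f(x_j)>."""
    grads = [grad(theta, x, m) for x in xs]
    return [[sum(u * v for u, v in zip(gi, gj)) for gj in grads] for gi in grads]


rng = random.Random(0)
m, d = 64, 3  # hidden width and input dimension (illustrative choices)
theta = [rng.gauss(0.0, 1.0) for _ in range(m * d + m)]
xs = [[1.0, 0.0, 0.5], [0.0, 1.0, -0.5], [-1.0, 0.5, 0.0], [0.3, -0.7, 0.2]]
ys = [1.0, -1.0, 0.5, 0.0]

K = empirical_ntk(theta, xs, m)
f0 = [forward(theta, x, m) for x in xs]
residual = [f - y for f, y in zip(f0, ys)]

# One gradient descent step on the loss 0.5 * sum_i (f(x_i) - y_i)^2.
lr = 1e-5
grads = [grad(theta, x, m) for x in xs]
theta_new = [
    p - lr * sum(r * g[i] for r, g in zip(residual, grads))
    for i, p in enumerate(theta)
]

# Actual change in outputs versus the first-order kernel prediction.
actual = [forward(theta_new, x, m) - f for x, f in zip(xs, f0)]
predicted = [
    -lr * sum(K[k][i] * residual[i] for i in range(len(xs)))
    for k in range(len(xs))
]
```

For a small step the two lists agree up to second-order terms in `lr`. The infinite-width result in the paper is stronger: there the kernel stays fixed during training, which this finite-width sketch does not attempt to show.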

Others

Others: 2022

Instance-Dependent Generalization Bounds via Optimal Transport. [paper]
- Songyan Hou, Parnian Kassraie, Anastasis Kratsios, Jonas Rothfuss, Andreas Krause.
- Key Word: Generalization Bounds; Optimal Transport; Distribution Shifts.
- Digest
  We propose a novel optimal transport interpretation of the generalization problem. This allows us to derive instance-dependent generalization bounds that depend on the local Lipschitz regularity of the learned prediction function in the data space. Therefore, our bounds are agnostic to the parametrization of the model and work well when the number of training samples is much smaller than the number of parameters. With small modifications, our approach yields accelerated rates for data on low-dimensional manifolds, and guarantees under distribution shifts. We empirically analyze our generalization bounds for neural networks, showing that the bound values are meaningful and capture the effect of popular regularization methods during training.
Augmentation Invariant Manifold Learning. [paper]
- Shulei Wang.
- Key Word: Manifold Learning; Data Augmentation.
- Digest
  We develop a statistical framework on a low-dimension product manifold to theoretically understand why the unlabeled augmented data can lead to useful data representation. Under this framework, we propose a new representation learning method called augmentation invariant manifold learning and develop the corresponding loss function, which can work with a deep neural network to learn data representations.
The Curious Case of Benign Memorization. [paper]
- Sotiris Anagnostidis, Gregor Bachmann, Lorenzo Noci, Thomas Hofmann.
- Key Word: Memorization; Data Augmentation.
- Digest
  We show that under training protocols that include data augmentation, neural networks learn to memorize entirely random labels in a benign way, i.e. they learn embeddings that lead to highly non-trivial performance under nearest neighbour probing. We demonstrate that deep models have the surprising ability to separate noise from signal by distributing the task of memorization and feature learning to different layers.
Provably Learning Diverse Features in Multi-View Data with Midpoint Mixup. [paper]
- Muthu Chidambaram, Xiang Wang, Chenwei Wu, Rong Ge.
- Key Word: Mixup; Feature Learning.
- Digest
  We try to explain some of the success of Mixup from a feature learning perspective. We focus our attention on classification problems in which each class may have multiple associated features (or views) that can be used to predict the class correctly. Our main theoretical results demonstrate that, for a non-trivial class of data distributions with two features per class, training a 2-layer convolutional network using empirical risk minimization can lead to learning only one feature for almost all classes while training with a specific instantiation of Mixup succeeds in learning both features for every class.
A PAC-Bayesian Generalization Bound for Equivariant Networks. [paper]
- Arash Behboodi, Gabriele Cesa, Taco Cohen. NeurIPS 2022
- Key Word: PAC-Bayes; Equivariant Networks.
- Digest
  We study how equivariance relates to generalization error utilizing PAC-Bayesian analysis for equivariant networks, where the transformation laws of feature spaces are determined by group representations. By using perturbation analysis of equivariant networks in the Fourier domain for each layer, we derive norm-based PAC-Bayesian generalization bounds. The bound characterizes the impact of group size, and of the multiplicity and degree of irreducible representations, on the generalization error and thereby provides a guideline for selecting them.
Tighter PAC-Bayes Generalisation Bounds by Leveraging Example Difficulty. [paper]
- Felix Biggs, Benjamin Guedj.
- Key Word: PAC-Bayes.
- Digest
  We introduce a modified version of the excess risk, which can be used to obtain tighter, fast-rate PAC-Bayesian generalisation bounds. This modified excess risk leverages information about the relative hardness of data examples to reduce the variance of its empirical counterpart, tightening the bound. We combine this with a new bound for [−1,1]-valued (and potentially non-independent) signed losses, which is more favourable when they empirically have low variance around 0. The primary new technical tool is a novel result for sequences of interdependent random vectors which may be of independent interest. We empirically evaluate these new bounds on a number of real-world datasets.
How Mask Matters: Towards Theoretical Understandings of Masked Autoencoders. [paper] [code]
- Qi Zhang, Yifei Wang, Yisen Wang. NeurIPS 2022
- Key Word: Masked Autoencoders.
- Digest
  We propose a theoretical understanding of how masking matters for masked autoencoders (MAE) to learn meaningful features. We establish a close connection between MAE and contrastive learning, which shows that MAE implicitly aligns the mask-induced positive pairs. Built upon this connection, we develop the first downstream guarantees for MAE methods, and analyze the effect of mask ratio. Besides, as a result of the implicit alignment, we also point out the dimensional collapse issue of MAE, and propose a Uniformity-enhanced MAE (U-MAE) loss that can effectively address this issue and bring significant improvements on real-world datasets, including CIFAR-10, ImageNet-100, and ImageNet-1K.
On the Importance of Gradient Norm in PAC-Bayesian Bounds. [paper]
- Itai Gat, Yossi Adi, Alexander Schwing, Tamir Hazan. NeurIPS 2022
- Key Word: PAC-Bayes.
- Digest
  Generalization bounds, which assess the difference between the true risk and the empirical risk, have been studied extensively. However, to obtain bounds, current techniques use strict assumptions such as a uniformly bounded or a Lipschitz loss function. To avoid these assumptions, in this paper, we follow an alternative approach: we relax uniform bounds assumptions by using on-average bounded loss and on-average bounded gradient norm assumptions. Following this relaxation, we propose a new generalization bound that exploits the contractivity of the log-Sobolev inequalities.
SGD with large step sizes learns sparse features. [paper]
- Maksym Andriushchenko, Aditya Varre, Loucas Pillaud-Vivien, Nicolas Flammarion.
- Key Word: Stochastic Gradient Descent; Sparse Features.
- Digest
  We showcase important features of the dynamics of Stochastic Gradient Descent (SGD) in the training of neural networks. We present empirical observations that (i) commonly used large step sizes lead the iterates to jump from one side of a valley to the other, causing loss stabilization, and (ii) this stabilization induces a hidden stochastic dynamics orthogonal to the bouncing directions that biases it implicitly toward simple predictors.
The good, the bad and the ugly sides of data augmentation: An implicit spectral regularization perspective. [paper]
- Chi-Heng Lin, Chiraag Kaushik, Eva L. Dyer, Vidya Muthukumar.
- Key Word: Data Augmentation; Spectral Regularization.
- Digest
  We develop a new theoretical framework to characterize the impact of a general class of data augmentation (DA) on underparameterized and overparameterized linear model generalization. Our framework reveals that DA induces implicit spectral regularization through a combination of two distinct effects: a) manipulating the relative proportion of eigenvalues of the data covariance matrix in a training-data-dependent manner, and b) uniformly boosting the entire spectrum of the data covariance matrix through ridge regression.
Understanding Gradient Regularization in Deep Learning: Efficient Finite-Difference Computation and Implicit Bias. [paper]
- Ryo Karakida, Tomoumi Takase, Tomohiro Hayase, Kazuki Osawa.
- Key Word: Gradient Regularization; Implicit Bias.
- Digest
  We first reveal that a specific finite-difference computation, composed of both gradient ascent and descent steps, reduces the computational cost of gradient regularization (GR). In addition, this computation empirically achieves better generalization performance. Next, we theoretically analyze a solvable model, a diagonal linear network, and clarify that GR has a desirable implicit bias in a certain problem. In particular, learning with the finite-difference GR chooses better minima as the ascent step size becomes larger.
The Dynamics of Sharpness-Aware Minimization: Bouncing Across Ravines and Drifting Towards Wide Minima. [paper]
- Peter L. Bartlett, Philip M. Long, Olivier Bousquet.
- Key Word: Sharpness-Aware Minimization.
- Digest
  We consider Sharpness-Aware Minimization (SAM), a gradient-based optimization method for deep networks that has exhibited performance improvements on image and language prediction problems. We show that when SAM is applied with a convex quadratic objective, for most random initializations it converges to a cycle that oscillates between either side of the minimum in the direction with the largest curvature, and we provide bounds on the rate of convergence.
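The oscillation described above can be reproduced on a two-dimensional quadratic. This is a minimal sketch under stated assumptions, not the paper's experiment: it assumes the common normalized SAM update, θ ← θ − η ∇L(θ + ρ ∇L(θ)/‖∇L(θ)‖), a diagonal objective L(θ) = ½ Σ λ_i θ_i², and illustrative values for the curvatures, step size, and ρ. The function name `sam_step` is hypothetical. The reference amplitude ηλ₁ρ/(2 − ηλ₁) comes from solving the one-dimensional update for a point that maps to its own negative.

```python
import math


def sam_step(theta, lams, lr, rho):
    """One normalized SAM step on L(theta) = 0.5 * sum_i lams[i] * theta[i]**2."""
    g = [l * t for l, t in zip(lams, theta)]  # gradient at theta
    norm = math.sqrt(sum(v * v for v in g))
    probe = [t + rho * v / norm for t, v in zip(theta, g)]  # ascent point
    # Descend using the gradient evaluated at the ascent point.
    return [t - lr * l * p for t, l, p in zip(theta, lams, probe)]


lams = [1.0, 0.5]    # curvatures; index 0 is the sharpest direction
lr, rho = 0.2, 0.1   # illustrative step size and SAM radius
theta = [1.0, 1.0]
trajectory = []
for _ in range(2000):
    theta = sam_step(theta, lams, lr, rho)
    trajectory.append(theta)

# Amplitude of the 2-cycle along the sharpest direction in the 1-D analysis.
amplitude = lr * lams[0] * rho / (2.0 - lr * lams[0])
```

In this run the coordinate along the flatter direction decays toward zero while the coordinate along the sharpest direction keeps flipping sign at roughly `amplitude`. Plain gradient descent with the same step size would instead converge to the minimum.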
SAM as an Optimal Relaxation of Bayes. [paper]
- Thomas Möllenhoff, Mohammad Emtiyaz Khan.
- Key Word: Sharpness-Aware Minimization; Bayesian Methods.
- Digest
  Sharpness-aware minimization (SAM) and related adversarial deep-learning methods can drastically improve generalization, but their underlying mechanisms are not yet fully understood. Here, we establish SAM as a relaxation of the Bayes objective where the expected negative-loss is replaced by the optimal convex lower bound, obtained by using the so-called Fenchel biconjugate. The connection enables a new Adam-like extension of SAM to automatically obtain reasonable uncertainty estimates, while sometimes also improving its accuracy. By connecting adversarial and Bayesian methods, our work opens a new path to robustness.
Understanding Influence Functions and Datamodels via Harmonic Analysis. [paper]
- Nikunj Saunshi, Arushi Gupta, Mark Braverman, Sanjeev Arora.
- Key Word: Influence Functions; Harmonic Analysis.
- Digest
  The current paper seeks to provide a better theoretical understanding of influence functions and datamodels. The primary tool is harmonic analysis and the idea of noise stability. Contributions include: (a) Exact characterization of the learnt datamodel in terms of Fourier coefficients. (b) An efficient method to estimate the residual error and quality of the optimum linear datamodel without having to train the datamodel. (c) New insights into when influences of groups of datapoints may or may not add up linearly.
Plateau in Monotonic Linear Interpolation -- A "Biased" View of Loss Landscape for Deep Networks. [paper]
- Xiang Wang, Annie N. Wang, Mo Zhou, Rong Ge.
- Key Word: Monotonic Linear Interpolation; Loss Landscapes.
- Digest
  We show that the monotonic linear interpolation (MLI) property is not necessarily related to the hardness of optimization problems, and empirical observations on MLI for deep neural networks depend heavily on biases. In particular, we show that interpolating both weights and biases linearly leads to very different influences on the final output, and when different classes have different last-layer biases on a deep network, there will be a long plateau in both the loss and accuracy interpolation (which existing theory of MLI cannot explain).
Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability. [paper]
- Alex Damian, Eshaan Nichani, Jason D. Lee.
- Key Word: Implicit Bias; Edge of Stability.
- Digest
  Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness S(θ), is bounded by 2/η, training is "stable" and the training loss decreases monotonically. Recent works, however, have observed that this assumption does not hold when training modern neural networks with full batch or large batch gradient descent. Most recently, Cohen et al. (2021) observed two important phenomena. The first, dubbed progressive sharpening, is that the sharpness steadily increases throughout training until it reaches the instability cutoff 2/η. The second, dubbed edge of stability, is that the sharpness hovers at 2/η for the remainder of training while the loss continues decreasing, albeit non-monotonically. We demonstrate that, far from being chaotic, the dynamics of gradient descent at the edge of stability can be captured by a cubic Taylor expansion: as the iterates diverge in direction of the top eigenvector of the Hessian due to instability, the cubic term in the local Taylor expansion of the loss function causes the curvature to decrease until stability is restored.
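The 2/η stability threshold mentioned above is easiest to see on a one-dimensional quadratic. The sketch below illustrates only that classical threshold, not the paper's cubic self-stabilization analysis. It assumes L(θ) = ½ S θ², for which each gradient descent step multiplies θ by (1 − ηS), and the function name `run_gd` is hypothetical.

```python
def run_gd(sharpness, lr, steps=100, theta0=1.0):
    """Gradient descent on L(theta) = 0.5 * sharpness * theta**2."""
    theta = theta0
    for _ in range(steps):
        theta -= lr * sharpness * theta  # theta <- (1 - lr * sharpness) * theta
    return theta


lr = 0.1
threshold = 2.0 / lr            # instability cutoff 2/eta, here 20

stable = run_gd(19.0, lr)       # sharpness just below 2/eta: iterates shrink
unstable = run_gd(21.0, lr)     # sharpness just above 2/eta: iterates blow up
```

On a pure quadratic, crossing the threshold leads to divergence. The paper's point is that in real networks the higher-order terms of the loss reduce the curvature again, so training continues at the edge of stability instead.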
Scaling Laws For Deep Learning Based Image Reconstruction. [paper]
- Tobit Klug, Reinhard Heckel.
- Key Word: Scaling Laws; Inverse Problems.
- Digest
  We study whether major performance gains are expected from scaling up the training set size. We consider image denoising, accelerated magnetic resonance imaging, and super-resolution and empirically determine the reconstruction quality as a function of training set size, while optimally scaling the network size. For all three tasks we find that an initially steep power-law scaling slows significantly already at moderate training set sizes. Interpolating those scaling laws suggests that even training on millions of images would not significantly improve performance.
Why neural networks find simple solutions: the many regularizers of geometric complexity. [paper]
- Benoit Dherin, Michael Munn, Mihaela C. Rosca, David G.T. Barrett. NeurIPS 2022
- Key Word: Regularization; Geometric Complexity; Dirichlet Energy.
- Digest
  In many contexts, simpler models are preferable to more complex models and the control of this model complexity is the goal for many methods in machine learning such as regularization, hyperparameter tuning and architecture design. In deep learning, it has been difficult to understand the underlying mechanisms of complexity control, since many traditional measures are not naturally suitable for deep neural networks. Here we develop the notion of geometric complexity, which is a measure of the variability of the model function, computed using a discrete Dirichlet energy. Using a combination of theoretical arguments and empirical results, we show that many common training heuristics such as parameter norm regularization, spectral norm regularization, flatness regularization, implicit gradient regularization, noise regularization and the choice of parameter initialization all act to control geometric complexity, providing a unifying framework in which to characterize the behavior of deep learning models.
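A discrete Dirichlet energy of the kind described above can be estimated with finite differences. This is a minimal sketch under assumptions rather than the paper's implementation: it takes geometric complexity to be the mean squared norm of the input gradient over a dataset, restricts to scalar-valued models, and uses central differences. The function name `geometric_complexity` and the toy models are hypothetical.

```python
def geometric_complexity(f, xs, eps=1e-5):
    """Mean over xs of ||grad_x f(x)||^2, using central finite differences."""
    total = 0.0
    for x in xs:
        for k in range(len(x)):
            hi = list(x)
            lo = list(x)
            hi[k] += eps
            lo[k] -= eps
            deriv = (f(hi) - f(lo)) / (2.0 * eps)  # approximates df/dx_k
            total += deriv * deriv
    return total / len(xs)


def linear_model(x):
    # Toy model with input gradient (3, 4), so squared gradient norm 25.
    return 3.0 * x[0] + 4.0 * x[1]


def constant_model(x):
    # Toy model that ignores its input, so its complexity is zero.
    return 0.5


xs = [[0.0, 0.0], [1.0, -1.0], [2.0, 0.5]]
gc_linear = geometric_complexity(linear_model, xs)
gc_constant = geometric_complexity(constant_model, xs)
```

For a linear model the value is the squared weight norm at every data point, which makes the link to parameter norm regularization visible in the simplest case. For a deep network one would typically use automatic differentiation instead of finite differences.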
Variational Inference for Infinitely Deep Neural Networks. [paper]
- Achille Nazaret, David Blei. ICML 2022
- Key Word: Unbounded Depth Neural Networks; Variational Inference.
- Digest
  We develop a novel variational inference algorithm to approximate the posterior of the unbounded depth neural network (UDN), optimizing a distribution of the neural network weights and of the truncation depth L, without any upper limit on L. To this end, the variational family has a special structure: it models neural network weights of arbitrary depth, and it dynamically creates or removes free variational parameters as its distribution of the truncation is optimized.
Deep Linear Networks can Benignly Overfit when Shallow Ones Do. [paper]
- Niladri S. Chatterji, Philip M. Long.
- Key Word: Benign Overfitting; Double Descent; Implicit Bias.
- Digest
  We bound the excess risk of interpolating deep linear networks trained using gradient flow. In a setting previously used to establish risk bounds for the minimum ℓ2-norm interpolant, we show that randomly initialized deep linear networks can closely approximate or even match known bounds for the minimum ℓ2-norm interpolant. Our analysis also reveals that interpolating deep linear models have exactly the same conditional variance as the minimum ℓ2-norm solution.
Robustness in deep learning: The good (width), the bad (depth), and the ugly (initialization). [paper]
- Zhenyu Zhu, Fanghui Liu, Grigorios G Chrysos, Volkan Cevher. NeurIPS 2022
- Key Word: Lazy Training; Neural Tangent Kernel.
- Digest
  We study the average robustness notion in deep neural networks in (selected) wide and narrow, deep and shallow, as well as lazy and non-lazy training settings. We prove that in the under-parameterized setting, width has a negative effect while it improves robustness in the over-parameterized setting. The effect of depth closely depends on the initialization and the training mode. In particular, when initialized with LeCun initialization, depth helps robustness in the lazy training regime. In contrast, when initialized with Neural Tangent Kernel (NTK) and He-initialization, depth hurts the robustness.
Normalization effects on deep neural networks. [paper]
- Jiahui Yu, Konstantinos Spiliopoulos.
- Key Word: Normalization.
- Digest
  We find that in terms of variance of the neural network's output and test accuracy the best choice is to choose the γi's to be equal to one, which is the mean-field scaling. We also find that this is particularly true for the outer layer, in that the neural network's behavior is more sensitive in the scaling of the outer layer as opposed to the scaling of the inner layers. The mechanism for the mathematical analysis is an asymptotic expansion for the neural network's output.
Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting. [paper]
- Neil Mallinar, James B. Simon, Amirhesam Abedsoltan, Parthe Pandit, Mikhail Belkin, Preetum Nakkiran.
- Key Word: Overfitting; Kernel Regression.
- Digest
  The practical success of overparameterized neural networks has motivated the recent scientific study of interpolating methods, which perfectly fit their training data. Certain interpolating methods, including neural networks, can fit noisy training data without catastrophically bad test performance, in defiance of standard intuitions from statistical learning theory. Aiming to explain this, a body of recent work has studied benign overfitting, a phenomenon where some interpolating methods approach Bayes optimality, even in the presence of noise. In this work we argue that while benign overfitting has been instructive and fruitful to study, many real interpolating methods like neural networks do not fit benignly: modest noise in the training set causes nonzero (but non-infinite) excess risk at test time, implying these models are neither benign nor catastrophic but rather fall in an intermediate regime. We call this intermediate regime tempered overfitting, and we initiate its systematic study.
Towards understanding how momentum improves generalization in deep learning. [paper]
- Samy Jelassi, Yuanzhi Li. ICML 2022
- Key Word: Gradient Descent with Momentum.
- Digest
  We adopt another perspective and first empirically show that gradient descent with momentum (GD+M) significantly improves generalization compared to gradient descent (GD) in some deep learning problems. From this observation, we formally study how momentum improves generalization. We devise a binary classification setting where a one-hidden layer (over-parameterized) convolutional neural network trained with GD+M provably generalizes better than the same network trained with GD, when both algorithms are similarly initialized.
Git Re-Basin: Merging Models modulo Permutation Symmetries. [paper]
- Samuel K. Ainsworth, Jonathan Hayase, Siddhartha Srinivasa.
- Key Word: Mode Connectivity.
- Digest
  We argue that neural network loss landscapes contain (nearly) a single basin, after accounting for all possible permutation symmetries of hidden units. We introduce three algorithms to permute the units of one model to bring them into alignment with units of a reference model. This transformation produces a functionally equivalent set of weights that lie in an approximately convex basin near the reference model. Experimentally, we demonstrate the single basin phenomenon across a variety of model architectures and datasets, including the first (to our knowledge) demonstration of zero-barrier linear mode connectivity between independently trained ResNet models on CIFAR-10 and CIFAR-100.
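The permutation symmetry that the digest above builds on can be checked on a toy network. The sketch below is not the paper's alignment algorithm; it only illustrates, with a hypothetical two-layer tanh MLP and hypothetical helper names (`mlp`, `permute_hidden`), that reordering hidden units yields a functionally equivalent set of weights.

```python
import math
import random


def mlp(x, W1, b1, W2):
    """Two-layer MLP: tanh hidden layer followed by a linear readout."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(W1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) for row in W2]


def permute_hidden(W1, b1, W2, perm):
    """Reorder hidden units: rows of W1 and b1, and the matching columns of W2."""
    W1_p = [W1[j] for j in perm]
    b1_p = [b1[j] for j in perm]
    W2_p = [[row[j] for j in perm] for row in W2]
    return W1_p, b1_p, W2_p


# Illustrative sizes and random weights (assumptions, not from the paper).
rng = random.Random(0)
d_in, d_hidden, d_out = 3, 5, 2
W1 = [[rng.gauss(0, 1) for _ in range(d_in)] for _ in range(d_hidden)]
b1 = [rng.gauss(0, 1) for _ in range(d_hidden)]
W2 = [[rng.gauss(0, 1) for _ in range(d_hidden)] for _ in range(d_out)]
x = [rng.gauss(0, 1) for _ in range(d_in)]

# Apply a random permutation to the hidden units.
perm = list(range(d_hidden))
rng.shuffle(perm)
W1_p, b1_p, W2_p = permute_hidden(W1, b1, W2, perm)

# The two weight settings differ, but compute the same function.
y = mlp(x, W1, b1, W2)
y_p = mlp(x, W1_p, b1_p, W2_p)
max_diff = max(abs(a - b) for a, b in zip(y, y_p))
print("max output difference:", max_diff)
```

Because the permuted weights define the same function, searching over permutations can move one trained model closer to another in weight space without changing its predictions.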
Synergy and Symmetry in Deep Learning: Interactions between the Data, Model, and Inference Algorithm. [paper]
- Lechao Xiao, Jeffrey Pennington. ICML 2022
- Key Word: Synergy; Symmetry; Implicit Bias; Neural Tangent Kernel; Neural Scaling Laws.
- Digest
  Although learning in high dimensions is commonly believed to suffer from the curse of dimensionality, modern machine learning methods often exhibit an astonishing power to tackle a wide range of challenging real-world learning problems without using abundant amounts of data. How exactly these methods break this curse remains a fundamental open question in the theory of deep learning. While previous efforts have investigated this question by studying the data (D), model (M), and inference algorithm (I) as independent modules, in this paper, we analyze the triplet (D, M, I) as an integrated system and identify important synergies that help mitigate the curse of dimensionality.
Implicit Bias of Gradient Descent on Reparametrized Models: On Equivalence to Mirror Descent. [paper]
- Zhiyuan Li, Tianhao Wang, Jason D. Lee, Sanjeev Arora.
- Key Word: Implicit Bias; Mirror Descent.
- Digest
  As part of the effort to understand implicit bias of gradient descent in overparametrized models, several results have shown how the training trajectory on the overparametrized model can be understood as mirror descent on a different objective. The main result here is a characterization of this phenomenon under a notion termed commuting parametrization, which encompasses all the previous results in this setting. It is shown that gradient flow with any commuting parametrization is equivalent to continuous mirror descent with a related Legendre function.
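As a reminder of the object named in the digest above, continuous-time mirror descent on a loss L with a Legendre (mirror) function R is usually written as below. The symbols w, R, and L are assumed notation rather than the paper's own. Choosing R(w) = ½‖w‖² recovers plain gradient flow.

```latex
\frac{\mathrm{d}}{\mathrm{d}t} \nabla R\bigl(w(t)\bigr) = -\nabla L\bigl(w(t)\bigr),
\qquad \text{equivalently} \qquad
\dot{w}(t) = -\bigl[\nabla^2 R\bigl(w(t)\bigr)\bigr]^{-1} \nabla L\bigl(w(t)\bigr).
```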
Beyond neural scaling laws: beating power law scaling via data pruning. [paper]
- Ben Sorscher, Robert Geirhos, Shashank Shekhar, Surya Ganguli, Ari S. Morcos.
- Key Word: Dataset Pruning; Ensemble Active Learning.
- Digest
  Widely observed neural scaling laws, in which error falls off as a power of the training set size, model size, or both, have driven substantial performance improvements in deep learning. However, these improvements through scaling alone require considerable costs in compute and energy. Here we focus on the scaling of error with dataset size and show how both in theory and practice we can break beyond power law scaling and reduce it to exponential scaling instead if we have access to a high-quality data pruning metric that ranks the order in which training examples should be discarded to achieve any pruned dataset size. We then test this new exponential scaling prediction with pruned dataset size empirically, and indeed observe better than power law scaling performance on ResNets trained on CIFAR-10, SVHN, and ImageNet.
Robustness Implies Generalization via Data-Dependent Generalization Bounds. [paper]
- Kenji Kawaguchi, Zhun Deng, Kyle Luh, Jiaoyang Huang. ICML 2022
- Key Word: Algorithmic Robustness Bound.
- Digest
  This paper proves that robustness implies generalization via data-dependent generalization bounds. As a result, robustness and generalization are shown to be connected closely in a data-dependent manner. Our bounds improve previous bounds in two directions, to solve an open problem that has seen little development since 2010. The first is to reduce the dependence on the covering number. The second is to remove the dependence on the hypothesis space. We present several examples, including ones for lasso and deep learning, in which our bounds are provably preferable.
Learning sparse features can lead to overfitting in neural networks. [paper] [code]
- Leonardo Petrini, Francesco Cagnetta, Eric Vanden-Eijnden, Matthieu Wyart.
- Key Word: Sparse Representation; Neural Tangent Kernel.
- Digest
  It is widely believed that the success of deep networks lies in their ability to learn a meaningful representation of the features of the data. Yet, understanding when and how this feature learning improves performance remains a challenge: for example, it is beneficial for modern architectures trained to classify images, whereas it is detrimental for fully-connected networks trained for the same task on the same data. Here we propose an explanation for this puzzle, by showing that feature learning can perform worse than lazy training (via random feature kernel or the NTK) as the former can lead to a sparser neural representation. Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth along certain directions of input space. We illustrate this phenomenon in two settings: (i) regression of Gaussian random functions on the d-dimensional unit sphere and (ii) classification of benchmark datasets of images.
Towards Understanding Why Mask-Reconstruction Pretraining Helps in Downstream Tasks. [paper]
- Jiachun Pan, Pan Zhou, Shuicheng Yan.
- Key Word: Mask-Reconstruction Pretraining; Self-Supervision.
- Digest
  Supervised fine-tuning of an encoder pretrained with mask-reconstruction pretraining (MRP) remarkably surpasses conventional supervised learning (SL) trained from scratch. However, it is still unclear 1) how MRP performs semantic learning in the pretraining phase and 2) why it helps in downstream tasks. To solve these problems, we theoretically show that on an auto-encoder of a two/one-layered convolution encoder/decoder, MRP can capture all discriminative semantics in the pretraining dataset, and accordingly show its provable improvement over SL on the classification downstream task.
Why do CNNs Learn Consistent Representations in their First Layer Independent of Labels and Architecture? [paper]
- Rhea Chowers, Yair Weiss.
- Key Word: Architecture Inductive Bias.
- Digest
  It has previously been observed that the filters learned in the first layer of a CNN are qualitatively similar for different networks and tasks. We extend this finding and show a high quantitative similarity between filters learned by different networks. We consider the CNN filters as a filter bank and measure the sensitivity of the filter bank to different frequencies. We show that the sensitivity profile of different networks is almost identical, yet far from initialization. Remarkably, we show that it remains the same even when the network is trained with random labels. To understand this effect, we derive an analytic formula for the sensitivity of the filters in the first layer of a linear CNN. We prove that when the average patch in images of the two classes is identical, the sensitivity profile of the filters in the first layer will be identical in expectation when using the true labels or random labels and will only depend on the second-order statistics of image patches.
A Theoretical Analysis on Feature Learning in Neural Networks: Emergence from Inputs and Advantage over Fixed Features. [paper]
- Zhenmei Shi, Junyi Wei, Yingyu Liang. ICLR 2022
- Key Word: Linearization of Neural Networks; Neural Tangent Kernel.
- Digest
  To better understand the source and benefit of feature learning in neural networks, we consider learning problems motivated by practical data, where the labels are determined by a set of class relevant patterns and the inputs are generated from these along with some background patterns. We prove that neural networks trained by gradient descent can succeed on these problems. The success relies on the emergence and improvement of effective features, which are learned among exponentially many candidates efficiently by exploiting the data (in particular, the structure of the input distribution).
Realistic Deep Learning May Not Fit Benignly. [paper]
- Kaiyue Wen, Jiaye Teng, Jingzhao Zhang.
- Key Word: Benign Overfitting.
- Digest
  We examine the benign overfitting phenomenon in real-world settings. We find that for tasks such as training a ResNet model on the ImageNet dataset, the model does not fit benignly. To understand why benign overfitting fails in the ImageNet experiment, we analyze previous benign overfitting models under a more restrictive setup where the number of parameters is not significantly larger than the number of data points.
A Model of One-Shot Generalization. [paper]
- Thomas Laurent, James H. von Brecht, Xavier Bresson.
- Key Word: One-Shot Generalization; PAC Learning; Neural Tangent Kernel.
- Digest
  We provide a theoretical framework to study a phenomenon that we call one-shot generalization. This phenomenon refers to the ability of an algorithm to perform transfer learning within a single task, meaning that it correctly classifies a test point that has a single exemplar in the training set. We propose a simple data model and use it to study this phenomenon in two ways. First, we prove a non-asymptotic baseline: kernel methods based on nearest-neighbor classification cannot perform one-shot generalization, independently of the choice of the kernel and the size of the training set. Second, we empirically show that the most direct neural network architecture for our data model performs one-shot generalization almost perfectly. This stark differential leads us to believe that the one-shot generalization mechanism is partially responsible for the empirical success of neural networks.
Towards Understanding Grokking: An Effective Theory of Representation Learning. [paper]
- Ziming Liu, Ouail Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams.
- Key Word: Grokking; Physics of Learning; Deep Double Descent.
- Digest
  We aim to understand grokking, a phenomenon where models generalize long after overfitting their training set. We present both a microscopic analysis anchored by an effective theory and a macroscopic analysis of phase diagrams describing learning performance across hyperparameters. We find that generalization originates from structured representations whose training dynamics and dependence on training set size can be predicted by our effective theory in a toy setting. We observe empirically the presence of four learning phases: comprehension, grokking, memorization, and confusion.
Empirical Evaluation and Theoretical Analysis for Representation Learning: A Survey. [paper]
- Kento Nozawa, Issei Sato. IJCAI 2022
- Key Word: Representation Learning; Pre-training; Regularization.
- Digest
  Representation learning enables us to automatically extract generic feature representations from a dataset to solve another machine learning task. Recently, extracted feature representations by a representation learning algorithm and a simple predictor have exhibited state-of-the-art performance on several machine learning tasks. Despite its remarkable progress, there exist various ways to evaluate representation learning algorithms depending on the application because of the flexibility of representation learning. To understand the current representation learning, we review evaluation methods of representation learning algorithms and theoretical analyses.
The Effects of Regularization and Data Augmentation are Class Dependent. [paper]
- Randall Balestriero, Leon Bottou, Yann LeCun. NeurIPS 2022
- Key Word: Data Augmentation.
- Digest
  We demonstrate that techniques such as data augmentation (DA) or weight decay produce a model with a reduced complexity that is unfair across classes. The optimal amount of DA or weight decay found from cross-validation leads to disastrous model performance on some classes. For example, on ImageNet with a ResNet-50, the "barn spider" classification test accuracy falls from 68% to 46% only by introducing random crop DA during training. Even more surprising, such a performance drop also appears when introducing uninformative regularization techniques such as weight decay.
Resonance in Weight Space: Covariate Shift Can Drive Divergence of SGD with Momentum. [paper]
- Kirby Banman, Liam Peet-Pare, Nidhi Hegde, Alona Fyshe, Martha White. ICLR 2022
- Key Word: Stochastic Gradient Descent; Covariate Shift.
- Digest
  We show that SGD with momentum (SGDm) under covariate shift with a fixed step size can be unstable and diverge. In particular, we show SGDm under covariate shift is a parametric oscillator, and so can suffer from a phenomenon known as resonance. We approximate the learning system as a time-varying system of ordinary differential equations, and leverage existing theory to characterize the system's divergence/convergence as resonant/nonresonant modes.
How Many Data Are Needed for Robust Learning? [paper]
- Hongyang Zhang, Yihan Wu, Heng Huang.
- Key Word: Robustness.
- Digest
  In this work, we study the sample complexity of the robust interpolation problem when the data are in a unit ball. We show that both too much data and too little data hurt robustness.
A Data-Augmentation Is Worth A Thousand Samples: Exact Quantification From Analytical Augmented Sample Moments. [paper]
- Randall Balestriero, Ishan Misra, Yann LeCun. NeurIPS 2022
- Key Word: Data Augmentation.
- Digest
  We derive several quantities in closed form, such as the expectation and variance of an image, loss, and model's output under a given data augmentation (DA) distribution. Those derivations open new avenues to quantify the benefits and limitations of DA. For example, we show that common DAs require tens of thousands of samples for the loss at hand to be correctly estimated and for the model training to converge.

Others: 2021

Discovering and Explaining the Representation Bottleneck of DNNs. [paper]
- Huiqi Deng, Qihan Ren, Hao Zhang, Quanshi Zhang. ICLR 2022
- Key Word: Representation Bottleneck; Explanation.
- Digest
  This paper explores the bottleneck of feature representations of deep neural networks (DNNs), from the perspective of the complexity of interactions between input variables encoded in DNNs. To this end, we focus on the multi-order interaction between input variables, where the order represents the complexity of interactions. We discover that a DNN is more likely to encode both too simple and too complex interactions, but usually fails to learn interactions of intermediate complexity. Such a phenomenon is widely shared by different DNNs for different tasks. This phenomenon indicates a cognition gap between DNNs and humans, and we call it a representation bottleneck. We theoretically prove the underlying reason for the representation bottleneck.
The Equilibrium Hypothesis: Rethinking implicit regularization in Deep Neural Networks. [paper]
- Yizhang Lou, Chris Mingard, Soufiane Hayou.
- Key Word: Implicit Regularization.
- Digest
  We provide the first explanation for this alignment hierarchy. We introduce and empirically validate the Equilibrium Hypothesis which states that the layers that achieve some balance between forward and backward information loss are the ones with the highest alignment to data labels.
Understanding Dimensional Collapse in Contrastive Self-supervised Learning. [paper] [code]
- Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian. ICLR 2022
- Key Word: Self-Supervision; Contrastive Learning; Implicit Regularization; Dimensional Collapse.
- Digest
  We show that dimensional collapse also happens in contrastive learning. In this paper, we shed light on the dynamics at play in contrastive learning that leads to dimensional collapse. Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on a trainable projector.
Implicit Sparse Regularization: The Impact of Depth and Early Stopping. [paper] [code]
- Jiangyuan Li, Thanh V. Nguyen, Chinmay Hegde, Raymond K. W. Wong. NeurIPS 2021
- Key Word: Implicit Regularization.
- Digest
  In this paper, we study the implicit bias of gradient descent for sparse regression. We extend results on regression with quadratic parametrization, which amounts to depth-2 diagonal linear networks, to more general depth-N networks, under more realistic settings of noise and correlated designs. We show that early stopping is crucial for gradient descent to converge to a sparse model, a phenomenon that we call implicit sparse regularization. This result is in sharp contrast to known results for noiseless and uncorrelated-design cases.
The Benefits of Implicit Regularization from SGD in Least Squares Problems. [paper]
- Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, Dean P. Foster, Sham M. Kakade. NeurIPS 2021
- Key Word: Implicit Regularization.
- Digest
  We show: (1) for every problem instance and for every ridge parameter, (unregularized) SGD, when provided with logarithmically more samples than that provided to the ridge algorithm, generalizes no worse than the ridge solution (provided SGD uses a tuned constant stepsize); (2) conversely, there exist instances (in this wide problem class) where optimally-tuned ridge regression requires quadratically more samples than SGD in order to have the same generalization performance.
Neural Controlled Differential Equations for Online Prediction Tasks. [paper] [code]
- James Morrill, Patrick Kidger, Lingyi Yang, Terry Lyons.
- Key Word: Ordinary Differential Equations.
- Digest
  Neural controlled differential equations (Neural CDEs) are state-of-the-art models for irregular time series. However, due to current implementations relying on non-causal interpolation schemes, Neural CDEs cannot currently be used in online prediction tasks; that is, in real-time as data arrives. This is in contrast to similar ODE models such as the ODE-RNN which can already operate in continuous time. Here we introduce and benchmark new interpolation schemes, most notably, rectilinear interpolation, which allows for an online everywhere causal solution to be defined.
Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning. [paper] [code]
- Colin Wei, Sang Michael Xie, Tengyu Ma. NeurIPS 2021
- Key Word: Natural Language Processing; Pre-training; Prompting.
- Digest
  We propose an analysis framework that links the pretraining and downstream tasks with an underlying latent variable generative model of text -- the downstream classifier must recover a function of the posterior distribution over the latent variables. We analyze head tuning (learning a classifier on top of the frozen pretrained model) and prompt tuning in this setting. The generative model in our analysis is either a Hidden Markov Model (HMM) or an HMM augmented with a latent memory component, motivated by long-term dependencies in natural language.
Differentiable Multiple Shooting Layers. [paper] [code]
- Stefano Massaroli, Michael Poli, Sho Sonoda, Taiji Suzuki, Jinkyoo Park, Atsushi Yamashita, Hajime Asama. NeurIPS 2021
- Key Word: Ordinary Differential Equations.
- Digest
  We detail a novel class of implicit neural models. Leveraging time-parallel methods for differential equations, Multiple Shooting Layers (MSLs) seek solutions of initial value problems via parallelizable root-finding algorithms. MSLs broadly serve as drop-in replacements for neural ordinary differential equations (Neural ODEs) with improved efficiency in number of function evaluations (NFEs) and wall-clock inference time.
Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning. [paper] [code]
- Jannik Kossen, Neil Band, Clare Lyle, Aidan N. Gomez, Tom Rainforth, Yarin Gal. NeurIPS 2021
- Key Word: Sample-Wise Self-Attention; Meta Learning; Metric Learning.
- Digest
  We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input. To this end, we introduce a general-purpose deep learning architecture that takes as input the entire dataset instead of processing one datapoint at a time. Our approach uses self-attention to reason about relationships between datapoints explicitly, which can be seen as realizing non-parametric models using parametric attention mechanisms.
Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation. [paper]
- Mikhail Belkin.
- Key Word: Interpolation; Over-parameterization.
- Digest
  In the past decade the mathematical theory of machine learning has lagged far behind the triumphs of deep neural networks on practical challenges. However, the gap between theory and practice is gradually starting to close. In this paper I will attempt to assemble some pieces of the remarkable and still incomplete mathematical mosaic emerging from the efforts to understand the foundations of deep learning. The two key themes will be interpolation, and its sibling, over-parameterization. Interpolation corresponds to fitting data, even noisy data, exactly. Over-parameterization enables interpolation and provides flexibility to select a right interpolating model.
A Universal Law of Robustness via Isoperimetry. [paper]
- Sébastien Bubeck, Mark Sellke.
- Key Word: Overparameterized Memorization; Lipschitz Neural Network.
- Digest
  A puzzling phenomenon in deep learning is that models are trained with many more parameters than what this classical theory would suggest. We propose a theoretical explanation for this phenomenon. We prove that for a broad class of data distributions and model classes, overparametrization is necessary if one wants to interpolate the data smoothly. Namely we show that smooth interpolation requires d times more parameters than mere interpolation, where d is the ambient data dimension.
Analyzing Monotonic Linear Interpolation in Neural Network Loss Landscapes. [paper]
- James Lucas, Juhan Bae, Michael R. Zhang, Stanislav Fort, Richard Zemel, Roger Grosse.
- Key Word: Monotonic Linear Interpolation; Loss Landscapes.
- Digest
  We evaluate several hypotheses for the monotonic linear interpolation (MLI) property that, to our knowledge, have not yet been explored. Using tools from differential geometry, we draw connections between the interpolated paths in function space and the monotonicity of the network, providing sufficient conditions for the MLI property under mean squared error. While the MLI property holds under various settings (e.g. network architectures and learning problems), we show in practice that networks violating the MLI property can be produced systematically, by encouraging the weights to move far from initialization.
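The property discussed in the digest above concerns the loss along the straight line between two parameter vectors. The sketch below shows how such a check can be set up; the quadratic toy loss, the endpoints, and all function names are assumptions made for illustration, whereas in practice the endpoints would be a network's initial and trained weights.

```python
def loss(theta, curvatures, optimum):
    """Toy convex loss: 0.5 * sum_i a_i * (theta_i - optimum_i)^2."""
    return 0.5 * sum(
        a * (t - o) ** 2 for a, t, o in zip(curvatures, theta, optimum)
    )


def interpolate(theta_start, theta_end, alpha):
    """Point on the line segment: (1 - alpha) * start + alpha * end."""
    return [(1 - alpha) * s + alpha * e for s, e in zip(theta_start, theta_end)]


def losses_along_path(theta_start, theta_end, loss_fn, num_points=11):
    """Evaluate the loss at evenly spaced points between the two endpoints."""
    alphas = [i / (num_points - 1) for i in range(num_points)]
    return [loss_fn(interpolate(theta_start, theta_end, a)) for a in alphas]


def is_monotonic_decreasing(values, tol=1e-12):
    """True if the sequence never increases by more than tol."""
    return all(b <= a + tol for a, b in zip(values, values[1:]))


# Hypothetical endpoints: an "initialization" and the minimizer of the toy loss.
curvatures = [4.0, 1.0]
optimum = [1.0, -2.0]
theta_init = [-3.0, 2.0]

path_losses = losses_along_path(
    theta_init, optimum, lambda th: loss(th, curvatures, optimum)
)
mli_holds = is_monotonic_decreasing(path_losses)
print("losses along path:", [round(v, 3) for v in path_losses])
print("monotonic:", mli_holds)
```

For this convex toy loss the check always passes. The interesting cases in the paper are non-convex networks, where the same check can fail.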
MALI: A memory efficient and reverse accurate integrator for Neural ODEs. [paper] [code]
- Juntang Zhuang, Nicha C. Dvornek, Sekhar Tatikonda, James S. Duncan. ICLR 2021
- Key Word: Ordinary Differential Equations.
- Digest
  Based on the asynchronous leapfrog (ALF) solver, we propose the Memory-efficient ALF Integrator (MALI), which has a constant memory cost w.r.t. the number of solver steps in integration, similar to the adjoint method, and guarantees accuracy in the reverse-time trajectory (hence accuracy in gradient estimation). We validate MALI in various tasks: on image recognition tasks, to our knowledge, MALI is the first to enable feasible training of a Neural ODE on ImageNet and outperform a well-tuned ResNet, while existing methods fail due to either heavy memory burden or inaccuracy.

Others: 2020

Understanding the Failure Modes of Out-of-Distribution Generalization. [paper] [code]
- Vaishnavh Nagarajan, Anders Andreassen, Behnam Neyshabur. ICLR 2021
- Key Word: Out-of-Distribution Generalization.
- Digest
  We identify that spurious correlations during training can induce two distinct skews in the training set, one geometric and another statistical. These skews result in two complementary ways by which empirical risk minimization (ERM) via gradient descent is guaranteed to rely on those spurious correlations.
Sharpness-Aware Minimization for Efficiently Improving Generalization. [paper] [code]
- Pierre Foret, Ariel Kleiner, Hossein Mobahi, Behnam Neyshabur. ICLR 2021
- Key Word: Flat Minima.
- Digest
  In today's heavily overparameterized models, the value of the training loss provides few guarantees on model generalization ability. Indeed, optimizing only the training loss value, as is commonly done, can easily lead to suboptimal model quality. Motivated by prior work connecting the geometry of the loss landscape and generalization, we introduce a novel, effective procedure for instead simultaneously minimizing loss value and loss sharpness. In particular, our procedure, Sharpness-Aware Minimization (SAM), seeks parameters that lie in neighborhoods having uniformly low loss; this formulation results in a min-max optimization problem on which gradient descent can be performed efficiently.
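The min-max procedure summarized above is typically approximated with a two-step update: an ascent step of length rho along the normalized gradient, then a descent step using the gradient at the perturbed point. The sketch below applies that update to a toy quadratic loss. The loss, the hyperparameter values, and the function names are illustrative assumptions, not the authors' implementation.

```python
import math


def loss(w, curvatures):
    """Toy quadratic loss: 0.5 * sum_i a_i * w_i^2."""
    return 0.5 * sum(a * wi * wi for a, wi in zip(curvatures, w))


def grad(w, curvatures):
    """Gradient of the toy quadratic loss."""
    return [a * wi for a, wi in zip(curvatures, w)]


def sam_step(w, curvatures, lr=0.05, rho=0.1):
    """One sharpness-aware step (first-order approximation of the inner max)."""
    g = grad(w, curvatures)
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return list(w)  # already at a stationary point
    # Ascent: move to the approximate worst-case point within an L2 ball of radius rho.
    w_adv = [wi + rho * gi / norm for wi, gi in zip(w, g)]
    # Descent: update the original weights with the gradient taken at w_adv.
    g_adv = grad(w_adv, curvatures)
    return [wi - lr * gi for wi, gi in zip(w, g_adv)]


curvatures = [10.0, 1.0]  # one sharp and one flat direction (assumed values)
w = [1.0, 1.0]
initial_loss = loss(w, curvatures)
for _ in range(100):
    w = sam_step(w, curvatures)
final_loss = loss(w, curvatures)
print("loss before:", initial_loss, "loss after:", final_loss)
```

Each SAM step costs two gradient evaluations instead of one, which is the main overhead compared with plain gradient descent.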
Implicit Gradient Regularization. [paper]
- David G.T. Barrett, Benoit Dherin. ICLR 2021
- Key Word: Implicit Regularization.
- Digest
  Gradient descent can be surprisingly good at optimizing deep neural networks without overfitting and without explicit regularization. We find that the discrete steps of gradient descent implicitly regularize models by penalizing gradient descent trajectories that have large loss gradients. We call this Implicit Gradient Regularization (IGR) and we use backward error analysis to calculate the size of this regularization. We confirm empirically that implicit gradient regularization biases gradient descent toward flat minima, where test errors are small and solutions are robust to noisy parameter perturbations.
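The regularizer described above is usually stated as a modified loss that discrete gradient descent tracks more closely than the original loss. The expression below is a sketch of the leading-order term only, with assumed notation (E is the training loss, θ the parameters, h the learning rate); the paper gives the precise statement and its conditions.

```latex
\tilde{E}(\theta) = E(\theta) + \frac{h}{4} \, \bigl\| \nabla E(\theta) \bigr\|_2^2 .
```

The penalty grows with the step size h, which is consistent with the digest's claim that it is the discreteness of the steps that produces the regularization.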
Neural Rough Differential Equations for Long Time Series. [paper] [code]
- James Morrill, Cristopher Salvi, Patrick Kidger, James Foster, Terry Lyons. ICML 2021
- Key Word: Ordinary Differential Equations.
- Digest
  Neural Controlled Differential Equations (Neural CDEs) are the continuous-time analogue of an RNN. However, as with RNNs, training can quickly become impractical for long time series. Here we use rough path theory to extend this formulation through application of a pre-existing mathematical tool from rough analysis - the log-ODE method - which allows us to take integration steps larger than the discretisation of the data, resulting in significantly faster training times, with retainment (and often even improvements) in model performance.
Optimizing Mode Connectivity via Neuron Alignment. [paper] [code]
- N. Joseph Tatro, Pin-Yu Chen, Payel Das, Igor Melnyk, Prasanna Sattigeri, Rongjie Lai. NeurIPS 2020
- Key Word: Mode Connectivity; Neuron Alignment; Adversarial Training.
- Digest
  We propose a more general framework to investigate the effect of symmetry on landscape connectivity by accounting for the weight permutations of the networks being connected. To approximate the optimal permutation, we introduce an inexpensive heuristic referred to as neuron alignment. Neuron alignment promotes similarity between the distribution of intermediate activations of models along the curve.
Benign Overfitting and Noisy Features. [paper]
- Zhu Li, Weijie Su, Dino Sejdinovic.
- Key Word: Benign Overfitting; Random Feature Approximation; Deep Double Descent.
- Digest
  We examine the conditions under which Benign Overfitting occurs in the random feature (RF) models, i.e. in a two-layer neural network with fixed first layer weights. We adopt a new view of random feature and show that benign overfitting arises due to the noise which resides in such features (the noise may already be present in the data and propagate to the features or it may be added by the user to the features directly) and plays an important implicit regularization role in the phenomenon.
How benign is benign overfitting? [paper]
- Amartya Sanyal, Puneet K Dokania, Varun Kanade, Philip H.S. Torr. ICLR 2021
- Key Word: Benign Overfitting; Adversarial Robustness.
- Digest
  We investigate two causes for adversarial vulnerability in deep neural networks: bad data and (poorly) trained models. When trained with SGD, deep neural networks essentially achieve zero training error, even in the presence of label noise, while also exhibiting good generalization on natural test data, something referred to as benign overfitting. However, these models are vulnerable to adversarial attacks. We identify label noise as one of the causes for adversarial vulnerability, and provide theoretical and empirical evidence in support of this. Surprisingly, we find several instances of label noise in datasets such as MNIST and CIFAR, and that robustly trained models incur training error on some of these, i.e. they don’t fit the noise.
On the Theory of Transfer Learning: The Importance of Task Diversity. [paper]
- Nilesh Tripuraneni, Michael I. Jordan, Chi Jin. NeurIPS 2020
- Key Word: Transfer Learning; Task Diversity; Generalization Bound.
- Digest
  We introduce a problem-agnostic definition of task diversity which can be integrated into a uniform convergence framework to provide generalization bounds for transfer learning problems with general losses, tasks, and features. Our framework puts this notion of diversity together with a common-design assumption across tasks to provide guarantees of a fast convergence rate, decaying with all of the samples for the transfer learning problem.
Neural Controlled Differential Equations for Irregular Time Series. [paper] [code]
- Patrick Kidger, James Morrill, James Foster, Terry Lyons. NeurIPS 2020
- Key Word: Ordinary Differential Equations.
- Digest
  Neural ordinary differential equations are an attractive option for modelling temporal dynamics. However, a fundamental issue is that the solution to an ordinary differential equation is determined by its initial condition, and there is no mechanism for adjusting the trajectory based on subsequent observations. Here, we demonstrate how this may be resolved through the well-understood mathematics of controlled differential equations.
Finite-sample Analysis of Interpolating Linear Classifiers in the Overparameterized Regime. [paper]
- Niladri S. Chatterji, Philip M. Long. JMLR
- Key Word: Benign Overfitting; Finite-Sample Analysis.
- Digest
  We prove bounds on the population risk of the maximum margin algorithm for two-class linear classification. For linearly separable training data, the maximum margin algorithm has been shown in previous work to be equivalent to a limit of training with logistic loss using gradient descent, as the training error is driven to zero. We analyze this algorithm applied to random data including misclassification noise. Our assumptions on the clean data include the case in which the class-conditional distributions are standard normal distributions. The misclassification noise may be chosen by an adversary, subject to a limit on the fraction of corrupted labels. Our bounds show that, with sufficient over-parameterization, the maximum margin algorithm trained on noisy data can achieve nearly optimal population risk.
Dissecting Neural ODEs. [paper] [code]
- Stefano Massaroli, Michael Poli, Jinkyoo Park, Atsushi Yamashita, Hajime Asama. NeurIPS 2020
- Key Word: Ordinary Differential Equations.
- Digest
  Continuous deep learning architectures have recently re-emerged as Neural Ordinary Differential Equations (Neural ODEs). This infinite-depth approach theoretically bridges the gap between deep learning and dynamical systems, offering a novel perspective. However, deciphering the inner working of these models is still an open challenge, as most applications apply them as generic black-box modules. In this work we "open the box", further developing the continuous-depth formulation with the aim of clarifying the influence of several design choices on the underlying dynamics.
Proving the Lottery Ticket Hypothesis: Pruning is All You Need. [paper]
- Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, Ohad Shamir. ICML 2020
- Key Word: Lottery Ticket Hypothesis.
- Digest
  The lottery ticket hypothesis (Frankle and Carbin, 2018) states that a randomly-initialized network contains a small subnetwork that, when trained in isolation, can compete with the performance of the original network. We prove an even stronger hypothesis (as was also conjectured in Ramanujan et al., 2019), showing that for every bounded distribution and every target network with bounded weights, a sufficiently over-parameterized neural network with random weights contains a subnetwork with roughly the same accuracy as the target network, without any further training.

Others: 2019

Deep Learning via Dynamical Systems: An Approximation Perspective. [paper]
- Qianxiao Li, Ting Lin, Zuowei Shen.
- Key Word: Approximation Theory; Controllability.
- Digest
  We build on the dynamical systems approach to deep learning, where deep residual networks are idealized as continuous-time dynamical systems, from the approximation perspective. In particular, we establish general sufficient conditions for universal approximation using continuous-time deep residual networks, which can also be understood as approximation theories in Lp using flow maps of dynamical systems.
Why bigger is not always better: on finite and infinite neural networks. [paper]
- Laurence Aitchison. ICML 2020
- Key Word: Gradient Dynamics.
- Digest
  We give analytic results characterising the prior over representations and representation learning in finite deep linear networks. We show empirically that the representations in SOTA architectures such as ResNets trained with SGD are much closer to those suggested by our deep linear results than by the corresponding infinite network.
Deep Learning Theory Review: An Optimal Control and Dynamical Systems Perspective. [paper] [code]
- Guan-Horng Liu, Evangelos A. Theodorou.
- Key Word: Mean Field Theory.
- Digest
  We provide one possible way to align existing branches of deep learning theory through the lens of dynamical system and optimal control. By viewing deep neural networks as discrete-time nonlinear dynamical systems, we can analyze how information propagates through layers using mean field theory.
Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks. [paper] [code]
- Yuanzhi Li, Colin Wei, Tengyu Ma. NeurIPS 2019
- Key Word: Regularization.
- Digest
  The key insight in our analysis is that the order of learning different types of patterns is crucial: because the small learning rate model first memorizes easy-to-generalize, hard-to-fit patterns, it generalizes worse on hard-to-generalize, easier-to-fit patterns than its large learning rate counterpart.
Are deep ResNets provably better than linear predictors? [paper]
- Chulhee Yun, Suvrit Sra, Ali Jadbabaie. NeurIPS 2019
- Key Word: ResNets; Local Minima.
- Digest
  We investigate whether local minima of the risk function of a deep ResNet are better than linear predictors. We present two motivating examples showing 1) the advantage of ResNets over fully-connected networks, and 2) the difficulty of analyzing deep ResNets.
Benign Overfitting in Linear Regression. [paper]
- Peter L. Bartlett, Philip M. Long, Gábor Lugosi, Alexander Tsigler. PNAS
- Key Word: Benign Overfitting.
- Digest
  The phenomenon of benign overfitting is one of the key mysteries uncovered by deep learning methodology: deep neural networks seem to predict well, even with a perfect fit to noisy training data. Motivated by this phenomenon, we consider when a perfect fit to training data in linear regression is compatible with accurate prediction. We give a characterization of linear regression problems for which the minimum norm interpolating prediction rule has near-optimal prediction accuracy. The characterization is in terms of two notions of the effective rank of the data covariance. It shows that overparameterization is essential for benign overfitting in this setting: the number of directions in parameter space that are unimportant for prediction must significantly exceed the sample size.
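The minimum norm interpolating rule described in this digest can be sketched in a few lines. The example below is a minimal illustration, not code from the paper: the helper names `solve` and `min_norm_interpolator`, the dimensions `n` and `d`, and the Gaussian data are all assumptions chosen for illustration. It computes theta = X^T (X X^T)^{-1} y for n < d and measures how exactly the rule fits pure-noise labels.

```python
import random

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z

def min_norm_interpolator(X, y):
    """Return theta = X^T (X X^T)^{-1} y, assuming fewer samples than features."""
    gram = [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]
    alpha = solve(gram, y)
    d = len(X[0])
    return [sum(alpha[i] * X[i][j] for i in range(len(X))) for j in range(d)]

# Hypothetical overparameterized problem: 5 samples, 40 features, noise labels.
rng = random.Random(0)
n, d = 5, 40
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
y = [rng.gauss(0.0, 1.0) for _ in range(n)]

theta = min_norm_interpolator(X, y)
# Largest absolute training residual; interpolation means this is near zero.
residual = max(abs(sum(t * x for t, x in zip(theta, row)) - yi)
               for row, yi in zip(X, y))
print("max training residual:", residual)
```

Whether such a perfect fit also predicts well depends on the covariance structure of the data, which is what the effective-rank conditions in the paper characterize; the isotropic Gaussian design here only demonstrates the interpolation itself.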
Invariance-inducing regularization using worst-case transformations suffices to boost accuracy and spatial robustness. [paper]
- Fanny Yang, Zuowen Wang, Christina Heinze-Deml. NeurIPS 2019
- Key Word: Robustness; Regularization.
- Digest
  This work provides theoretical and empirical evidence that invariance-inducing regularizers can increase predictive accuracy for worst-case spatial transformations (spatial robustness). Evaluated on these adversarially transformed examples, we demonstrate that adding regularization on top of standard or adversarial training reduces the relative error by 20% for CIFAR10 without increasing the computational cost.
Augmented Neural ODEs. [paper] [code]
- Emilien Dupont, Arnaud Doucet, Yee Whye Teh. NeurIPS 2019
- Key Word: Ordinary Differential Equations.
- Digest
  We show that Neural Ordinary Differential Equations (ODEs) learn representations that preserve the topology of the input space and prove that this implies the existence of functions Neural ODEs cannot represent. To address these limitations, we introduce Augmented Neural ODEs which, in addition to being more expressive models, are empirically more stable, generalize better and have a lower computational cost than Neural ODEs.
On the Power and Limitations of Random Features for Understanding Neural Networks. [paper]
- Gilad Yehudai, Ohad Shamir.
- Key Word: Random Features.
- Digest
  Recently, a spate of papers have provided positive theoretical results for training over-parameterized neural networks (where the network size is larger than what is needed to achieve low error). The key insight is that with sufficient over-parameterization, gradient-based methods will implicitly leave some components of the network relatively unchanged, so the optimization dynamics will behave as if those components are essentially fixed at their initial random values. In fact, fixing these explicitly leads to the well-known approach of learning with random features. In other words, these techniques imply that we can successfully learn with neural networks, whenever we can successfully learn with random features. In this paper, we first review these techniques, providing a simple and self-contained analysis for one-hidden-layer networks.
Mean Field Analysis of Deep Neural Networks. [paper]
- Justin Sirignano, Konstantinos Spiliopoulos.
- Key Word: Mean Field Theory.
- Digest
  We analyze multi-layer neural networks in the asymptotic regime of simultaneously (A) large network sizes and (B) large numbers of stochastic gradient descent training iterations. We rigorously establish the limiting behavior of the multi-layer neural network output. The limit procedure is valid for any number of hidden layers and it naturally also describes the limiting behavior of the training loss.
Machine learning meets quantum physics. [paper] [book]
- Sankar Das Sarma, Dong-Ling Deng, Lu-Ming Duan.
- Key Word: Physics-based Machine Learning; Quantum Physics; Quantum Chemistry.
- Digest
  The marriage of machine learning and quantum physics may give birth to a new research frontier that could transform both.
A Mean Field Theory of Batch Normalization. [paper]
- Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz. ICLR 2019
- Key Word: Mean Field Theory.
- Digest
  We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. Our theory shows that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function.
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent. [paper] [code]
- Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, Jeffrey Pennington. NeurIPS 2019
- Key Word: Mean Field Theory.
- Digest
  We show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel.
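The first-order Taylor expansion mentioned in this digest can be written out explicitly. The notation below (scalar-output network f, initial parameters \theta_0, empirical kernel \Theta_0) is an assumed convention for illustration and may differ from the symbols used in the paper.

```latex
f^{\mathrm{lin}}(x;\theta)
  = f(x;\theta_0) + \nabla_\theta f(x;\theta_0)^{\top}\,(\theta - \theta_0),
\qquad
\Theta_0(x, x')
  = \nabla_\theta f(x;\theta_0)^{\top}\,\nabla_\theta f(x';\theta_0).
```

Because f^{lin} is linear in \theta, gradient descent on a squared loss reduces to linear dynamics governed by the kernel \Theta_0 evaluated on the training inputs.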
Superposition of many models into one. [paper] [code]
- Brian Cheung, Alex Terekhov, Yubei Chen, Pulkit Agrawal, Bruno Olshausen. NeurIPS 2019
- Key Word: Parameter Superposition; Catastrophic Forgetting.
- Digest
  We present a method for storing multiple models within a single set of parameters. Models can coexist in superposition and still be retrieved individually. In experiments with neural networks, we show that a surprisingly large number of models can be effectively stored within a single parameter instance. Furthermore, each of these models can undergo thousands of training steps without significantly interfering with other models within the superposition. This approach may be viewed as the online complement of compression: rather than reducing the size of a network after training, we make use of the unrealized capacity of a network during training.
On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points. [paper]
- Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, Michael I. Jordan. Journal of the ACM
- Key Word: Gradient Descent; Saddle Points.
- Digest
  Traditional analyses of GD and SGD show that both algorithms converge to stationary points efficiently. But these analyses do not take into account the possibility of converging to saddle points. More recent theory has shown that GD and SGD can avoid saddle points, but the dependence on dimension in these analyses is polynomial. For modern machine learning, where the dimension can be in the millions, such dependence would be catastrophic. We analyze perturbed versions of GD and SGD and show that they are truly efficient: their dimension dependence is only polylogarithmic. Indeed, these algorithms converge to second-order stationary points in essentially the same time as they take to converge to classical first-order stationary points.
Escaping Saddle Points with Adaptive Gradient Methods. [paper]
- Matthew Staib, Sashank J. Reddi, Satyen Kale, Sanjiv Kumar, Suvrit Sra. ICML 2019
- Key Word: Gradient Descent; Saddle Points.
- Digest
  We seek a crisp, clean and precise characterization of the behavior of adaptive gradient methods in nonconvex settings. To this end, we first provide a novel view of adaptive methods as preconditioned SGD, where the preconditioner is estimated in an online manner. By studying the preconditioner on its own, we elucidate its purpose: it rescales the stochastic gradient noise to be isotropic near stationary points, which helps escape saddle points.

Others: 2018

A Spline Theory of Deep Learning. [paper]
- Randall Balestriero, Richard G. Baraniuk. ICML 2018
- Key Word: Approximation Theory.
- Digest
  We build a rigorous bridge between deep networks (DNs) and approximation theory via spline functions and operators. Our key result is that a large class of DNs can be written as a composition of max-affine spline operators (MASOs), which provide a powerful portal through which to view and analyze their inner workings.
On Lazy Training in Differentiable Programming. [paper] [code]
- Lenaic Chizat, Edouard Oyallon, Francis Bach. NeurIPS 2019
- Key Word: Lazy Training.
- Digest
  In a series of recent theoretical works, it was shown that strongly over-parameterized neural networks trained with gradient-based methods could converge exponentially fast to zero training loss, with their parameters hardly varying. In this work, we show that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling, often implicit, that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels. Through a theoretical analysis, we exhibit various situations where this phenomenon arises in non-convex optimization and we provide bounds on the distance between the lazy and linearized optimization paths.
Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. [paper] [code]
- Matthias Hein, Maksym Andriushchenko, Julian Bitterwolf. CVPR 2019
- Key Word: ReLU; Adversarial Example.
- Digest
  Classifiers used in safety-critical settings should make low-confidence predictions far away from the training data. We show that ReLU-type neural networks, which yield a piecewise linear classifier function, fail in this regard, as they almost always produce high-confidence predictions far away from the training data. For bounded domains like images, we propose a new robust optimization technique, similar to adversarial training, which enforces low-confidence predictions far away from the training data.
Gradient Descent Finds Global Minima of Deep Neural Networks. [paper]
- Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, Xiyu Zhai. ICML 2019
- Key Word: Gradient Descent; Gradient Dynamics.
- Digest
  Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm.
Memorization in Overparameterized Autoencoders. [paper]
- Adityanarayanan Radhakrishnan, Karren Yang, Mikhail Belkin, Caroline Uhler.
- Key Word: Autoencoders; Memorization.
- Digest
  We show that overparameterized autoencoders exhibit memorization, a form of inductive bias that constrains the functions learned through the optimization process to concentrate around the training examples, although the network could in principle represent a much larger function class. In particular, we prove that single-layer fully-connected autoencoders project data onto the (nonlinear) span of the training examples.
Information Geometry of Orthogonal Initializations and Training. [paper]
- Piotr A. Sokol, Il Memming Park. ICLR 2020
- Key Word: Mean Field Theory; Information Geometry.
- Digest
  We show a novel connection between the maximum curvature of the optimization landscape (gradient smoothness) as measured by the Fisher information matrix (FIM) and the spectral radius of the input-output Jacobian, which partially explains why more isometric networks can train much faster.
Gradient Descent Provably Optimizes Over-parameterized Neural Networks. [paper]
- Simon S. Du, Xiyu Zhai, Barnabas Poczos, Aarti Singh. ICLR 2019
- Key Word: Gradient Descent; Gradient Dynamics.
- Digest
  One of the mysteries in the success of neural networks is that randomly initialized first-order methods like gradient descent can achieve zero training loss even though the objective function is non-convex and non-smooth. This paper demystifies this surprising phenomenon for two-layer fully connected ReLU-activated neural networks. For a shallow neural network with m hidden nodes, ReLU activation, and n training data points, we show that as long as m is large enough and no two inputs are parallel, randomly initialized gradient descent converges to a globally optimal solution at a linear convergence rate for the quadratic loss function.
Dynamical Isometry is Achieved in Residual Networks in a Universal Way for any Activation Function. [paper]
- Wojciech Tarnowski, Piotr Warchoł, Stanisław Jastrzębski, Jacek Tabor, Maciej A. Nowak. AISTATS 2019
- Key Word: Mean Field Theory.
- Digest
  We demonstrate that in residual neural networks (ResNets) dynamical isometry is achievable irrespective of the activation function used. We do that by deriving, with the help of Free Probability and Random Matrix Theories, a universal formula for the spectral density of the input-output Jacobian at initialization, in the large network width and depth limit.
Mean Field Analysis of Neural Networks: A Central Limit Theorem. [paper]
- Justin Sirignano, Konstantinos Spiliopoulos.
- Key Word: Mean Field Theory.
- Digest
  We rigorously prove a central limit theorem for neural network models with a single hidden layer. The central limit theorem is proven in the asymptotic regime of simultaneously (A) large numbers of hidden units and (B) large numbers of stochastic gradient descent training iterations. Our result describes the neural network's fluctuations around its mean-field limit. The fluctuations have a Gaussian distribution and satisfy a stochastic partial differential equation.
An elementary introduction to information geometry. [paper]
- Frank Nielsen.
- Key Word: Survey; Information Geometry.
- Digest
  In this survey, we describe the fundamental differential-geometric structures of information manifolds, state the fundamental theorem of information geometry, and illustrate some use cases of these information manifolds in information sciences. The exposition is self-contained by concisely introducing the necessary concepts of differential geometry, but proofs are omitted for brevity.
Deep Convolutional Networks as shallow Gaussian Processes. [paper] [code]
- Adrià Garriga-Alonso, Carl Edward Rasmussen, Laurence Aitchison. ICLR 2019
- Key Word: Gaussian Process.
- Digest
  We show that the output of a (residual) convolutional neural network (CNN) with an appropriate prior over the weights and biases is a Gaussian process (GP) in the limit of infinitely many convolutional filters, extending similar results for dense networks. For a CNN, the equivalent kernel can be computed exactly and, unlike "deep kernels", has very few parameters: only the hyperparameters of the original CNN.
Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data. [paper]
- Yuanzhi Li, Yingyu Liang. NeurIPS 2018
- Key Word: Stochastic Gradient Descent.
- Digest
  Neural networks have many successful applications, while much less theoretical understanding has been gained. Towards bridging this gap, we study the problem of learning a two-layer overparameterized ReLU neural network for multi-class classification via stochastic gradient descent (SGD) from random initialization. In the overparameterized setting, when the data comes from mixtures of well-separated distributions, we prove that SGD learns a network with a small generalization error, even though the network has enough capacity to fit arbitrary labels.
Neural Ordinary Differential Equations. [paper] [code]
- Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, David Duvenaud. NeurIPS 2018
- Key Word: Ordinary Differential Equations; Normalizing Flow.
- Digest
  We introduce a new family of deep neural network models. Instead of specifying a discrete sequence of hidden layers, we parameterize the derivative of the hidden state using a neural network. We also construct continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, we show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models.
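The idea of parameterizing the derivative of the hidden state can be illustrated with a toy forward pass. This is a minimal sketch under stated assumptions, not the paper's implementation: it uses a fixed-step forward Euler solver instead of an adaptive black-box solver, omits training entirely, and the names `odeint_euler` and `make_tanh_field` as well as the weights are hypothetical.

```python
import math

def odeint_euler(f, h0, t0, t1, steps):
    """Integrate dh/dt = f(h, t) from t0 to t1 with fixed-step forward Euler."""
    h = list(h0)
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        dh = f(h, t)
        # One Euler step has the same form as a residual block: h + dt * f(h).
        h = [hi + dt * di for hi, di in zip(h, dh)]
        t += dt
    return h

def make_tanh_field(W, b):
    """Vector field f(h, t) = tanh(W h + b), a one-layer stand-in for learned dynamics."""
    def field(h, t):
        return [math.tanh(sum(w * hi for w, hi in zip(row, h)) + bi)
                for row, bi in zip(W, b)]
    return field

# Hypothetical weights; in a real model these would be trained.
field = make_tanh_field(W=[[0.0, -1.0], [1.0, 0.0]], b=[0.0, 0.0])
h1 = odeint_euler(field, h0=[1.0, 0.0], t0=0.0, t1=1.0, steps=100)
print("hidden state at t=1:", h1)
```

The "depth" of the model is the integration interval rather than a layer count, so accuracy and cost are traded off through the solver (here, the `steps` argument).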
Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks. [paper] [code]
- Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, Jeffrey Pennington. ICML 2018
- Key Word: Mean Field Theory.
- Digest
  We demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme. We derive this initialization scheme theoretically by developing a mean field theory for signal propagation and by characterizing the conditions for dynamical isometry, the equilibration of singular values of the input-output Jacobian matrix.
Universal Statistics of Fisher Information in Deep Neural Networks: Mean Field Approach. [paper]
- Ryo Karakida, Shotaro Akaho, Shun-ichi Amari. AISTATS 2019
- Key Word: Mean Field Theory; Fisher Information.
- Digest
  The Fisher information matrix (FIM) is a fundamental quantity to represent the characteristics of a stochastic model, including deep neural networks (DNNs). The present study reveals novel statistics of FIM that are universal among a wide class of DNNs. To this end, we use random weights and large width limits, which enables us to utilize mean field theories. We investigate the asymptotic statistics of the FIM's eigenvalues and reveal that most of them are close to zero while the maximum eigenvalue takes a huge value.
Towards Understanding the Role of Over-Parametrization in Generalization of Neural Networks. [paper] [code]
- Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, Nathan Srebro. ICLR 2019
- Key Word: Over-Parametrization.
- Digest
  We suggest a novel complexity measure based on unit-wise capacities resulting in a tighter generalization bound for two layer ReLU networks. Our capacity bound correlates with the behavior of test error with increasing network sizes (within the range reported in the experiments), and could partly explain the improvement in generalization with over-parametrization.
Understanding Generalization and Optimization Performance of Deep CNNs. [paper]
- Pan Zhou, Jiashi Feng. ICML 2018
- Key Word: Generalization of CNNs.
- Digest
  We make multiple contributions toward understanding deep CNNs theoretically. To the best of our knowledge, this work presents the first theoretical guarantees on both a generalization error bound without exponential growth over network depth and optimization performance for deep CNNs.
Geometric Understanding of Deep Learning. [paper]
- Na Lei, Zhongxuan Luo, Shing-Tung Yau, David Xianfeng Gu.
- Key Word: Manifold Representation; Learning Capability; Latent Probability Distribution Control.
- Digest
  In this work, we give a geometric view of deep learning: we show that the fundamental principle underlying its success is the manifold structure in data. Natural high-dimensional data concentrates close to a low-dimensional manifold, and deep learning learns the manifold and the probability distribution on it.
Gaussian Process Behaviour in Wide Deep Neural Networks. [paper] [code]
- Alexander G. de G. Matthews, Mark Rowland, Jiri Hron, Richard E. Turner, Zoubin Ghahramani. ICLR 2018
- Key Word: Gaussian Process.
- Digest
  We study the relationship between random, wide, fully connected, feedforward networks with more than one hidden layer and Gaussian processes with a recursive kernel definition. We show that, under broad conditions, as we make the architecture increasingly wide, the implied random function converges in distribution to a Gaussian process, formalising and extending existing results by Neal (1996) to deep networks.
How to Start Training: The Effect of Initialization and Architecture. [paper]
- Boris Hanin, David Rolnick. NeurIPS 2018
- Key Word: Neuron Activation; Weight Initialization.
- Digest
  We identify and study two common failure modes for early training in deep ReLU nets. The first failure mode, exploding/vanishing mean activation length, can be avoided by initializing weights from a symmetric distribution with variance 2/fan-in and, for ResNets, by correctly weighting the residual modules. We prove that the second failure mode, exponentially large variance of activation length, never occurs in residual nets once the first failure mode is avoided.
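The first failure mode can be reproduced numerically. The sketch below is an illustration under assumptions, not the authors' experiment: it uses Gaussian weights, equal layer widths, no biases, and a hypothetical helper `mean_squared_activation`. It pushes one random input through a deep ReLU network and reports the squared activation length per unit, once with weight variance 2/fan-in and once with 1/fan-in.

```python
import math
import random

def mean_squared_activation(depth, width, weight_var_scale, seed=0):
    """Propagate a random input through a bias-free ReLU net.

    Weights are drawn i.i.d. from N(0, weight_var_scale / fan_in).
    Returns ||x_L||^2 / width for the final layer activations x_L.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = math.sqrt(weight_var_scale / width)  # fan-in equals width here
    for _ in range(depth):
        pre = [sum(rng.gauss(0.0, std) * xi for xi in x) for _ in range(width)]
        x = [max(0.0, p) for p in pre]  # ReLU
    return sum(v * v for v in x) / width

he = mean_squared_activation(depth=20, width=64, weight_var_scale=2.0)
naive = mean_squared_activation(depth=20, width=64, weight_var_scale=1.0)
print("variance 2/fan-in:", he)
print("variance 1/fan-in:", naive)
```

With variance 1/fan-in, each ReLU layer halves the expected squared length, so the signal shrinks geometrically with depth; the factor 2 compensates for the half of the pre-activations that ReLU zeroes out.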
The Emergence of Spectral Universality in Deep Networks. [paper]
- Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli. AISTATS 2018
- Key Word: Mean Field Theory.
- Digest
  We leverage powerful tools from free probability theory to provide a detailed analytic understanding of how a deep network's Jacobian spectrum depends on various hyperparameters including the nonlinearity, the weight and bias distributions, and the depth. For a variety of nonlinearities, our work reveals the emergence of new universal limiting spectral distributions that remain concentrated around one even as the depth goes to infinity.
Generalization in Machine Learning via Analytical Learning Theory. [paper] [code]
- Kenji Kawaguchi, Yoshua Bengio, Vikas Verma, Leslie Pack Kaelbling.
- Key Word: Regularization; Measure Theory.
- Digest
  This paper introduces a novel measure-theoretic theory for machine learning that does not require statistical assumptions. Based on this theory, a new regularization method in deep learning is derived and shown to outperform previous methods in CIFAR-10, CIFAR-100, and SVHN. Moreover, the proposed theory provides a theoretical basis for a family of practically successful regularization methods in deep learning. We discuss several consequences of our results on one-shot learning, representation learning, deep learning, and curriculum learning. Unlike statistical learning theory, the proposed learning theory analyzes each problem instance individually via measure theory, rather than a set of problem instances via statistics. As a result, it provides different types of results and insights when compared to statistical learning theory.
Stronger generalization bounds for deep nets via a compression approach. [paper]
- Sanjeev Arora, Rong Ge, Behnam Neyshabur, Yi Zhang. ICML 2018
- Key Word: PAC-Bayes; Compression-Based Generalization Bound.
- Digest
  We present a simple compression framework for proving generalization bounds, perhaps a more explicit and intuitive form of the PAC-Bayes work. It also yields elementary short proofs of recent generalization results.
Which Neural Net Architectures Give Rise To Exploding and Vanishing Gradients? [paper]
- Boris Hanin. NeurIPS 2018
- Key Word: Network Architectures.
- Digest
  We give a rigorous analysis of the statistical behavior of gradients in a randomly initialized fully connected network N with ReLU activations. Our results show that the empirical variance of the squares of the entries in the input-output Jacobian of N is exponential in a simple architecture-dependent constant beta, given by the sum of the reciprocals of the hidden layer widths.
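The architecture-dependent constant beta from this digest is easy to compute for a given design. The snippet below is a small illustration; the function name `beta` and the two example architectures are assumptions chosen for comparison, not taken from the paper.

```python
def beta(hidden_widths):
    """Sum of the reciprocals of the hidden layer widths."""
    return sum(1.0 / n for n in hidden_widths)

# Two hypothetical architectures with the same total number of hidden units.
wide_shallow = [50] * 10   # 10 hidden layers of width 50
narrow_deep = [10] * 50    # 50 hidden layers of width 10

print("wide and shallow:", beta(wide_shallow))
print("narrow and deep: ", beta(narrow_deep))
```

Since the gradient statistics are exponential in beta, the narrow and deep design is predicted to have far less stable gradients at initialization than the wide and shallow one, even though both use 500 hidden units.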

Others: 2017

Mean Field Residual Networks: On the Edge of Chaos. [paper]
- Greg Yang, Samuel S. Schoenholz. NeurIPS 2017
- Key Word: Mean Field Theory.
- Digest
  Classical feedforward neural networks exhibit exponential behavior on average when propagating inputs forward or gradients backward. The exponential forward dynamics cause rapid collapse of the input space geometry, while the exponential backward dynamics cause drastic vanishing or exploding gradients. We show, in contrast, that by adding skip connections, the network will, depending on the nonlinearity, adopt subexponential forward and backward dynamics, which in many cases are in fact polynomial.
Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. [paper]
- Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli. NeurIPS 2017
- Key Word: Mean Field Theory.
- Digest
  We explore how the singular value distribution of a deep network's input-output Jacobian depends on the depth of the network, the weight initialization, and the choice of nonlinearity. Intriguingly, we find that ReLU networks are incapable of dynamical isometry. On the other hand, sigmoidal networks can achieve isometry, but only with orthogonal weight initialization. Moreover, we demonstrate empirically that deep nonlinear networks achieving dynamical isometry learn orders of magnitude faster than networks that do not.
Deep Neural Networks as Gaussian Processes. [paper]
- Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein. ICLR 2018
- Key Word: Gaussian Process.
- Digest
  In this work, we derive the exact equivalence between infinitely wide deep networks and GPs. We further develop a computationally efficient pipeline to compute the covariance function for these GPs.
Maximum Principle Based Algorithms for Deep Learning. [paper]
- Qianxiao Li, Long Chen, Cheng Tai, Weinan E. JMLR
- Key Word: Optimal Control; Pontryagin’s Maximum Principle.
- Digest
  We discuss the viewpoint that deep residual neural networks can be viewed as discretization of a continuous-time dynamical system, and hence supervised deep learning can be regarded as solving an optimal control problem in continuous time.
When is a Convolutional Filter Easy To Learn? [paper]
- Simon S. Du, Jason D. Lee, Yuandong Tian. ICLR 2018
- Key Word: Gradient Descent.
- Digest
  We show that (stochastic) gradient descent with random initialization can learn the convolutional filter in polynomial time and the convergence rate depends on the smoothness of the input distribution and the closeness of patches. To the best of our knowledge, this is the first recovery guarantee of gradient-based algorithms for convolutional filter on non-Gaussian input distributions.
Implicit Regularization in Deep Learning. [paper]
- Behnam Neyshabur. PhD Thesis
- Key Word: Implicit Regularization.
- Digest
  In an attempt to better understand generalization in deep learning, we study several possible explanations. We show that implicit regularization induced by the optimization method is playing a key role in generalization and success of deep learning models. Motivated by this view, we study how different complexity measures can ensure generalization and explain how optimization algorithms can implicitly regularize complexity measures.
Exploring Generalization in Deep Learning. [paper] [code]
- Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, Nathan Srebro. NeurIPS 2017
- Key Word: PAC-Bayes.
- Digest
  With a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness. We study how these measures can ensure generalization, highlighting the importance of scale normalization, and making a connection between sharpness and PAC-Bayes theory. We then investigate how well the measures explain different observed phenomena.
Gradient Descent Can Take Exponential Time to Escape Saddle Points. [paper]
- Simon S. Du, Chi Jin, Jason D. Lee, Michael I. Jordan, Barnabas Poczos, Aarti Singh. NeurIPS 2017
- Key Word: Gradient Descent; Saddle Points.
- Digest
  We established the failure of gradient descent to efficiently escape saddle points for general non-convex smooth functions. We showed that even under a very natural initialization scheme, gradient descent can require exponential time to converge to a local minimum whereas perturbed gradient descent converges in polynomial time. Our results demonstrate the necessity of adding perturbations for efficient non-convex optimization.
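
A toy example can make the role of perturbations concrete. The sketch below uses an assumed test function f(x, y) = x²/2 + y⁴/4 − y²/2, which has a saddle at the origin and minima at (0, ±1), and starts gradient descent exactly on the saddle's stable manifold. This is a deliberately simple setup, not the paper's construction, and the perturbation rule (Gaussian noise whenever the gradient is small) is a simplified stand-in for perturbed gradient descent.

```python
import numpy as np

def grad(p):
    """Gradient of f(x, y) = x^2/2 + y^4/4 - y^2/2."""
    x, y = p
    return np.array([x, y**3 - y])

def descend(p0, lr=0.1, steps=500, noise=0.0, seed=0):
    """Gradient descent; if noise > 0, perturb whenever the gradient is tiny."""
    rng = np.random.default_rng(seed)
    p = np.array(p0, dtype=float)
    for _ in range(steps):
        if noise > 0 and np.linalg.norm(grad(p)) < 1e-3:
            p = p + noise * rng.standard_normal(2)   # small random kick
        p = p - lr * grad(p)
    return p

plain = descend([1.0, 0.0])                 # slides along y = 0 into the saddle
perturbed = descend([1.0, 0.0], noise=1e-2)

print("plain GD ends at    ", plain)
print("perturbed GD ends at", perturbed)
```

Plain gradient descent never leaves the line y = 0 and stalls at the saddle, while the perturbed variant is pushed off it and settles near one of the two minima.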
How to Escape Saddle Points Efficiently. [paper]
- Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, Michael I. Jordan. ICML 2017
- Key Word: Gradient Descent; Saddle Points.
- Digest
  This paper presents the first (nearly) dimension-free result for gradient descent in a general nonconvex setting. We present a general convergence result and show how it can be further strengthened when combined with further structure such as strict saddle conditions and/or local regularity/convexity.

Others: 2016

Understanding Deep Neural Networks with Rectified Linear Units. [paper]
- Raman Arora, Amitabh Basu, Poorya Mianjy, Anirbit Mukherjee. ICLR 2018
- Key Word: ReLU.
- Digest
  In this paper we investigate the family of functions representable by deep neural networks (DNN) with rectified linear units (ReLU). We give an algorithm to train a ReLU DNN with one hidden layer to *global optimality* with runtime polynomial in the data size albeit exponential in the input dimension. Further, we improve on the known lower bounds on size (from exponential to super exponential) for approximating a ReLU deep net function by a shallower ReLU net.
Deep Information Propagation. [paper]
- Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, Jascha Sohl-Dickstein. ICLR 2017
- Key Word: Mean Field Theory.
- Digest
  We study the behavior of untrained neural networks whose weights and biases are randomly distributed using mean field theory. We show the existence of depth scales that naturally limit the maximum depth of signal propagation through these random networks. Our main practical result is to show that random networks may be trained precisely when information can travel through them. Thus, the depth scales that we identify provide bounds on how deep a network may be trained for a specific choice of hyperparameters.
Why Deep Neural Networks for Function Approximation? [paper]
- Shiyu Liang, R. Srikant. ICLR 2017
- Key Word: Function Approximation.
- Digest
  Recently there has been much interest in understanding why deep neural networks are preferred to shallow networks. We show that, for a large class of piecewise smooth functions, the number of neurons needed by a shallow network to approximate a function is exponentially larger than the corresponding number of neurons needed by a deep network for a given degree of function approximation. First, we consider univariate functions on a bounded interval and require a neural network to achieve an approximation error of ε uniformly over the interval. We show that shallow networks (i.e., networks whose depth does not depend on ε) require Ω(poly(1/ε)) neurons while deep networks (i.e., networks whose depth grows with 1/ε) require O(polylog(1/ε)) neurons.
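
The benefit of depth can be illustrated on the single function x² over [0, 1]. The sketch below uses a well-known sawtooth construction built from composed ReLU "tent" maps; it is an assumption chosen for brevity and not necessarily the construction used in the paper. Each additional layer of three ReLU units cuts the uniform error by a factor of 4, so reaching error ε takes a depth of order log(1/ε).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def tent(x):
    """Tent map on [0, 1] built from three ReLU units: 2x, then 2(1 - x)."""
    return 2 * relu(x) - 4 * relu(x - 0.5) + 2 * relu(x - 1.0)

def deep_square(x, depth):
    """Approximate x^2 on [0, 1] with `depth` composed tent layers."""
    out = np.array(x, dtype=float)
    g = np.array(x, dtype=float)
    for s in range(1, depth + 1):
        g = tent(g)               # s-fold composition: a sawtooth with 2^(s-1) teeth
        out = out - g / 4**s
    return out

xs = np.linspace(0.0, 1.0, 10001)
errors = [float(np.max(np.abs(xs**2 - deep_square(xs, m)))) for m in range(1, 7)]
for m, err in enumerate(errors, start=1):
    print(f"depth {m}: max error {err:.2e} (bound {4.0 ** -(m + 1):.2e})")
```

The network of depth m reproduces the piecewise linear interpolant of x² on a grid of 2^m intervals, which is why the error is bounded by 4^-(m+1).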
Why does deep and cheap learning work so well? [paper]
- Henry W. Lin, Max Tegmark, David Rolnick. Journal of Statistical Physics
- Key Word: Physics.
- Digest
  We show how the success of deep learning could depend not only on mathematics but also on physics: although well-known mathematical theorems guarantee that neural networks can approximate arbitrary functions well, the class of functions of practical interest can frequently be approximated through "cheap learning" with exponentially fewer parameters than generic ones. We explore how properties frequently encountered in physics such as symmetry, locality, compositionality, and polynomial log-probability translate into exceptionally simple neural networks.
Exponential expressivity in deep neural networks through transient chaos. [paper] [code]
- Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, Surya Ganguli. NeurIPS 2016
- Key Word: Mean Field Theory; Riemannian Geometry.
- Digest
  We combine Riemannian geometry with the mean field theory of high dimensional chaos to study the nature of signal propagation in deep neural networks with random weights. Our results reveal a phase transition in the expressivity of random deep networks, with networks in the chaotic phase computing nonlinear functions whose global curvature grows exponentially with depth, but not with width. We prove that this generic class of random functions cannot be efficiently computed by any shallow network, going beyond prior work that restricts their analysis to single functions.