phamvanlinh143/awesome-very-deep-learning

awesome-very-deep-learning is a curated list for papers and code about implementing and training very deep neural networks.

Neural Ordinary Differential Equations

ODE Networks are a kind of continuous-depth neural network. Instead of specifying a discrete sequence of hidden layers, they parameterize the derivative of the hidden state using a neural network. The output of the network is computed using a black-box differential equation solver. These continuous-depth models have constant memory cost, adapt their evaluation strategy to each input, and can explicitly trade numerical precision for speed.

Papers

Neural Ordinary Differential Equations (2018) [original code], introduces several ODENets such as continuous-depth residual networks and continuous-time latent variable models. The paper also constructs continuous normalizing flows, a generative model that can train by maximum likelihood, without partitioning or ordering the data dimensions. For training, the authors show how to scalably backpropagate through any ODE solver, without access to its internal operations. This allows end-to-end training of ODEs within larger models. NIPS 2018 best paper.
Augmented Neural ODEs (2019), neural ODEs preserve topology, thus their learned flows can't intersect with each other. Therefore some functions can't be learned. Augmented NODEs improve upon this by adding an additional dimension to learn simpler flows.

Implementations

Authors Autograd Implementation

Value Iteration Networks

Value Iteration Networks are very deep networks that have tied weights and perform approximate value iteration. They are used as an internal (model-based) planning module.

Papers

Value Iteration Networks (2016) [original code], introduces VINs (Value Iteration Networks). The author shows that one can perform value iteration using iterative usage of convolutions and channel-wise pooling. It is able to generalize better in environments where a network needs to plan. NIPS 2016 best paper.

Densely Connected Convolutional Networks

Densely Connected Convolutional Networks are very deep neural networks consisting of dense blocks. Within dense blocks, each layer receives the feature maps of all preceding layers. This leverages feature reuse and thus substantially reduces the model size (parameters).

Papers

Densely Connected Convolutional Networks (2016) [original code], introduces DenseNets and shows that it outperforms ResNets in CIFAR10 and 100 by a large margin (especially when not using data augmentation), while only requiring half the parameters. CVPR 2017 best paper.

Implementations

Authors' Caffe Implementation
Authors' more memory-efficient Torch Implementation.
Tensorflow Implementation by Yixuan Li.
Tensorflow Implementation by Laurent Mazare.
Lasagne Implementation by Jan Schlüter.
Keras Implementation by tdeboissiere.
Keras Implementation by Roberto de Moura Estevão Filho.
Chainer Implementation by Toshinori Hanya.
Chainer Implementation by Yasunori Kudo.
PyTorch Implementation (including BC structures) by Andreas Veit
PyTorch Implementation

Deep Residual Learning

Deep Residual Networks are a family of extremely deep architectures (up to 1000 layers) showing compelling accuracy and nice convergence behaviors. Instead of learning a new representation at each layer, deep residual networks use identity mappings to learn residuals.

Papers

The Reversible Residual Network: Backpropagation Without Storing Activations [code] constructs reversible residual layers (no need to store activations) and surprisingly finds out that reversible layers don't impact final performance.
Squeeze-and-Excitation Networks [original code], introduces Squeeze-and-Excitation (SE) block, that adaptively recalibrates channel-wise feature responses. It achieved the 1st place on ILSVRC17.
Aggregated Residual Transformation for Deep Neural Networks (2016), introduces ResNeXt, which aggregates a set of transformations within a a res-block. It achieved the 2nd place on ILSVRC16.
Residual Networks of Residual Networks: Multilevel Residual Networks (2016), adds multi-level hierarchical residual mappings and shows that this improves the accuracy of deep networks
Wide Residual Networks (2016) [orginal code], studies wide residual neural networks and shows that making residual blocks wider outperforms deeper and thinner network architectures
Swapout: Learning an ensemble of deep architectures (2016), improving accuracy by randomly applying dropout, skipforward and residual units per layer
Deep Networks with Stochastic Depth (2016) [original code], dropout with residual layers as regularizer
Identity Mappings in Deep Residual Networks (2016) [original code], improving the original proposed residual units by reordering batchnorm and activation layers
Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning (2016), inception network with residual connections
Deep Residual Learning for Image Recognition (2015) [original code], original paper introducing residual neural networks

Implementations

Torch by Facebook AI Research (FAIR), with training code in Torch and pre-trained ResNet-18/34/50/101 models for ImageNet: blog, code
Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code
Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code
Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code
Neon, Preactivation layer implementation: code
Torch, MNIST, 100 layers: blog, code
A winning entry in Kaggle's right whale recognition challenge: blog, code
Neon, Place2 (mini), 40 layers: blog, code
Tensorflow with tflearn, with CIFAR-10 and MNIST: code
Tensorflow with skflow, with MNIST: code
Stochastic dropout in Keras: code
ResNet in Chainer: code
Stochastic dropout in Chainer: code
Wide Residual Networks in Keras: code
ResNet in TensorFlow 0.9+ with pretrained caffe weights: code
ResNet in PyTorch: code
Ladder Network for Semi-Supervised Learning in Keras : code

In addition, this code by Ryan Dahl helps to convert the pre-trained models to TensorFlow.

Highway Networks

Highway Networks take inspiration from Long Short Term Memory (LSTM) and allow training of deep, efficient networks (with hundreds of layers) with conventional gradient-based methods

Papers

Recurrent Highway Networks (2016) [original code], introducing recurrent highway networks, which increases space depth in recurrent networks
Training Very Deep Networks (2015), introducing highway neural networks

Implementations

Lasagne: code
Caffe: code
Torch: code
Tensorflow: blog, code
PyTorch: code

Very Deep Learning Theory

Theories in very deep learning concentrate on the ideas that very deep networks with skip connections are able to efficiently approximate recurrent computations (similar to the recurrent connections in the visual cortex) or are actually exponential ensembles of shallow networks

Papers

Identity Matters in Deep Learning considers identity parameterizations from a theoretical perspective and proofs that arbitrarily deep linear residual networks have no spurious local optima
The Shattered Gradients Problem: If resnets are the answer, then what is the question? argues that gradients of very deep networks resemble white noise (thus are harder to optimize). Resnets are more resistant to shattering (decaying sublinearly)
Skip Connections as Effective Symmetry-Breaking hypothesizes that ResNets improve performance by breaking symmetries
Highway and Residual Networks learn Unrolled Iterative Estimation, argues that instead of learning a new representation at each layer, the layers within a stage rather work as an iterative refinement of the same features.
Demystifying ResNet, shows mathematically that 2-shortcuts in ResNets achieves the best results because they have non-degenerate depth-invariant initial condition numbers (in comparison to 1 or 3-shortcuts), making it easy for the optimisation algorithm to escape from the initial point.
Wider or Deeper? Revisiting the ResNet Model for Visual Recognition, extends results from Veit et al. and shows that it is actually a linear ensemble of subnetworks. Wide ResNet work well, because current very deep networks are actually over-deepened (hence not trained end-to-end), due to the much shorter effective path length.
Residual Networks are Exponential Ensembles of Relatively Shallow Networks, shows that ResNets behaves just like ensembles of shallow networks in test time. This suggests that in addition to describing neural networks in terms of width and depth, there is a third dimension: multiplicity, the size of the implicit ensemble
Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex, shows that ResNets with shared weights work well too although having fewer parameters
A Simple Way to Initialize Recurrent Networks of Rectified Linear Units, pre-ResNet Hinton paper that suggested, that the identity matrix could be useful for the initialization of deep networks
ResNet with one-neuron hidden layers is a Universal Approximator, ResNet increases representational power for narrow deep networks because the skip connection and one neuron per hidden layer can uniformly approximate any Lebesgue integrable function in d dimensions (in contrast to fully connected networks).

phamvanlinh143 / awesome-very-deep-learning

Neural Ordinary Differential Equations

Papers

Implementations

Value Iteration Networks

Papers

Densely Connected Convolutional Networks

Papers

Implementations

Deep Residual Learning

Papers

Implementations

Highway Networks

Papers

Implementations

Very Deep Learning Theory

Papers

About