NOTE: paused work on this until I can do quick, expressive training runs.
Now that I have a 3090, I think it'll be an interesting exercise to go through key papers in deep learning history. In an effort to cover my bases, I have the Deep Learning book with me. This is not a repo about the contents of these papers; instead, it's a log of things I personally learned along the way. Everything will be done in torch.
Deep Learning (Ian J. Goodfellow, Yoshua Bengio and Aaron Courville), MIT Press, 2016.
`pip install torch numpy pandas`
- cybernetics + "model of a neuron" (McCulloch and Pitts, 1943; Hebb, 1949)
- perceptron (Rosenblatt, 1958)
- adaptive linear element (ADALINE)
- back-propagation (Rumelhart et al., 1986)
- deep learning (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007)
Important Papers:
- Nesterov momentum: "On the importance of initialization and momentum in deep learning" (Sutskever et al., 2013), the paper linked in the PyTorch SGD implementation
- AdamW: "Decoupled Weight Decay Regularization" (Loshchilov and Hutter, 2019). The goal is not exactly convergence.
- Early stopping: https://github.com/Bjarten/early-stopping-pytorch (all three items are sketched in torch below)
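A quick sketch of how these show up in torch. The two optimizer constructors are real PyTorch APIs; the placeholder model, the hyperparameters, and the dummy validation losses in the early-stopping loop are my own assumptions for illustration (the linked repo wraps the same patience idea in a class):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)  # placeholder model, just to have parameters to optimize

# Nesterov momentum via torch.optim.SGD (the Sutskever et al., 2013 scheme).
sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

# AdamW: Adam with weight decay decoupled from the gradient update.
adamw = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Minimal early-stopping loop; dummy validation losses stand in for a real eval pass.
val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]
best, patience, bad = float("inf"), 3, 0
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best:
        best, bad = val_loss, 0  # improvement: reset the patience counter
    else:
        bad += 1
        if bad >= patience:
            print(f"early stopping at epoch {epoch}")
            break
```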
The use case was mostly simple binary classifiers. To demonstrate the perceptron + ADALINE (see the sketch after this list):
- use a single linear layer
- zero initialized weights + biases
- stochastic gradient descent (SGD) optimizer
- mean squared error (MSE) loss
This is probably the simplest form of backpropagation.
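A minimal sketch of that recipe. The toy dataset, learning rate, and epoch count are my own assumptions, not from any of the papers:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy linearly separable data (an assumption for illustration): label is 1
# when the two coordinates sum to a positive number.
X = torch.randn(200, 2)
y = (X.sum(dim=1) > 0).float().unsqueeze(1)

# Single linear layer with zero-initialized weights and biases.
model = nn.Linear(2, 1)
nn.init.zeros_(model.weight)
nn.init.zeros_(model.bias)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()

# ADALINE learns on the raw linear activation; full-batch updates here for
# brevity (the 1960 version updated one sample at a time).
for epoch in range(50):
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()

# Perceptron-style decision: threshold the output to get a class.
preds = (model(X) > 0.5).float()
print(f"loss={loss.item():.4f} accuracy={(preds == y).float().mean().item():.2f}")
```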
Deep learning was heavily inspired by the brain, but most of its advances came from engineering.
- 1975-1980 - the neocognitron is introduced (Fukushima, 1980)
- 1986 - connectionism / parallel distributed processing https://stanford.edu/~jlmcc/papers/PDP/Chapter1.pdf
- distributed representation (https://web.stanford.edu/~jlmcc/papers/PDP/Chapter3.pdf) - the idea that each input to a system should be represented by many features, and each feature should be involved in representing many possible inputs
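A toy illustration of the contrast (my own example, not from the PDP chapters): a one-hot scheme would dedicate one unit to each (color, shape) pair, while a distributed scheme reuses a small set of feature units across many inputs.

```python
colors = ["red", "green", "blue"]
shapes = ["circle", "square", "triangle"]

# One-hot over inputs: one unit per (color, shape) pair -> 9 units, each used
# by exactly one input. Distributed: one unit per feature -> 6 units, each
# shared across many inputs.
def distributed(color: str, shape: str) -> list[int]:
    return [int(c == color) for c in colors] + [int(s == shape) for s in shapes]

print(distributed("red", "circle"))  # [1, 0, 0, 1, 0, 0]
print(distributed("red", "square"))  # [1, 0, 0, 0, 1, 0] -- reuses the "red" unit
```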
The 1990s brought progress in modeling sequences with neural networks:
- Hochreiter (1991) and Bengio et al. (1994) identified some of the fundamental mathematical difficulties in modeling long sequences (see the sketch after this list).
- Hochreiter and Schmidhuber (1997) introduced the long short-term memory (LSTM) network to resolve some of these difficulties.
- Kernel machines (Boser et al., 1992; Cortes and Vapnik, 1995; Schölkopf et al., 1999) and graphical models (Jordan, 1998) achieved good results on many important tasks, which led to a decline in the popularity of neural networks.
- The Canadian Institute for Advanced Research (CIFAR) played a key role in keeping neural network research alive, uniting machine learning groups led by Geoffrey Hinton, Yoshua Bengio, and Yann LeCun.
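A toy sketch of the difficulty Hochreiter and Bengio et al. identified (my own illustration, not from the book): backpropagating through many steps of the same recurrence scales the gradient exponentially with sequence length, so it either vanishes or explodes.

```python
import torch

# Backprop through T repeats of the same scalar weight scales the gradient by
# roughly w**T: it vanishes for |w| < 1 and explodes for |w| > 1.
for w_val, label in [(0.5, "vanishes"), (1.5, "explodes")]:
    w = torch.tensor(w_val, requires_grad=True)
    h = torch.tensor(1.0)
    for _ in range(50):  # 50 "time steps" of a linear recurrence h <- w * h
        h = w * h
    h.backward()
    print(f"w={w_val}: dh/dw = {w.grad.item():.3e} ({label})")
```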
Milestone networks, roughly in chronological order:
- Perceptron (Rosenblatt, 1958, 1962)
- Adaptive linear element (Widrow and Hoff, 1960)
- Neocognitron (Fukushima, 1980)
- Early back-propagation network (Rumelhart et al., 1986b)
- Recurrent neural network for speech recognition (Robinson and Fallside, 1991)
- Multilayer perceptron for speech recognition (Bengio et al., 1991)
- Mean field sigmoid belief network (Saul et al., 1996)
- LeNet-5 (LeCun et al., 1998b)
- Echo state network (Jaeger and Haas, 2004)
- Deep belief network (Hinton et al., 2006)
- GPU-accelerated convolutional network (Chellapilla et al., 2006)
- Deep Boltzmann machine (Salakhutdinov and Hinton, 2009a)
- GPU-accelerated deep belief network (Raina et al., 2009)
- Unsupervised convolutional network (Jarrett et al., 2009)
- GPU-accelerated multilayer perceptron (Ciresan et al., 2010)
- OMP-1 network (Coates and Ng, 2011)
- Distributed autoencoder (Le et al., 2012)
- Multi-GPU convolutional network (Krizhevsky et al., 2012)
- COTS HPC unsupervised convolutional network (Coates et al., 2013)
- GoogLeNet (Szegedy et al., 2014a)
As of the book's publication in 2016, LSTMs were seen as revolutionizing machine translation (Sutskever et al., 2014; Bahdanau et al., 2015).