Hopfield Networks is All You Need

Hubert Ramsauer¹, Bernhard Schäfl¹, Johannes Lehner¹, Philipp Seidl¹, Michael Widrich¹, Lukas Gruber¹, Markus Holzleitner¹, Milena Pavlović^{3, 4}, Geir Kjetil Sandve⁴, Victor Greiff³, David Kreil², Michael Kopp², Günter Klambauer¹, Johannes Brandstetter¹, Sepp Hochreiter^{1, 2}

¹ ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria
² Institute of Advanced Research in Artificial Intelligence (IARAI)
³ Department of Immunology, University of Oslo, Norway
⁴ Department of Informatics, University of Oslo, Norway

A detailed blog post on this paper, together with the necessary background on Hopfield networks, is available at this link.

The transformer and BERT models pushed the performance on NLP tasks to new levels via their attention mechanism. We show that this attention mechanism is the update rule of a modern Hopfield network with continuous states. This new Hopfield network can store exponentially (with the dimension) many patterns, converges with one update, and has exponentially small retrieval errors. The number of stored patterns must be traded off against convergence speed and retrieval error. The new Hopfield network has three types of energy minima (fixed points of the update):

global fixed point averaging over all patterns,
metastable states averaging over a subset of patterns, and
fixed points which store a single pattern.
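
The following is a minimal sketch of the update rule described above, written in plain Python rather than with the PyTorch layer of this repository. It assumes the softmax form of the update (new state = stored patterns weighted by softmax of beta times their dot products with the current state). The function names, the pattern values, and the chosen values of beta are illustrative assumptions, not part of the repository.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hopfield_update(patterns, state, beta):
    """One update step: softmax(beta * <x_i, state>)-weighted average of the patterns."""
    scores = [beta * sum(p * s for p, s in zip(x, state)) for x in patterns]
    weights = softmax(scores)
    dim = len(state)
    return [sum(w * x[d] for w, x in zip(weights, patterns)) for d in range(dim)]

# Three stored patterns (illustrative values, chosen to be orthogonal).
patterns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# beta = 0: every pattern gets the same weight (global averaging).
averaged = hopfield_update(patterns, [0.9, 0.2, 0.1], beta=0.0)

# Moderate beta, state equally close to two patterns: averaging over a subset.
subset_avg = hopfield_update(patterns, [1.0, 1.0, 0.0], beta=10.0)

# Large beta: a single update lands almost exactly on one stored pattern.
retrieved = hopfield_update(patterns, [0.9, 0.2, 0.1], beta=50.0)
```

In this toy setting the three calls loosely mirror the three regimes listed above; which regime a trained layer actually operates in depends on the learned embeddings and on beta.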

Transformers learn an attention mechanism by constructing an embedding of patterns and queries into an associative space. Transformer and BERT models operate in their first layers preferably in the global averaging regime, while they operate in higher layers in metastable states. The gradient in transformers is maximal in the regime of metastable states, is uniformly distributed when averaging globally, and vanishes when a fixed point is near a stored pattern. Based on the Hopfield network interpretation, we analyzed learning of transformer and BERT architectures. Learning starts with attention heads that average and then most of them switch to metastable states. However, the majority of heads in the first layers still averages and can be replaced by averaging operations like the Gaussian weighting that we propose. In contrast, heads in the last layers steadily learn and seem to use metastable states to collect information created in lower layers. These heads seem a promising target for improving transformers. Neural networks that integrate Hopfield networks that are equivalent to attention heads outperform other methods on immune repertoire classification, where the Hopfield net stores several hundreds of thousands of patterns.

With this repository, we provide a PyTorch implementation of a new layer called “Hopfield”, which allows deep learning architectures to be equipped with Hopfield networks as a new memory concept.

The full paper is available at https://arxiv.org/abs/2008.02217.

Requirements

The software was developed and tested on the following 64-bit operating systems:

CentOS Linux release 8.1.1911 (Core)
macOS 10.15.5 (Catalina)

The development environment was Python 3.8.3 in combination with PyTorch 1.6.0 (PyTorch 1.5.0 or later should be sufficient). More details on how to install PyTorch are available on the official project page.

Usage

To get up and running with Hopfield-based networks, only one argument needs to be set: the size (depth) of the input.

hopfield = Hopfield(input_size=...)

It is also possible to replace commonly used pooling functions with a Hopfield-based one. Internally, a state pattern is trained, which in turn is used to compute pooling weights with respect to the input.

hopfield_pooling = HopfieldPooling(input_size=...)

The usage is as simple as with the main module, and equally powerful.
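
To illustrate the idea behind Hopfield-based pooling, here is a plain-Python sketch that does not use the HopfieldPooling module itself. A fixed vector stands in for the trained state pattern, and the pooled output is the softmax-weighted sum of the input instances. The function name hopfield_pool, the beta parameter, and all values are assumptions made for this example; the actual module may additionally apply learned projections and normalization, which are omitted here.

```python
import math

def hopfield_pool(instances, state_pattern, beta=1.0):
    """Pool a set of instances using softmax weights from a single state pattern."""
    scores = [beta * sum(a * b for a, b in zip(x, state_pattern)) for x in instances]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(instances[0])
    pooled = [sum(w * x[d] for w, x in zip(weights, instances)) for d in range(dim)]
    return pooled, weights

# A bag of three 2-dimensional instances (illustrative values).
bag = [[1.0, 0.0], [0.0, 1.0], [0.0, 2.0]]
# Stand-in for the trained state pattern (hypothetical values).
query = [0.0, 3.0]

pooled, weights = hopfield_pool(bag, query)
# Pooling over a set does not depend on the order of the instances.
pooled_reversed, _ = hopfield_pool(bag[::-1], query)
```

The instance that aligns best with the state pattern receives the largest pooling weight, so the pooled vector is dominated by it.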

Examples

Generally, the Hopfield layer is designed to be used to implement or to substitute different layers like:

Pooling layers: We consider the Hopfield layer a pooling layer if only one static state (query) pattern exists. It then de facto pools over the sequence, weighting the stored patterns by their softmax values.
Permutation equivariant layers: Our Hopfield layer can be used as a plug-in replacement for permutation equivariant layers. Since the Hopfield layer is an associative memory, it assumes no dependency between the input patterns.
GRU & LSTM layers: Our Hopfield layer can be used as a plug-in replacement for GRU & LSTM layers. Optionally, for substituting GRU & LSTM layers, positional encoding might be considered.
Attention layers: Our Hopfield layer can act as an attention layer, where state (query) and stored (key) patterns are different, and need to be associated.
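
The permutation equivariance and attention use cases listed above can be sketched in plain Python as follows. This is not the repository's Hopfield module: the function associate, the beta parameter, and the example values are assumptions, and learned projections are left out. Each state (query) pattern is associated with the stored (key) patterns via softmax weights; when a set is associated with itself, permuting the input only permutes the output.

```python
import math

def associate(queries, stored, beta=1.0):
    """For each query, return the softmax-weighted average of the stored patterns."""
    outputs = []
    dim = len(stored[0])
    for q in queries:
        scores = [beta * sum(a * b for a, b in zip(k, q)) for k in stored]
        m = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * k[d] for w, k in zip(weights, stored)) for d in range(dim)])
    return outputs

# Self-association of a small set of patterns (illustrative values).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]
X_perm = [X[i] for i in perm]

out = associate(X, X)
out_perm = associate(X_perm, X_perm)  # row j corresponds to out[perm[j]]

# Attention-style use: state patterns differ from the stored patterns.
state_patterns = [[2.0, 0.0]]
attended = associate(state_patterns, X)
```

Because no positional information enters the computation, a positional encoding would have to be added to the input when order matters, as noted for the GRU & LSTM case.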

The folder examples contains multiple demonstrations on how to use the Hopfield as well as the HopfieldPooling modules. To successfully run the contained Jupyter notebooks, additional third-party modules like pandas and seaborn are required.

Bit Pattern Set: The dataset of this demonstration falls into the category of binary classification tasks in the domain of Multiple Instance Learning (MIL) problems. Each bag comprises a collection of bit pattern instances, where each instance is a sequence of 0s and 1s. The positive class has specific bit patterns injected, which are absent in the negative one. This demonstration shows that Hopfield and HopfieldPooling are capable of learning and filtering each bag with respect to the class-defining bit patterns.
Latch Sequence Set: We study an easy example of learning long-term dependencies by using a simple latch task, see Hochreiter and Mozer. The essence of this task is that a sequence of inputs is presented, beginning with one of two symbols, A or B, and after a variable number of time steps, the model has to output a corresponding symbol. Thus, the task requires memorizing the original input over time. Note that the two class-defining symbols may only appear at the first position of a sequence. This task was specifically designed to demonstrate the capability of recurrent neural networks to capture long-term dependencies. This demonstration shows that Hopfield and HopfieldPooling adapt extremely fast to this specific task, concentrating only on the first entry of the sequence.
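
To make the two tasks more concrete, the sketch below generates toy data of the kind described above. It is not the data code used in the examples folder: the function names, the single injected signature pattern, the distractor symbols C, D and E, and all sizes are assumptions chosen for illustration.

```python
import random

def random_instance(rng, num_bits, forbidden):
    """Draw a random bit pattern that differs from the class-defining one."""
    while True:
        instance = [rng.randint(0, 1) for _ in range(num_bits)]
        if instance != forbidden:
            return instance

def make_bag(rng, num_instances, num_bits, signature, positive):
    """Build one MIL bag; positive bags get the signature injected at a random slot."""
    bag = [random_instance(rng, num_bits, signature) for _ in range(num_instances)]
    if positive:
        bag[rng.randrange(num_instances)] = list(signature)
    return bag

def make_latch_sequence(rng, length, distractors=("C", "D", "E")):
    """Build one latch sequence: A or B first, then only distractor symbols."""
    label = rng.choice(["A", "B"])
    rest = [rng.choice(distractors) for _ in range(length - 1)]
    return [label] + rest, label

rng = random.Random(0)  # fixed seed so the toy data is reproducible

# Bit Pattern Set: one positive and one negative bag (hypothetical sizes).
signature = [1, 1, 0, 0, 1, 0, 1, 1]
positive_bag = make_bag(rng, 10, 8, signature, positive=True)
negative_bag = make_bag(rng, 10, 8, signature, positive=False)

# Latch Sequence Set: the target equals the first symbol of the sequence.
sequence, target = make_latch_sequence(rng, length=rng.randint(5, 20))
```

In both cases the label depends on a single element of the input (the injected pattern or the first symbol), which is the kind of structure a softmax-weighted association can single out.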

Disclaimer

Some implementations in this repository are based on existing ones from the official PyTorch repository (v1.6.0), which were extended and modified accordingly. The parts involved are listed below:

The implementation of HopfieldCore is based on the implementation of MultiheadAttention.
The implementation of hopfield_core_forward is based on the implementation of multi_head_attention_forward.
The implementation of HopfieldEncoderLayer is based on the implementation of TransformerEncoderLayer.
The implementation of HopfieldDecoderLayer is based on the implementation of TransformerDecoderLayer.

License

This repository is BSD-style licensed (see LICENSE), except where noted otherwise.
