Contents: Features · Example · Details · Datasets · Models and notebooks · Repository structure · Installation · Running · References
The repository contains a modular Python implementation of transformer architectures for natural language understanding and generation tasks, based on:

- *Attention Is All You Need* by Vaswani et al.[1], the seminal paper that introduces the attention-based transformer architecture and its application to sequence-to-sequence tasks, demonstrating its effectiveness by achieving state-of-the-art performance in machine translation and surpassing previous LSTM and CNN based neural machine translation architectures.
- The chapter on Transformers and Large Language Models from *Speech and Language Processing* by Jurafsky & Martin[2], which provides a more comprehensive and illustrative look at some of the high-level details discussed in *Attention Is All You Need*.
## Features

- Generic encoder-only, decoder-only and encoder-decoder transformer architectures.
- Wrappers for causal language modelling, sequence-to-sequence generation and classification/regression.
- Various decoding methods for causal/sequence-to-sequence generation (a standalone sketch of the sampling-based methods follows this list):
  - Search-based (greedy and beam search)
  - Sampling-based (nucleus, temperature and top-k sampling)
- Example applications to real-world datasets.
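As an illustration of the sampling-based methods listed above, the sketch below applies temperature scaling and top-k filtering to a vector of next-token logits using plain PyTorch. It is a standalone example: the function name and signature are illustrative and independent of this package's actual `decoding` implementations.

```python
import torch

def sample_next_token(
    logits: torch.Tensor, temperature: float = 1.0, k: int | None = None
) -> int:
    """Sample a token ID from next-token logits (illustrative sketch only)."""
    # temperature scaling: values < 1 sharpen the distribution, > 1 flatten it
    logits = logits / temperature
    if k is not None:
        # top-k filtering: mask out everything below the k-th largest logit
        kth_largest = torch.topk(logits, k).values[-1]
        logits = logits.masked_fill(logits < kth_largest, float("-inf"))
    # convert to probabilities and draw a single sample
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# greedy decoding is the k=1 special case, equivalent to an argmax
token_id = sample_next_token(torch.randn(32_000), temperature=0.5, k=5)
```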
This project is implemented using PyTorch and PyTorch Lightning. As PyTorch provides a number of transformer and attention related layers in its `torch.nn` submodule, this project explicitly avoids the use of:

- `torch.nn.Transformer`
- `torch.nn.TransformerEncoder`/`torch.nn.TransformerEncoderLayer`
- `torch.nn.TransformerDecoder`/`torch.nn.TransformerDecoderLayer`
- `torch.nn.MultiheadAttention`
- `torch.nn.functional.scaled_dot_product_attention`

All other layers provided by `torch.nn` are allowed, including:

- `nn.Embedding`: for token embedding look-up by vocabulary ID.
- `nn.LayerNorm`: for layer normalization as implemented in *Attention Is All You Need*.
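Since those layers are off-limits, the core computation they would otherwise provide has to be written out by hand. As a minimal standalone sketch (not this package's actual `attention.py` implementation; the function name and signature are illustrative), scaled dot-product attention as described in *Attention Is All You Need* can be expressed with ordinary tensor operations:

```python
import math
import torch

def scaled_dot_product_attention(
    q: torch.Tensor,  # queries, shape (..., seq_len, d_k)
    k: torch.Tensor,  # keys,    shape (..., seq_len, d_k)
    v: torch.Tensor,  # values,  shape (..., seq_len, d_v)
    mask: torch.Tensor | None = None,  # optional boolean mask, True = attend
) -> torch.Tensor:
    # attention scores, scaled by sqrt(d_k) as in Attention Is All You Need
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        # disallowed positions get -inf so they receive zero attention weight
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```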
- Transformer models implemented and made available in other libraries such as HuggingFace's `transformers` are not used in this project.
- However, the tokenizers provided by `transformers` were used, as developing tokenization algorithms was not the primary objective of this project.
- No existing "x from scratch" resources were used, such as the famous *Let's build GPT: from scratch, in code, spelled out.* by Andrej Karpathy[3].
- No other online resources were used, apart from official documentation for packages such as PyTorch, PyTorch Lightning and HuggingFace Tokenizers.
## Example

Training a causal language model to generate "Florida man"-style news headlines.
```python
from transformers import LlamaTokenizer

from transformer.params import TransformerParams, TemperatureSamplingParams
from transformer.models import CausalLM
from transformer.decoding import TemperatureSamplingDecoder

# initialize HuggingFace tokenizer
tokenizer = LlamaTokenizer.from_pretrained(
    "huggyllama/llama-7b", add_eos_token=True, legacy=False
)
tokenizer.add_special_tokens({"pad_token": "<pad>"})

# initialize the causal language model
model = CausalLM(
    params=TransformerParams(context_length=64),
    tokenizer=tokenizer,
)

# train the language model
model.train(...)

# initialize decoder for sequence generation
decoder = TemperatureSamplingDecoder(
    params=TemperatureSamplingParams(max_length=100, temperature=0.5, k=5),
    model=model,
)

# generation without context
decoder.generate()
# 'Florida man arrested after baby alligator, guns, drugs found inside truck'

# generation with context
decoder.generate("Florida man shot")
# 'Florida man shot and killed while attempting to steal pizza and Pokemon cards from Target'
```
## Details

While the original architecture described in *Attention Is All You Need* is an encoder-decoder architecture using transformers for neural machine translation (a sequence-to-sequence learning task), this project was designed to be more general, allowing for a variety of natural language tasks by implementing encoder-only, decoder-only and encoder-decoder architectures.
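In practice, the main mechanical difference between the three variants is the attention mask: encoder-only models attend bidirectionally, decoder-only models restrict each position to attend only to earlier positions via a causal mask, and encoder-decoder models additionally use cross-attention from the decoder to the encoder output. A minimal sketch of a causal mask, compatible with the attention function sketched earlier (again independent of this package's internals):

```python
import torch

seq_len = 5
# lower-triangular boolean mask: position i may only attend to positions j <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# encoder-only:    self-attention with mask=None (fully bidirectional)
# decoder-only:    self-attention with mask=causal_mask (autoregressive)
# encoder-decoder: causal self-attention in the decoder, plus unmasked
#                  cross-attention with queries from the decoder and
#                  keys/values from the encoder output
```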
## Datasets

The following datasets were used to test the above transformer implementations on various tasks.

- arXiv Paper Abstracts: arXiv manuscripts and their metadata including titles, abstracts and categories.
- CommonLit Readability Prize: Literary passages and their associated "readability" score for use in grade 3-12 classrooms.
- Reddit r/FloridaMan: News headlines about various (often funny and irrational) actions performed by Florida men and women.
- Europarl: Transcriptions of European Parliament proceedings between 1996 and 2006, collected in 11 languages.
## Models and notebooks

- `ClassifierLM`: A generic transformer-based language model for assigning classes to text.
  - `notebooks/arxiv_categorization.ipynb` applies this model to the arXiv Paper Abstracts dataset to categorize arXiv manuscripts based on their titles.
- `RegressorLM`: A generic transformer-based language model for assigning scores to text.
  - `notebooks/commonlit_readability.ipynb` applies this model to the CommonLit Readability Prize dataset to rate the complexity of literary passages for grade 3-12 students.
- `CausalLM`: A generic transformer-based language model for generating text in an autoregressive manner.
  - `notebooks/florida_man_generation.ipynb` applies this model to the Reddit r/FloridaMan dataset to generate humorous news headlines involving the (mis)adventures of Florida men and women.
- `Seq2SeqLM`: A generic transformer-based language model for generating output text given an input text (an illustrative usage sketch follows this list).
  - `notebooks/arxiv_summarization.ipynb` applies this model to the arXiv Paper Abstracts dataset to generate arXiv paper titles by summarizing their corresponding abstracts.
  - `notebooks/europarl_translation.ipynb` applies this model to the Europarl dataset to translate transcribed parliamentary proceedings from French to English.
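By way of illustration, training a `Seq2SeqLM` for translation might look as follows. This sketch assumes the class mirrors the `CausalLM` interface shown in the example above; the constructor arguments and `train` call here are assumptions rather than confirmed API, so consult the notebooks for actual usage.

```python
from transformers import LlamaTokenizer

from transformer.params import TransformerParams
from transformer.models import Seq2SeqLM

# NOTE: illustrative sketch only; assumes Seq2SeqLM mirrors the CausalLM
# interface from the example above rather than documenting confirmed API
tokenizer = LlamaTokenizer.from_pretrained(
    "huggyllama/llama-7b", add_eos_token=True, legacy=False
)
tokenizer.add_special_tokens({"pad_token": "<pad>"})

model = Seq2SeqLM(
    params=TransformerParams(context_length=128),  # assumed hyper-parameter value
    tokenizer=tokenizer,
)

# train on (source, target) pairs, e.g. French-English sentences from Europarl
model.train(...)
```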
## Repository structure

- `notebooks/`: Notebooks applying the models in `transformer.models` to various datasets.
- `transformer/`: Core package containing the transformer implementations.
  - `dataloaders/`: `LightningDataModule`s for each model in `transformer.models`.
  - `decoding/`: Decoding method implementations for causal and sequence-to-sequence LMs.
  - `models/`: Task-specific transformers implemented using `transformer.modules.transformers`.
  - `modules/`: `LightningModule`s used within the transformers in `transformer.models`.
    - `transformers/`: Encoder-only, decoder-only and encoder-decoder transformer definitions.
    - `attention.py`: Masked/unmasked multi-head self-attention definition.
    - `block.py`: Transformer block definition.
    - `embedding.py`: Positional encoding and input embedding definition (see the sketch after this list).
  - `params/`: Pydantic hyper-parameter classes.
  - `utils/`: Supporting custom layers, functions and constants.
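As an example of what the pieces in `modules/` implement, the sinusoidal positional encoding described in *Attention Is All You Need* can be written as below. This is an independent sketch of the standard formulation, not the actual contents of `embedding.py`, and it assumes an even `d_model`.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding (sketch, not the package's code)."""
    positions = torch.arange(seq_len).unsqueeze(1)  # shape: (seq_len, 1)
    div_terms = torch.exp(
        torch.arange(0, d_model, 2) * (-math.log(10_000.0) / d_model)
    )  # shape: (d_model / 2,)
    encoding = torch.zeros(seq_len, d_model)
    encoding[:, 0::2] = torch.sin(positions * div_terms)  # even dimensions
    encoding[:, 1::2] = torch.cos(positions * div_terms)  # odd dimensions
    return encoding  # added to the token embeddings before the first block
```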
## Installation

The transformer implementation is installable as a local Python package, named `transformer`.

```console
pip install -e .
```

To run the notebooks, you will need additional dependencies which can be installed with the `notebooks` extra.

```console
pip install -e ".[notebooks]"
```
This package was developed on Python 3.11.8, so it is recommended to use a virtual environment with the same version.
## Running

You should be able to simply run the Jupyter notebooks in the `notebooks/` folder.

Beware, they take time – even with a good GPU (especially the sequence-to-sequence ones)!
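## References

1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. (2017). Attention Is All You Need. *Advances in Neural Information Processing Systems 30*.
2. Jurafsky, D. & Martin, J. H. *Speech and Language Processing* (3rd ed. draft), chapter on Transformers and Large Language Models.
3. Karpathy, A. *Let's build GPT: from scratch, in code, spelled out.* (video).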
© 2024-2025, Edwin Onuonga - Published under the terms of the MIT license.
Authored and maintained by Edwin Onuonga.