Contents: Features · Example · Details · Datasets · Models and notebooks · Repository structure · Installation · Running · References
The repository contains a modular Python implementation of transformer architectures for natural language understanding and generation tasks, based on:

- *Attention Is All You Need* by Vaswani et al.[1], the seminal paper that introduces the attention-based transformer architecture and its application to sequence-to-sequence tasks, demonstrating its effectiveness by achieving state-of-the-art performance in machine translation and surpassing previous LSTM and CNN based neural machine translation architectures.
- The chapter on Transformers and Large Language Models from *Speech and Language Processing* by Jurafsky & Martin[2], which provides a more comprehensive and illustrative look at some of the high-level details discussed in *Attention Is All You Need*.
## Features

- Generic encoder-only, decoder-only and encoder-decoder transformer architectures.
- Wrappers for causal language modelling, sequence-to-sequence generation and classification/regression.
- Various decoding methods for causal/sequence-to-sequence generation (a standalone sketch of the sampling-based methods follows this list):
  - Search-based (greedy and beam search)
  - Sampling-based (nucleus, temperature and top-k sampling)
- Example applications to real-world datasets.
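As an illustration of the sampling-based methods listed above, the sketch below applies temperature scaling and top-k filtering to a vector of next-token logits using plain PyTorch. It is a standalone example: the function name and signature are illustrative and independent of this package's actual `decoding` implementations.

```python
import torch

def sample_next_token(
    logits: torch.Tensor, temperature: float = 1.0, k: int | None = None
) -> int:
    """Sample a token ID from next-token logits (illustrative sketch only)."""
    # temperature scaling: values < 1 sharpen the distribution, > 1 flatten it
    logits = logits / temperature
    if k is not None:
        # top-k filtering: mask out everything below the k-th largest logit
        kth_largest = torch.topk(logits, k).values[-1]
        logits = logits.masked_fill(logits < kth_largest, float("-inf"))
    # convert to probabilities and draw a single sample
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))

# greedy decoding is the k=1 special case, equivalent to an argmax
token_id = sample_next_token(torch.randn(32_000), temperature=0.5, k=5)
```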
This project is implemented using PyTorch and PyTorch Lightning. As PyTorch provides a number of transformer and attention related layers in its `torch.nn` submodule, this project explicitly avoids the use of:

- `torch.nn.Transformer`
- `torch.nn.TransformerEncoder`/`torch.nn.TransformerEncoderLayer`
- `torch.nn.TransformerDecoder`/`torch.nn.TransformerDecoderLayer`
- `torch.nn.MultiheadAttention`
- `torch.nn.functional.scaled_dot_product_attention`

All other layers provided by `torch.nn` are allowed, including:

- `nn.Embedding`: for token embedding look-up by vocabulary ID.
- `nn.LayerNorm`: for layer normalization as implemented in *Attention Is All You Need*.
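Since those layers are off-limits, the core computation they would otherwise provide has to be written out by hand. As a minimal standalone sketch (not this package's actual `attention.py` implementation; the function name and signature are illustrative), scaled dot-product attention as described in *Attention Is All You Need* can be expressed with ordinary tensor operations:

```python
import math
import torch

def scaled_dot_product_attention(
    q: torch.Tensor,  # queries, shape (..., seq_len, d_k)
    k: torch.Tensor,  # keys,    shape (..., seq_len, d_k)
    v: torch.Tensor,  # values,  shape (..., seq_len, d_v)
    mask: torch.Tensor | None = None,  # optional boolean mask, True = attend
) -> torch.Tensor:
    # attention scores, scaled by sqrt(d_k) as in Attention Is All You Need
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if mask is not None:
        # disallowed positions get -inf so they receive zero attention weight
        scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```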
- Transformer models implemented and made available in other libraries such as HuggingFace's `transformers` are not used in this project.
- However, the tokenizers provided by `transformers` were used, as developing tokenization algorithms was not the primary objective of this project.
- No existing "x from scratch" resources were used, such as the famous *Let's build GPT: from scratch, in code, spelled out.* by Andrej Karpathy[3].
- No other online resources were used, apart from official documentation for packages such as PyTorch, PyTorch Lightning and HuggingFace Tokenizers.
## Example

Training a causal language model to generate "Florida man"-style news headlines.
```python
from transformers import LlamaTokenizer

from transformer.params import TransformerParams, TemperatureSamplingParams
from transformer.models import CausalLM
from transformer.decoding import TemperatureSamplingDecoder

# initialize HuggingFace tokenizer
tokenizer = LlamaTokenizer.from_pretrained(
    "huggyllama/llama-7b", add_eos_token=True, legacy=False
)
tokenizer.add_special_tokens({"pad_token": "<pad>"})

# initialize the causal language model
model = CausalLM(
    params=TransformerParams(context_length=64),
    tokenizer=tokenizer,
)

# train the language model
model.train(...)

# initialize decoder for sequence generation
decoder = TemperatureSamplingDecoder(
    params=TemperatureSamplingParams(max_length=100, temperature=0.5, k=5),
    model=model,
)

# generation without context
decoder.generate()
# 'Florida man arrested after baby alligator, guns, drugs found inside truck'

# generation with context
decoder.generate("Florida man shot")
# 'Florida man shot and killed while attempting to steal pizza and Pokemon cards from Target'
```
## Details

While the original architecture described in *Attention Is All You Need* is an encoder-decoder architecture using transformers for neural machine translation (a sequence-to-sequence learning task), this project was designed to be more general, allowing for a variety of natural language tasks by implementing encoder-only, decoder-only and encoder-decoder architectures.
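In practice, the main mechanical difference between the three variants is the attention mask: encoder-only models attend bidirectionally, decoder-only models restrict each position to attend only to earlier positions via a causal mask, and encoder-decoder models additionally use cross-attention from the decoder to the encoder output. A minimal sketch of a causal mask, compatible with the attention function sketched earlier (again independent of this package's internals):

```python
import torch

seq_len = 5
# lower-triangular boolean mask: position i may only attend to positions j <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# encoder-only:    self-attention with mask=None (fully bidirectional)
# decoder-only:    self-attention with mask=causal_mask (autoregressive)
# encoder-decoder: causal self-attention in the decoder, plus unmasked
#                  cross-attention with queries from the decoder and
#                  keys/values from the encoder output
```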
## Datasets

The following datasets were used to test the above transformer implementations on various tasks.

- arXiv Paper Abstracts: arXiv manuscripts and their metadata including titles, abstracts and categories.
- CommonLit Readability Prize: Literary passages and their associated "readability" score for use in grade 3-12 classrooms.
- Reddit r/FloridaMan: News headlines about various (often funny and irrational) actions performed by Florida men and women.
- Europarl: Transcriptions of European Parliament proceedings between 1996 and 2006, collected in 11 languages.
## Models and notebooks

- `ClassifierLM`: A generic transformer-based language model for assigning classes to text.
  - `notebooks/arxiv_categorization.ipynb` applies this model to the arXiv Paper Abstracts dataset to categorize arXiv manuscripts based on their titles.
- `RegressorLM`: A generic transformer-based language model for assigning scores to text.
  - `notebooks/commonlit_readability.ipynb` applies this model to the CommonLit Readability Prize dataset to rate the complexity of literary passages for grade 3-12 students.
- `CausalLM`: A generic transformer-based language model for generating text in an autoregressive manner.
  - `notebooks/florida_man_generation.ipynb` applies this model to the Reddit r/FloridaMan dataset to generate humorous news headlines involving the (mis)adventures of Florida men and women.
- `Seq2SeqLM`: A generic transformer-based language model for generating output text given an input text (an illustrative usage sketch follows this list).
  - `notebooks/arxiv_summarization.ipynb` applies this model to the arXiv Paper Abstracts dataset to generate arXiv paper titles by summarizing their corresponding abstracts.
  - `notebooks/europarl_translation.ipynb` applies this model to the Europarl dataset to translate transcribed parliamentary proceedings from French to English.
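By way of illustration, training a `Seq2SeqLM` for translation might look as follows. This sketch assumes the class mirrors the `CausalLM` interface shown in the example above; the constructor arguments and `train` call here are assumptions rather than confirmed API, so consult the notebooks for actual usage.

```python
from transformers import LlamaTokenizer

from transformer.params import TransformerParams
from transformer.models import Seq2SeqLM

# NOTE: illustrative sketch only; assumes Seq2SeqLM mirrors the CausalLM
# interface from the example above rather than documenting confirmed API
tokenizer = LlamaTokenizer.from_pretrained(
    "huggyllama/llama-7b", add_eos_token=True, legacy=False
)
tokenizer.add_special_tokens({"pad_token": "<pad>"})

model = Seq2SeqLM(
    params=TransformerParams(context_length=128),  # assumed hyper-parameter value
    tokenizer=tokenizer,
)

# train on (source, target) pairs, e.g. French-English sentences from Europarl
model.train(...)
```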
## Repository structure

- `notebooks/`: Notebooks applying the models in `transformer.models` to various datasets.
- `transformer/`: Core package containing the transformer implementations.
  - `dataloaders/`: `LightningDataModule`s for each model in `transformer.models`.
  - `decoding/`: Decoding method implementations for causal and sequence-to-sequence LMs.
  - `models/`: Task-specific transformers implemented using `transformer.modules.transformers`.
  - `modules/`: `LightningModule`s used within the transformers in `transformer.models`.
    - `transformers/`: Encoder-only, decoder-only and encoder-decoder transformer definitions.
    - `attention.py`: Masked/unmasked multi-head self-attention definition.
    - `block.py`: Transformer block definition.
    - `embedding.py`: Positional encoding and input embedding definition (see the sketch after this list).
  - `params/`: Pydantic hyper-parameter classes.
  - `utils/`: Supporting custom layers, functions and constants.
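As an example of what the pieces in `modules/` implement, the sinusoidal positional encoding described in *Attention Is All You Need* can be written as below. This is an independent sketch of the standard formulation, not the actual contents of `embedding.py`, and it assumes an even `d_model`.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Standard sinusoidal positional encoding (sketch, not the package's code)."""
    positions = torch.arange(seq_len).unsqueeze(1)  # shape: (seq_len, 1)
    div_terms = torch.exp(
        torch.arange(0, d_model, 2) * (-math.log(10_000.0) / d_model)
    )  # shape: (d_model / 2,)
    encoding = torch.zeros(seq_len, d_model)
    encoding[:, 0::2] = torch.sin(positions * div_terms)  # even dimensions
    encoding[:, 1::2] = torch.cos(positions * div_terms)  # odd dimensions
    return encoding  # added to the token embeddings before the first block
```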
## Installation

The transformer implementation is installable as a local Python package, named `transformer`.

```console
pip install -e .
```

To run the notebooks, you will need additional dependencies which can be installed with the `notebooks` extra.

```console
pip install -e ".[notebooks]"
```
This package was developed on Python 3.11.8, so it is recommended to use a virtual environment with the same version.
## Running

You should be able to simply run the Jupyter notebooks in the `notebooks/` folder.

Beware, they take time – even with a good GPU (especially the sequence-to-sequence ones)!
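## References

1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. (2017). Attention Is All You Need. *Advances in Neural Information Processing Systems 30*.
2. Jurafsky, D. & Martin, J. H. *Speech and Language Processing* (3rd ed. draft), chapter on Transformers and Large Language Models.
3. Karpathy, A. *Let's build GPT: from scratch, in code, spelled out.* (video).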
© 2024-2025, Edwin Onuonga - Published under the terms of the MIT license.
Authored and maintained by Edwin Onuonga.