Tricycle is a fast, minimal, fully functional deep learning library written from scratch using only Python and NumPy.
The file train_smol_gpt.py trains a 49M parameter, GPT-2 style language model that can produce passable Python code in ~2 days on a single RTX 3090.
The entire library, from the automatic differentiation engine to a GPT, is written in ~4500 lines of Python + NumPy code.
Using CuPy, all Tricycle code can run on either a CUDA-capable GPU or a CPU.
Tricycle uses conda to manage dependencies. While we do support CPU-only computation, optimisation efforts have been focussed on GPU computation, so the CPU version is pretty slow. If you do have a CUDA-capable GPU, I would strongly recommend installing the GPU version of Tricycle.
If you have a CUDA-capable GPU, you can install Tricycle as follows.
conda env create -f environment.yml -n tricycle
conda activate tricycle
CPU and test installation
If you want to install test dependencies, you can do the following.
conda env create -f environment.test.yml -n tricycle
conda activate tricycle
If you want to install Tricycle for CPU, you can do the following.
conda env create -f environment.cpu.yml -n tricycle
conda activate tricycle
If you want to install test dependencies on CPU you can do the following.
conda env create -f environment.cpu.test.yml -n tricycle
conda activate tricycle
The following toy script will train a small GPT to generate convincing Shakespeare.
On my RTX 3090, this takes ~30 mins. For a more realistic training script with metric tracking, gradient accumulation, a validation dataset etc., take a look at train_smol_gpt.py.
import pickle
from tqdm import tqdm
from tricycle.configs import ShakespeareConfig
from tricycle.dataset import CausalLMDataset
from tricycle.loss import CrossEntropy
from tricycle.models import GPT
from tricycle.optimisers import AdamW
from tricycle_datasets.shakespeare import Shakespeare
config = ShakespeareConfig()
model = GPT(config)
tokens = Shakespeare(vocab_size=config.vocab_size)
dataset = (
    CausalLMDataset(
        tokens=tokens,
        vocab_size=config.vocab_size,
        batch_size=config.batch_size,
        context_window=config.context_window,
    )
    .batch()
    .shuffle()
    .to_tensor()
)
loss_fn = CrossEntropy()
optimiser = AdamW(
    learning_rate=config.max_learning_rate,
    weight_decay=config.weight_decay,
    betas=(config.beta1, config.beta2),
)
model.to_gpu()
loading_bar = tqdm(range(config.steps))
for step in loading_bar:
    optimiser.step()
    inputs, outputs = next(dataset)
    inputs = inputs.to_gpu()
    outputs = outputs.to_gpu()
    logits = model(inputs)
    loss = loss_fn(outputs, logits)
    loss.backward()
    loading_bar.set_description(f"loss: {loss:.3f}")
    model.update(optimiser)
# save results
with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
Once trained, you can generate infinite Shakespeare plays as follows:
python inference.py model.pkl
Tricycle code centers around objects called `Tensor`s. A `Tensor` is a wrapper around a NumPy array that adds some extra features:
from tricycle.tensor import to_tensor
tensor = to_tensor([1,2,3])
print(tensor) # Output: Tensor([1. 2. 3.])
You can do a lot of things with a tensor:
from tricycle.functions import Softmax
a = to_tensor([1,2,3])
b = to_tensor([4,5,6])
# addition
print(a + b) # Output: Tensor([5. 7. 9.], name=badd)
# comparison
print(a < b) # Output: Tensor([ True True True])
# more complex functions
print(Softmax()(a)) # Output: Tensor([0.09003057 0.24472848 0.66524094], name=softmax)
Unlike vanilla Numpy, every operation in Tricycle is attached to a derivative.
When you do some operations on your `Tensor`, Tricycle keeps track of what you did and allows you to differentiate the output:
x = to_tensor(2)
y = x ** 2 + 3 * x + 4
print(y) # Output: Tensor(14.0, name=+ 4)
# derivative of y with respect to (wrt) x is
# 2 * x + 3 = 7
y.backward() # differentiate wrt y
print(x.grad) # Output: Tensor(7.0)
This works on multidimensional tensors:
import numpy as np
shape = (6,5,4,3,2)
a = to_tensor(np.random.random(shape))
b = to_tensor(np.random.random(shape))
c = a * b # elementwise multiply
c.backward() # differentiate wrt c
assert a.grad.close_to(b) # derivative of c wrt a is b
assert b.grad.close_to(a) # derivative of c wrt b is a
And it even works through complex operations like attention:
from tricycle.blocks import MultiHeadSelfAttention
attention = MultiHeadSelfAttention(
    embedding_dim=32,
    n_heads=2,
    context_window=32,
)
# batch_size, n_tokens, embedding_dim
shape = (4,32,32)
input = to_tensor(np.ones(shape), is_batched=True)
output = attention(input)
output.backward() # differentiate wrt output
print(input.grad) # Output: Tensor([[[ 2.5441039 -2.0558214 -1.7923143 ...
assert input.grad.shape == (4,32,32)
When you run an operation (`Op`), the output has two pieces of information attached:

- `args`: the inputs to the function
- `back_fns`: the functions that should be executed to calculate the derivative wrt each of the inputs

Surprisingly, this is all you need to perform automatic differentiation on an arbitrarily complicated sequence of `Op`s.
Because we keep track of the `args` for each operation, we can start at the output of a set of `Op`s and traverse through them to reach every input to the sequence: the operations form a tree. Thanks to the chain rule, if we apply each `back_fn` that we pass through on our way through the tree, then when we get to an input, we will have calculated the derivative of the output wrt that input.
Despite implementing it myself, I still feel like this couldn't possibly work, and yet it does! The entirety of the algorithm can be found in `tensor.py`. It ends up being a topological sort to figure out which order to traverse the tree in, followed by a simple traversal, applying the `back_fns` along the way.
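To make the idea concrete, here is a minimal, self-contained sketch of the same mechanism (not Tricycle's actual implementation; the `Node` class and `mul` helper are purely illustrative): every operation records its `args` and `back_fns`, and `backward` topologically sorts the graph before applying the chain rule.

```python
import numpy as np

class Node:
    """A toy value that remembers how it was computed."""
    def __init__(self, value, args=(), back_fns=()):
        self.value = value        # result of the operation
        self.args = args          # inputs to the operation
        self.back_fns = back_fns  # one gradient function per input
        self.grad = None

    def backward(self):
        # topological sort: visit inputs before the nodes that use them
        order, seen = [], set()
        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for arg in node.args:
                    visit(arg)
                order.append(node)
        visit(self)

        # walk back from the output, applying each back_fn (chain rule)
        self.grad = np.ones_like(self.value)
        for node in reversed(order):
            for arg, back_fn in zip(node.args, node.back_fns):
                grad = back_fn(node.grad)
                arg.grad = grad if arg.grad is None else arg.grad + grad

def mul(a, b):
    return Node(
        a.value * b.value,
        args=(a, b),
        back_fns=(lambda grad: grad * b.value, lambda grad: grad * a.value),
    )

x, y = Node(np.array(3.0)), Node(np.array(4.0))
z = mul(x, y)
z.backward()
print(x.grad, y.grad)  # 4.0 3.0
```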
If you want a more detailed explanation, I've talked about it on my blog.
Tricycle makes use of (in my opinion underutilised) einsum operations. Einsum is a generalisation of a large number of matrix operations.
You can use it by assigning each axis in your matrices a letter of the alphabet (called an index). You can define the operation you want to perform by simply listing the indices you want in your inputs and output, separated by an arrow.
For example, you can define the transpose of a 2d tensor as follows:
from tricycle.einsum import Einsum
a = to_tensor([[1,2],[3,4]])
print(Einsum("ij->ji")(a)) # Output: Tensor([[1. 3.], [2. 4.]], name=einsum ij->ji)
Here, we use einsum to swap indices i and j: a transpose.
There are only two rules to remember with einsum:

- If an index does not appear in the output, any inputs that contain it will be summed along that axis:

print(Einsum("ij->i")(a)) # Tensor([3. 7.], name=einsum ij->i)

- If an index appears in more than one input, the tensors will be multiplied along that axis:

b = to_tensor([[5,6],[7,8]])
print(Einsum("ij,jk->ik")(a,b)) # Tensor([[19. 22.], [43. 50.]], name=einsum ij,jk->ik)
For example: (animations of the `ij->j`, `ij->`, `ij->ji` and `ijk->ik` einsum operations)
Because every `Op` in Tricycle needs a derivative, we need to figure out what the derivative of `Einsum` is. Thankfully, if you sit down and go through the maths (index notation is really helpful here), you'll find that you can follow these two really simple rules to differentiate an einsum operation wrt a given input:
- Swap the indices for the input and output
- Replace the original input with your current derivative
For example, the derivative of a transpose works like this:
# forward operation
y = Einsum('ij->ji')(a)
# swap the input with the current grad (a grid of ones in this case)
grad = to_tensor(np.ones_like(y))
# swap the indices
derivative = Einsum('ji->ij')(grad)
And for a more complex operation (a dense layer on a 4d input) like this:
# forward operation
inputs = to_tensor(np.random.random((5, 4, 3, 2)))
weights = to_tensor(np.random.random((2, 6)))
y = Einsum('zxTb,bW->zxTW')(inputs, weights)
grad = to_tensor(np.ones_like(y))
# swap the indices + replace inputs
derivative = Einsum('zxTb,zxTW->bW')(inputs, grad)
This little trick significantly simplifies code, as well as reducing the amount of maths I had to do to implement different operations.
Einsum and an automatic differentiation engine are all we need to build a simple neural network. Let's try to train a model on the iris dataset.
We can start with a `Dense` layer.
from tricycle.layers import Dense
x = to_tensor([1,2,3])
layer = Dense(from_size=3, to_size=1)
print(layer(x)) # Output: Tensor([-2.238703], name=dense)
Next, neural networks need a non-linearity (otherwise they reduce to expensive linear regressions).
Tricycle has a few non-linearities (also called activation functions). Here we can choose the simplest: `ReLU`.
from tricycle.activation import ReLU
x = to_tensor([-1, 0, 1])
activation_fn = ReLU()
print(activation_fn(x)) # Output: Tensor([0. 0. 1.], name=> 0)
We also need a loss function. We're predicting a category, so we can use `CrossEntropy`:
from tricycle.loss import CrossEntropy
label = to_tensor([0, 1, 2], dtype=int)
predicted = to_tensor([[0,0,1], [0,0,1], [0,0,1]])
loss = CrossEntropy()
print(loss(label, predicted)) # Output: Tensor(1.2181114, name=cross_entropy)
Finally, we need an optimiser to update our weights. We can use Stochastic Gradient Descent. In Tricycle, you can use an optimiser to update the weights of a model as follows:
from tricycle.activation import ReLU
from tricycle.layers import Dense, Sequential
from tricycle.optimisers import StochasticGradientDescent
# build a model
layer_1 = Dense(4, 16)
layer_2 = Dense(16, 3)
relu = ReLU()
model = Sequential(layer_1, relu, layer_2)
# create an optimiser
optimiser = StochasticGradientDescent(learning_rate=1e-1)
# do a forward and backward pass
x = to_tensor([1,2,3,4])
out = model(x)
out.backward()
# update the weights
model.update(optimiser)
We can put all of this together to train a simple neural network on the iris dataset.
import numpy as np
from sklearn.datasets import load_iris
from tricycle.activation import ReLU
from tricycle.tensor import to_tensor
from tricycle.layers import Dense, Sequential
from tricycle.loss import CrossEntropy
from tricycle.optimisers import StochasticGradientDescent
LEARNING_RATE = 1e-1
N_STEPS = 1000
np.random.seed(42)
X, y = load_iris(return_X_y=True)
inputs = to_tensor(X, is_batched=True)
# The class labels need to be ints for cross entropy
outputs = to_tensor(y, is_batched=True, dtype=int)
# create a model
layer_1 = Dense(4, 16)
layer_2 = Dense(16, 3)
relu = ReLU()
model = Sequential(layer_1, relu, layer_2)
loss_fn = CrossEntropy()
optimiser = StochasticGradientDescent(learning_rate=LEARNING_RATE)
for step in range(N_STEPS):
    y_pred = model(inputs)
    loss = loss_fn(outputs, y_pred)
    if step == 0:
        print(f"Initial loss: {loss}") # Output: Initial loss: Tensor(3.974701, name=cross_entropy)
    loss.backward()
    model.update(optimiser)
print(f"Final loss: {loss}") # Output: Final loss: Tensor(0.08622341, name=cross_entropy)
# Calculate accuracy
predicted_labels = np.argmax(y_pred.array, axis=-1)
accuracy = (predicted_labels == outputs.array).mean()
print(f"Accuracy: {accuracy:.2f}") # Output: Accuracy: 0.97
Deep learning is famously computationally heavy. If we want to train anything in a reasonable amount of time, there are several optimisations we need to make.
The first, and arguably most important, optimisation is batching. Instead of applying operations to each input individually, if we are clever about how we design an operation, we can apply it to many inputs at once.
For example, suppose we are multiplying a batch of tensors by a weight matrix. We could do it like this:
# batch of 1024 64x64 tensors
inputs = to_tensor(np.ones((1024, 64, 64)))
weights = to_tensor(np.random.random((64,64)))
output = [Einsum('ij,jk->ik')(inp, weights) for inp in inputs]
# 62.2 ms ± 186 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
But we can use the properties of Einsum
to do the same thing like this
output = Einsum('aij,jk->aik')(inputs, weights)
# 29.1 ms ± 99.2 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Which is more than 2x faster.
Some `Op`s in Tricycle behave slightly differently depending on whether a tensor is batched or not. You can tell Tricycle to use the batched version of `Op`s for a tensor by simply calling `.to_batched`. To convert it back, you can call `.from_batched`.
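For example, a quick sketch of the intended usage (assuming, like `.to_gpu` in the training script above, that these calls return the converted tensor):

```python
tensor = to_tensor(np.ones((4, 8)))
batched = tensor.to_batched()       # treat the first axis as a batch dimension
unbatched = batched.from_batched()  # back to an ordinary tensor
```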
As well as batching, another improvement that has a big impact on performance is using a GPU. For this, we can use a library called CuPy. CuPy lets you run NumPy code on a GPU. This means that we can use the same code for CPU and GPU computation, which greatly simplifies the codebase (and avoids me needing to write CUDA kernels, for now).
Every tensor in Tricycle has an `.xp` attribute. By default, this is just the NumPy library:
import numpy as np
tensor = to_tensor([1,2,3])
assert tensor.xp == np
But if you call `.to_gpu` on a tensor, this is the CuPy library:
import cupy as cp
tensor = to_tensor([1,2,3])
tensor.to_gpu()
assert tensor.xp == cp
(`xp` stands for `np` or `cp` because x is an "unknown".) This is really handy because it lets us write functions like this:
def forward(self, tensor: Tensor):
"""
Apply softmax. The softmax is only applied to the final
dimension of the tensor
Note: the tensor is normalised for numeric stability
"""
xp = tensor.xp
exp = xp.exp(
# subtract the largest value for numeric stability
tensor.array - xp.max(tensor.array, axis=-1, keepdims=True)
)
denominator = xp.sum(exp, axis=-1, keepdims=True)
self._out = exp / denominator
result = to_tensor(self._out)
result.args = (tensor,)
result.name = "softmax"
result.is_batched = tensor.is_batched
result.back_fns = (self.backward,)
return result
Because Cupy has the same interface as Numpy, this function will automatically run on the right device, with no code changes.
One of the problems I faced when trying to use Tricycle is that it used up a lot more memory than I expected. Because the `args` and `back_fns` need to be stored for every `Op`, a lot of memory was being used to store intermediate values.
For more complex operations like `Softmax`, this quickly adds up. However, we can avoid a lot of this overhead by pre-computing the combined derivative. In the case of `Softmax` (see above), we could have built it entirely out of low-level Tricycle operations, and this does work. But when you sit down and work out the derivative for softmax manually, it turns out to be pretty simple:
def backward(self, grad: Tensor) -> Tensor:
    xp = grad.xp
    inner = xp.sum(grad.array * self._out, axis=-1, keepdims=True)
    self._grad = self._out * (grad.array - inner)
    return to_tensor(
        self._grad,
        is_batched=grad.is_batched,
        requires_grad=grad.requires_grad,
    )
This kind of operation is a very common optimisation technique in deep learning called operator fusion. It ends up being a big optimisation for Tricycle because it lets us replace operations like `MultiHeadSelfAttention`, which would usually have 10s of intermediate values, with a single `forward` and `backward` function with a minimal set of intermediate values.
While batching, using a GPU and fusing are the major optimisations, I'd like to provide some honourable mentions.
While probably obvious to many readers, updating tensors in-place rather than replacing them with a new tensor caused a big speed up.
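In NumPy terms, the difference looks roughly like this (a generic illustration rather than Tricycle's optimiser code):

```python
# allocates a brand new array on every step
weights = weights - learning_rate * grad

# writes into the existing buffer instead
weights -= learning_rate * grad
```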
Operations like `CrossEntropy` can be implemented by applying a softmax and then applying the cross entropy operation but, if you do a bit of algebra, you can use something called the log-sum-exp trick to simplify the expression and cut down on the computations needed.
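As a rough illustration of the idea (not Tricycle's implementation): for logits z and true class t, cross entropy after a softmax is -log(softmax(z)[t]), which rearranges to logsumexp(z) - z[t], so the softmax never has to be computed explicitly.

```python
import numpy as np

def cross_entropy_from_logits(logits, label):
    """Cross entropy straight from logits using the log-sum-exp trick."""
    # subtracting the max keeps exp() from overflowing; it cancels out exactly
    shifted = logits - logits.max()
    return np.log(np.exp(shifted).sum()) - shifted[label]

logits = np.array([2.0, 1.0, 0.1])
print(cross_entropy_from_logits(logits, label=0))  # ~0.417
```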
As mentioned above, the GPU computation was performed on an NVIDIA RTX 3090. Understandably, this gets quite hot when training (probably something to do with it being in my cupboard?) which can reduce performance due to thermal throttling. However, I found that by removing my computer case and placing a household fan on top, I get about 30% better performance.
Putting all of these things together, Tricycle can train a small language model on Shakespeare in ~30 mins. Andrej Karpathy can do this in PyTorch in around 7 minutes on my machine (with a like-for-like config) which, given that the entire Tricycle project is in Python, means that Tricycle is surprisingly fast. That said, more work is needed to close the gap.
Now that we've got an automatic differentiation engine, we can start actually doing things with it. GPT-2 was arguably the first model to use the modern stack for language generation. Even modern state-of-the-art models like Llama 3 use the same basic architecture and training methods, with only a few small tweaks (e.g. swapping layer norm for RMS norm). Because I don't have access to many GPUs, we'll be training a smaller (49M parameter) version.
To build our GPT, we first need to understand its architecture:
There are a few important things to note in this diagram. First, the transformer is built out of 3 main pieces: the input block, a stack of transformer blocks, and an output layer. The input block turns a list of tokens into a list of embeddings (each token gets projected to an embedding vector). The stack of transformer blocks processes the embeddings but leaves their shape untouched, and the output layer converts each embedding into a vector that is the same length as the number of tokens in our vocabulary (more on this later).
This means that the transformer accepts a fixed number of tokens and predicts a fixed number of tokens. The number of tokens it accepts is usually called the context window but is sometimes called the block size or sequence length.
Also, it means that we can make our transformer bigger or smaller pretty easily by simply increasing the number of tokens in our context window, the size of our embeddings and the number of transformer blocks in our stack. (There is also the number of transformer heads but more on this later too).
We know the input block needs to take a list of tokens as an input and return a list of embeddings. We can do this with a dense layer. We can one-hot encode a token into a vector of 0s with a single 1 corresponding to the token id (e.g. `2 -> [0,0,1,0,...,0]`). Then we can pass this through a dense layer to convert it from a `1 x vocab_size` vector to a `1 x embedding_size` vector.
However, this is a very expensive operation: for each token, we need to do a multiplication by a `vocab_size x embedding_size` matrix. Notice, though, that the one-hot encoded vector is almost entirely 0s. If you go through the algebra, this means that the matrix multiplication is actually equivalent to simply returning a row from the weights matrix. That is, for token `t`, the output is the `t`th row of the matrix. Returning a single row from a matrix is dramatically faster than doing a matrix multiplication, so we'll do that instead. We can wrap this logic up in a new layer: `Embedding`.
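A quick NumPy check of that equivalence (purely illustrative; this isn't the actual `Embedding` layer):

```python
import numpy as np

vocab_size, embedding_size = 10, 4
weights = np.random.random((vocab_size, embedding_size))
token = 2

# one-hot encode the token and multiply by the weight matrix
one_hot = np.zeros(vocab_size)
one_hot[token] = 1
via_matmul = one_hot @ weights

# or just return the matching row of the weight matrix
via_lookup = weights[token]

assert np.allclose(via_matmul, via_lookup)
```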
We aren't quite done with the input block, however. Transformers perform better when they are given information about where each token sits in the context window (e.g. is a token at the start, the end, or somewhere in the middle?). In the original transformer paper, this was done with some sine waves, but GPT-2 uses learned positional embeddings, which are conceptually simpler. (Modern language models use rotary embeddings, which are still in development for Tricycle.) When we pass a token through an embedding layer, we also pass the index of the token through a different embedding layer and then add the two embeddings together. This way, the embedding contains information about which token was passed into the model, as well as where it is in the context window.
Putting these operations together, we finally get our input block:
(Figure: diagram of the input block)
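In plain NumPy, the computation the input block performs looks roughly like this (a sketch, not the actual Tricycle layers):

```python
import numpy as np

vocab_size, context_window, embedding_dim = 1024, 32, 64

# learned weights: one embedding per token id and one per position
token_embedding = np.random.random((vocab_size, embedding_dim))
position_embedding = np.random.random((context_window, embedding_dim))

tokens = np.random.randint(0, vocab_size, size=context_window)
positions = np.arange(context_window)

# each output embedding encodes which token it is AND where it sits
embeddings = token_embedding[tokens] + position_embedding[positions]
assert embeddings.shape == (context_window, embedding_dim)
```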
The transformer block is the core of a transformer. It is built from two main pieces: an attention block and a multi-layer-perceptron (MLP) block. Whenever we pass some data through one of these sub-blocks, we add whatever the sub-block outputs to the input of the block. This is called a residual layer (sometimes also called a skip connection). I think of transformers as having a "highway" that the embeddings pass along, with each sub-block adding extra context. You can imagine lower blocks adding information into the embeddings that are then read by blocks further along in the stack. Whether this mental model is helpful remains to be seen (and I'd love to be corrected if there is something I'm missing).
(Figure: high-level view of a transformer block)
Gradients (derivatives) in deep learning models have a habit of rapidly increasing in value (exploding) or decreasing to 0 (vanishing) so it is important to frequently rescale embeddings throughout the model. You'll notice that the embeddings are normalised before being passed through each sub-block. In GPT-2, this is done with a layer norm.
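Putting the residual "highway" and the normalisation together, the flow through one block is roughly the following (an illustrative sketch; `attention` and `mlp` are passed in as stand-ins for the real layers, and the toy `layer_norm` omits the learned scale and shift):

```python
import numpy as np

def layer_norm(x):
    # toy stand-in: rescale each embedding to zero mean and unit variance
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + 1e-5)

def transformer_block(x, attention, mlp):
    # attention sub-block: normalise, attend, add the result back onto the "highway"
    x = x + attention(layer_norm(x))
    # MLP sub-block: normalise, transform each embedding, add back again
    x = x + mlp(layer_norm(x))
    return x
```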
If you have heard anything about transformers, it is probably that they use attention. This is certainly the most complex part of a transformer but, at a high level, its goal is pretty simple: let each embedding interact with the other embeddings. This "interaction" takes the form of a matrix, called the attention matrix, that is `n_tokens x n_tokens x embedding_dim`. Each entry in the matrix is a vector that represents the interaction between two embeddings in the input.
Because this section gets a bit hairy it'll be helpful to see the goal we're heading towards:
The first thing we do is to pass the input embedding through a dense layer to make each embedding 3 times longer than it used to be. Then we split the resulting embedding into 3 separate pieces, unhelpfully called the key, query and value vectors. Because we projected each embedding before splitting, each of the new vectors is the same length as the original input. We won't use the value vector until later so we'll focus on the key and query vectors for now.
We could build our attention matrix by multiplying our key vector by our query vector, and this does work. However, in the original transformer paper, they first split each query into several smaller chunks (which they call heads), compute an attention matrix for each head individually, and then recombine them into a single attention matrix at the end. They claim this improves performance at a similar computational cost, and I don't have the resources to figure out whether this is actually true. For computational efficiency, I've avoided explicitly splitting and recombining by doing everything in place:
# key.shape = batch_size x n_tokens x embedding_dim
# query.shape = batch_size x n_tokens x embedding_dim
head_shape = (
    self.batch_size,
    self.n_tokens,  # number of tokens
    self.n_heads,  # number of heads
    self.head_size,  # embedding per head
)
# split into multiple heads
key = key.reshape(head_shape)
query = query.reshape(head_shape)
# reorder
key = xp.einsum("BTNH->BNTH", key)
query = xp.einsum("BTNH->BNTH", query)
# attend
self.divisor = sqrt(self.head_size)
attention = xp.einsum("BNIh, BNJh -> BNIJ", query, key)
attention = attention / self.divisor
I'd strongly recommend having a play around with the code here to get a feel for what these operations actually do.
Next, we need to digress slightly into how we train the model. To get our model to generate text, we'll train it by asking it to predict the next token in a sequence of tokens. Importantly, we do this for every token in the sequence: token 0 in the input is used to predict token 1 in the output etc. This means that the embeddings for earlier tokens can't be allowed to contain information about embeddings for later tokens. Otherwise, predicting the next token would be trivially easy for all but the final token.
Because we calculate the interaction between every token and every other token in the attention matrix, we end up sneaking information about later tokens into the attention for earlier tokens. To avoid this leakage, we apply a "mask" to the attention matrix. If you work it out, you find that the leakage happens entirely in the upper triangle of the attention matrix. We can remove this information by manually setting each of these values to -infinity.
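In NumPy, building and applying that mask looks roughly like this (illustrative; in Tricycle the mask is applied to the attention scores before the softmax, as in the `forward` method below):

```python
import numpy as np

n_tokens = 4
scores = np.random.random((n_tokens, n_tokens))

# True above the diagonal: token i must not be able to see tokens j > i
mask = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)

# masked scores become -inf, so they contribute nothing after the softmax
scores = np.where(mask, -np.inf, scores)
```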
Finally, we normalise the matrix by softmaxing each row, multiply the attention matrix by the value vector, and reshape it to convert it back into the original `n_tokens x embedding_dim` shape we started with. For reasons that I'm unclear about, we pass this output through a dense layer and optionally apply dropout if we want to regularise our model.
And that's it. My implementation of attention (without the dense layers) is as follows:
def forward(self, tensor: Tensor):
    xp = tensor.xp
    assert tensor.is_batched
    # split the input into 3 pieces
    self._input = tensor
    query = tensor[:, :, : self.embedding_dim]
    key = tensor[:, :, self.embedding_dim : self.embedding_dim * 2]
    value = tensor[:, :, self.embedding_dim * 2 :]
    # Figure out how big everything is
    self.batch_size = key.array.shape[0]
    self.head_size = self.embedding_dim // self.n_heads
    self.n_tokens = key.shape[-2]
    head_shape = (
        self.batch_size,
        self.n_tokens,  # number of tokens
        self.n_heads,  # number of heads
        self.head_size,  # embedding per head
    )
    out_shape = (self.batch_size, self.n_tokens, self.embedding_dim)
    # reshape and reorder the heads
    key = key.array
    query = query.array
    value = value.array
    key = key.reshape(head_shape)
    query = query.reshape(head_shape)
    value = value.reshape(head_shape)
    key = xp.einsum("BTNH->BNTH", key)
    query = xp.einsum("BTNH->BNTH", query)
    value = xp.einsum("BTNH->BNTH", value)
    self._key = key
    self._query = query
    self._value = value
    # attend
    self.divisor = sqrt(self.head_size)
    attention = xp.einsum("BNIh, BNJh -> BNIJ", query, key)
    attention = attention / self.divisor
    # mask
    attention = xp.where(
        self.mask[:, : self.n_tokens, : self.n_tokens], -xp.inf, attention
    )
    # softmax
    exp = xp.exp(attention - xp.max(attention, axis=-1, keepdims=True))
    denominator = xp.sum(exp, axis=-1, keepdims=True)
    attention = exp / denominator
    # TODO: come up with a better name
    # smush the heads back together
    self._before_smush = attention
    attention = xp.einsum("BNTj, BNjH -> BTNH", attention, value)
    attention = attention.reshape(out_shape)
    result = to_tensor(attention, is_batched=True)
    result.back_fns = (self.backward,)
    result.args = (self._input,)
    return result
Again, if you want to really understand this, I'd strongly suggest playing around with the code to understand what each little piece does.
Splitting each vector into multiple heads makes our variant of attention "multi-head". Applying a mask to hide future tokens makes our attention "causal", and splitting our input into 3 pieces that we then combine with each other makes our attention "self-attention". Putting this all together, the formal name for this variant of attention is "multi-head causal self-attention". In Tricycle, I've called it MultiHeadSelfAttention.
Compared to the attention block, the MLP block is much simpler. While you can think of attention as letting different embedding vectors interact with each other, you can think of the MLP block as adding information to each embedding individually. First, we pass each embedding through a Dense layer that projects it into a bigger vector. This was chosen to be 4 times longer than the original vector in the GPT-2 paper, so that's what we're using.
Next, we pass it through a non-linearity. This step is really important because if you skip this step, mathematically, your MLP block reduces to a single (very expensive) matrix multiplication and performance plummets. In GPT-2 we're using GeLU but I've added several other activation functions to Tricycle that you can try out if you're interested.
Finally, we project the output back down to its original size with another dense layer and optionally apply a dropout for regularisation.
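So the whole block is essentially a project-up, non-linearity, project-down sandwich. A rough NumPy sketch (not the Tricycle layers; biases and dropout omitted):

```python
import numpy as np

embedding_dim = 32
expansion = 4  # GPT-2 expands each embedding to 4x its original size

w_up = np.random.random((embedding_dim, embedding_dim * expansion))
w_down = np.random.random((embedding_dim * expansion, embedding_dim))

def gelu(x):
    # tanh approximation of GeLU, as used in GPT-2
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def mlp_block(x):
    x = x @ w_up       # project each embedding up to a 4x larger vector
    x = gelu(x)        # without this, the block collapses to one big matmul
    return x @ w_down  # project back down to the original size
```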
Once we've embedded our tokens and passed them through a stack of transformer blocks, all that remains is to turn the embeddings back into tokens. We can do this by passing them through a dense layer that turns each embedding into a `1 x vocab_size` vector. We can treat each of these outputs as a probability distribution over all tokens, where larger numbers mean the model thinks a token is more likely to come next and smaller numbers mean the model thinks it is less likely.
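Concretely, to generate the next token we take the output for the final position, softmax it into probabilities, and sample (or take the argmax of) a token id. A minimal NumPy sketch (not the actual inference script):

```python
import numpy as np

vocab_size = 1024
logits = np.random.random(vocab_size)  # the model's output for the final position

# softmax turns the scores into a probability distribution over the vocabulary
probabilities = np.exp(logits - logits.max())
probabilities /= probabilities.sum()

next_token = np.random.choice(vocab_size, p=probabilities)
```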
- Documentation
  - Explain how to train a language model
  - Explain the tokeniser
- Code
  - Rotary embeddings
  - Test RMS norm
  - Multi-GPU support
  - Optimise and use the tokeniser
- Experiments
  - Try a language dataset rather than pure code
  - Build a Llama-style model
  - Build a bigger language model (GPT-2 sized?)
Want to work together? You can reach me at: bclarkson-code@proton.me