Picograd - Fully connected neural networks in Python

What is this?

In the process of learning ML and PyTorch, I decided to try writing my own neural network package in Python. It has been (relatively) successful.

Optimizing a neural network involves minimizing a cost function (the loss) that maps a very high-dimensional parameter space to a single scalar. Because of this, it would not be practical to compute hundreds of thousands of numerical derivatives, each requiring a separate forward pass of the network, just for a single optimization step. Instead, we can use something called Reverse Mode Automatic Differentiation. You can think of it as repeatedly applying the chain rule to a very large composition of functions.
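As a toy illustration (plain NumPy, not picograd), take f(x) = exp(sin(x)): the chain rule gives the exact derivative analytically, while a numerical estimate needs extra evaluations of f for every input dimension:

import numpy as np

def f(x):
    return np.exp(np.sin(x))

x = 1.3

# chain rule applied to the composition exp(sin(x)):
# df/dx = exp(sin(x)) * cos(x)
analytic = np.exp(np.sin(x)) * np.cos(x)

# central finite difference: two extra evaluations of f, per input dimension
eps = 1e-6
numeric = (f(x + eps) - f(x - eps)) / (2 * eps)

print(analytic, numeric)  # the two values agree to roughly 1e-10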

How does it work?

Since what we are doing is applying the chain rule at every node, we need a way to track how each node is transformed when an operation is applied to it. The picograd Tensor class is a wrapper around a NumPy array with the following attributes and methods:

  • value: This is the value of the node, stored as a NumPy array
  • parents: A set containing all the parent nodes of this Tensor
  • grad: The gradient at this node, initialized to a NumPy array of zeros with the same shape as value
  • _backward: A function holding the backward rule for the operation that produced this node
  • backward: A method which computes the backward pass starting from this node, i.e. the gradient of this node with respect to every node upstream of it
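A minimal sketch of what such a wrapper could look like (simplified, and not the actual picograd source, but following the attribute names above):

import numpy as np

class Tensor:
    def __init__(self, value, parents=()):
        self.value = np.asarray(value, dtype=float)  # the node's value
        self.parents = set(parents)                  # nodes this one was computed from
        self.grad = np.zeros_like(self.value)        # gradient, same shape as value
        self._backward = lambda: None                # set by the op that creates this node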

For example, let's take two values x=1 and y=2. If we add them together, we get z=x+y. In picograd, a new Tensor object for z is created with the following properties:

  • value: 3
  • parents: the set {x, y}
  • grad: 0
  • _backward: A lambda function telling us how to update the gradients for x and y.
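Continuing the sketch above (again simplified relative to the real picograd code, and ignoring broadcasting for now), addition could set these fields like so:

    # inside the sketch Tensor class from above
    def __add__(self, other):
        out = Tensor(self.value + other.value, parents=(self, other))

        def _backward():
            # d(out)/d(self) = 1 and d(out)/d(other) = 1, so each parent
            # simply accumulates the upstream gradient out.grad
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out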

If we call z.backward(), the backward method is called to compute all gradients. But what about the gradient of z? For backpropagation, the node from which backward() is called is set to have a gradient of 1. Note that this means we can only use this for scalar-valued functions.

Calling z.backward() builds the computational graph of all ancestor nodes starting at z. We then traverse this graph in reverse topological order and apply the _backward() function of each node.
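A sketch of what that backward method can look like, continuing the simplified Tensor class above:

    # inside the sketch Tensor class from above
    def backward(self):
        # build a topological ordering of every node reachable from this one
        order, visited = [], set()

        def visit(node):
            if node not in visited:
                visited.add(node)
                for parent in node.parents:
                    visit(parent)
                order.append(node)

        visit(self)

        # seed the starting node with a gradient of 1, then walk the graph
        # in reverse topological order, applying each node's _backward rule
        self.grad = np.ones_like(self.value)
        for node in reversed(order):
            node._backward()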

What actually works?

Any operation we would like to apply needs to have a corresponding backward pass method implemented. When dealing with functions operating on vectors and not just scalars, we implement these in the form of a Vector-Jacobian Product, or VJP. Rather than building the full Jacobian of an operation, a VJP directly computes the product of the upstream gradient with that Jacobian, which gives us a much more compact way to represent the gradient updates for a node.
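As a concrete example (plain NumPy here, not the picograd API): for C = A @ B, the full Jacobian of C with respect to A is a 4-dimensional object, but the backward pass only ever needs its product with the upstream gradient:

import numpy as np

A = np.random.randn(4, 3)
B = np.random.randn(3, 5)
C = A @ B

g = np.random.randn(*C.shape)   # upstream gradient dL/dC

# VJPs for matmul: the 4-d Jacobians are never materialized
dA = g @ B.T                    # dL/dA, shape (4, 3)
dB = A.T @ g                    # dL/dB, shape (3, 5)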

Operations for neural networks can be broken down into the following categories:

  • Elementwise operations: ReLU, Sigmoid, exp, softmax
  • Non-broadcasted binary operations: dot, matmul
  • Broadcasted binary operations: +, -, *
  • Reduction operations: mean, sum, max

Using the above categories, I have implemented the following in picograd:

  • ReLU, LeakyReLU, Sigmoid, Tanh, Softmax, LogSoftmax, Dropout, Log, Exp
  • Dot, Matmul
  • Add, Sub, Pow, (elementwise) Mul, Div
  • mean, sum, max

With these operations, you can construct all the pieces required to create a fully connected neural network. Add in an optimizer (SGD and Adam are implemented) and you can train the network! See examples/train_MNIST.ipynb for a neural network trained on MNIST. To run tests, see test/test_tensor.py.

A note on broadcasting operations

The backward pass for broadcasted operations is a bit subtle. Say we compute a linear layer on a set of inputs with a batch size greater than 1: output = Wx + b. In this computation, the bias vector b is broadcast to match the batch dimension. If we naively compute the backward gradient now, the gradient for b will have the wrong shape! Instead, we need to account for the broadcasting explicitly in the backward pass. I found a very helpful overview of it here. To summarize, we define an operation F which represents the broadcasting explicitly, making our linear pass output = Wx + F(b). We can then compute the VJP of F itself, which corresponds to summation along the broadcast (batch) axes.
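A sketch of that summation in plain NumPy (the helper name unbroadcast is mine, not something from picograd):

import numpy as np

def unbroadcast(grad, shape):
    # sum out leading axes added by broadcasting, then sum (keeping dims)
    # over any axis that was stretched from size 1, so grad matches shape
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    for axis, size in enumerate(shape):
        if size == 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

x = np.random.randn(16, 4)       # batch of 16 inputs
W = np.random.randn(4, 10)
b = np.random.randn(10)

out = x @ W + b                  # b is broadcast over the batch dimension
dout = np.ones_like(out)         # pretend upstream gradient, shape (16, 10)

db = unbroadcast(dout, b.shape)  # shape (10,): summed along the batch axis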

Example

from picograd.Tensor import Tensor

x = Tensor.eye(3)
y = Tensor([[2.0,0,-2.0]])
z = y.dot(x).sum()
z.backward()

print(x.grad)  # dz/dx
print(y.grad)  # dz/dy

How would you improve on this?

You can do a fair amount with just the operations I have implemented. There are a few different directions this project can take:

1. Adding more operations

If you wanted to train any sort of vision model, you would need to implement a 2D convolution operation, as well as average and max pooling operations. If you wanted to train a Transformer, you would need to implement a LayerNorm and a Concat operation. (Ok, you could construct LayerNorm and BatchNorm as compositions of the basic operations we already have, but that would be like computing the backward pass of a sigmoid as a composition of all its basic functions instead of directly as dout*sigmoid(z)*(1-sigmoid(z)).)
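For reference, here is that fused sigmoid backward written in plain NumPy:

import numpy as np

z = np.random.randn(8)
s = 1.0 / (1.0 + np.exp(-z))   # forward pass of sigmoid
dout = np.ones_like(z)         # upstream gradient

# single fused backward rule, instead of chaining the VJPs of
# negation, exp, addition, and reciprocal one by one
dz = dout * s * (1.0 - s)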

Aside from this, adding more loss functions could be helpful. Currently only Negative Log Likelihood loss and Mean-Squared Error loss are implemented.

2. Adding support for hardware acceleration

In its current state, picograd only works on CPU. Training the MNIST model in the examples takes around 10-15 minutes. Training the exact same model on a GPU in PyTorch takes around 30 seconds. Yikes! Given that I don't have any background in C/C++, which is required for adding OpenCL or CUDA support, this will be a difficult task.

3. Code Reformatting/MiniTorch

As I was writing this up, I came across MiniTorch, which describes itself as, "...a pure Python re-implementation of the Torch API designed to be simple, easy-to-read, tested, and incremental." In the future I plan on working through this and reformatting picograd to use the same structure as PyTorch.

References:

tinygrad and micrograd are two other implementations of neural networks in Python. They were very useful to have as references when working on this project.
