Matrix Calculus for Machine Learning and Beyond

This is the course page for an 18.S096 Special Subject in Mathematics at MIT taught in January 2024 (IAP) by Professors Alan Edelman and Steven G. Johnson.

For the previous version of this course, see Matrix Calculus in IAP 2023 (OCW) on OpenCourseWare (also on github, with videos on YouTube). See also Matrix Calculus in IAP 2022 (OCW) (also on github).

Lectures: MWF time 11am–1pm, ~~Jan 16~~ Jan 17–Feb 2 in room ~~2-131~~ 2-105; lecture recordings posted on Panopto (MIT only). 3 units, 2 problem sets due Jan 26 and Feb 2 — submitted electronically via Canvas, no exams. TA/grader: Fedir Yudin.

Course Notes: Draft notes from IAP 2023. Other materials to be posted.

Piazza forum: Online discussions at this piazza link.

Description:

We all know that calculus courses such as 18.01 and 18.02 are univariate and vector calculus, respectively. Modern applications such as machine learning and large-scale optimization require the next big step, "matrix calculus" and calculus on arbitrary vector spaces.

This class covers a coherent approach to matrix calculus showing techniques that allow you to think of a matrix holistically (not just as an array of scalars), generalize and compute derivatives of important matrix factorizations and many other complicated-looking operations, and understand how differentiation formulas must be re-imagined in large-scale computing. We will discuss reverse/adjoint/backpropagation differentiation, custom vector-Jacobian products, and how modern automatic differentiation is more computer science than calculus (it is neither symbolic formulas nor finite differences).

Prerequisites: Linear Algebra such as 18.06 and multivariate calculus such as 18.02.

Course will involve simple numerical computations using the Julia language. Ideally install it on your own computer following these instructions, but as a fallback you can run it in the cloud here:

Topics:

Here are some of the planned topics:

Derivatives as linear operators and linear approximation on arbitrary vector spaces: beyond gradients and Jacobians.
Derivatives of functions with matrix inputs and/or outputs (e.g. matrix inverses and determinants). Kronecker products and matrix "vectorization".
Derivatives of matrix factorizations (e.g. eigenvalues/SVD) and derivatives with constraints (e.g. orthogonal matrices).
Multidimensional chain rules, and the significance of right-to-left ("forward") vs. left-to-right ("reverse") composition. Chain rules on computational graphs (e.g. neural networks).
Forward- and reverse-mode manual and automatic multivariate differentiation.
Adjoint methods (vJp/pullback rules) for derivatives of solutions of linear, nonlinear, and differential equations.
Application to nonlinear root-finding and optimization. Multidimensional Newton and steepest–descent methods.
Applications in engineering/scientific optimization and machine learning.
Second derivatives, Hessian matrices, quadratic approximations, and quasi-Newton methods.

Lecture 1 (Jan 17)

part 1: overview (slides)
part 2: derivatives as linear operators
video (MIT only)

Re-thinking derivatives as linear operators: f(x+dx)-f(x)=df=f′(x)[dx] — f′ is the linear operator that gives the change df in the output from a "tiny" change dx in the inputs, to first order in dx (i.e. dropping higher-order terms). When we have a vector function f(x)∈ℝᵐ of vector inputs x∈ℝⁿ, then f'(x) is a linear operator that takes n inputs to m outputs, which we can think of as an m×n matrix called the Jacobian matrix (typically covered only superficially in 18.02).

In the same way, we can define derivatives of matrix-valued operators as linear operators on matrices. For example, f(X)=X² gives f'(X)[dX] = X dX + dX X. Or f(X) = X⁻¹ gives f'(X)[dX] = –X⁻¹ dX X⁻¹. These are perfectly good linear operators acting on matrices dX, even though they are not written in the form (Jacobian matrix)×(column vector)! (We could rewrite them in the latter form by reshaping the inputs dX and the outputs df into column vectors, more formally by choosing a basis, and we will later cover how this process can be made more elegant using Kronecker products. But for the most part it is neither necessary nor desirable to express all linear operators as Jacobian matrices in this way.)
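As a rough numerical illustration of the two formulas above, the following plain-Python sketch (the course itself uses Julia) compares f(X+dX)−f(X) against the claimed linear operators for a small 2×2 perturbation. The helper names (`matmul`, `add`, `scale`, `inv`, `max_abs`) and the particular matrices are made up for this example; the mismatch should be second order in dX.

```python
# Plain-Python 2x2 helpers (no external libraries); all names are illustrative.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

def scale(c, A):
    return [[c * A[i][j] for j in range(2)] for i in range(2)]

def inv(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def max_abs(A):
    return max(abs(v) for row in A for v in row)

X = [[2.0, 1.0], [0.5, 3.0]]
dX = scale(1e-6, [[1.0, -2.0], [3.0, 0.5]])   # a "small" perturbation
Xp = add(X, dX)

# f(X) = X^2: compare f(X+dX) - f(X) with X dX + dX X
diff_sq = add(matmul(Xp, Xp), scale(-1.0, matmul(X, X)))
lin_sq = add(matmul(X, dX), matmul(dX, X))
err_sq = max_abs(add(diff_sq, scale(-1.0, lin_sq)))

# f(X) = X^-1: compare f(X+dX) - f(X) with -X^-1 dX X^-1
diff_inv = add(inv(Xp), scale(-1.0, inv(X)))
lin_inv = scale(-1.0, matmul(matmul(inv(X), dX), inv(X)))
err_inv = max_abs(add(diff_inv, scale(-1.0, lin_inv)))

print("X^2  mismatch:", err_sq)    # expected to be of order |dX|^2
print("X^-1 mismatch:", err_inv)   # expected to be of order |dX|^2
```

Note that neither linear operator was ever written as a 4×4 Jacobian matrix here: each is applied to dX directly as a matrix expression.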

Reviewed the (easy) derivations of the sum rule d(f+g)=df+dg and the product rule d(fg) = (df)g+f(dg), directly from the definition of f(x+dx)-f(x)=df=f′(x)[dx], dropping higher-order terms.

Further reading: Draft Course Notes (link above), sections 1, 2.1, 2.2, 2.4, 3.1, 3.3. matrixcalculus.org (linked in the slides) is a fun site to play with derivatives of matrix and vector functions. The Matrix Cookbook has a lot of formulas for these derivatives, but no derivations. Some notes on vector and matrix differentiation were posted for 6.S087 from IAP 2021.

Further reading (fancier math): the perspective of derivatives as linear operators is sometimes called a Fréchet derivative and you can find lots of very abstract (what I'm calling "fancy") presentations of this online, chock full of weird terminology whose purpose is basically to generalize the concept to weird types of vector spaces. The "little-o notation" o(δx) we're using here for "infinitesimal asymptotics" is closely related to the asymptotic notation used in computer science, but in computer science people are typically taking the limit as the argument (often called "n") becomes very large instead of very small.

Lecture 2 (Jan 19)

part 1: more examples, gradients of scalar-valued functions, the chain rule, and forward/reverse mode differentiation (slides from lecture 1). Course notes: sections 2.3, 3.2, 7.1, 8.3.
part 2: matrix Jacobians via vectorization; notes: 2×2 Matrix Jacobians (html) (pluto notebook source code). Course notes: sections 4, 5.
pset 1, due Friday, Jan 26.
video (MIT only)

Began with a review and a discussion of different viewpoints of linear operators. Although linear operators (in finite dimensions) can always be represented with matrices, we don't have to use a matrix representation, and a common confusion (or notational shorthand) is to conflate this representation of the operator with the operator itself.

Worked out some more examples of derivatives, from the slides.

When we have a scalar function f(x)∈ℝ of vector inputs x∈ℝⁿ, then this gives us a "row vector" f′(x) since f′(x)dx is a scalar, which we interpret as the transpose of the gradient ∇f (which we call a "column" vector), i.e. df = (∇f)⋅dx = (∇f)ᵀdx. We can generalize gradients to scalar functions f(x) for x in arbitrary vector spaces x ∈ V. The key thing is that we need not just a vector space, but an inner product x⋅y (a "dot product", also denoted ⟨x,y⟩ or ⟨x|y⟩); V is then formally called a Hilbert space. Then, for any scalar function, since df=f'(x)[dx] is a linear operator mapping dx∈V to scalars df∈ℝ (a "linear form"), it turns out that it must be a dot product of dx with "something", and we call that "something" the gradient! That is, once we define a dot product, then for any scalar function f(x) we can define ∇f by f'(x)[dx]=∇f⋅dx. So ∇f is always something with the same "shape" as x (the steepest-ascent direction).

Discussed the chain rule for f(x)=g(h(x)): f'(x)=g'(h(x))h'(x), where this is a composition of two linear operations, performing h' then g' (in general g'h' ≠ h'g'!). For functions from vectors to vectors, the chain rule is simply the product of Jacobians. Moreover, as soon as you compose 3 or more functions, it can make a huge difference whether you multiply the Jacobians from left-to-right ("reverse-mode", or "backpropagation", or "adjoint differentiation") or right-to-left ("forward-mode"). Showed, for example, that if you have many inputs but a single output (as is common in machine learning and other types of optimization problem), it is vastly more efficient to multiply left-to-right than right-to-left, and such "backpropagation algorithms" are a key factor in the practicality of large-scale optimization. Gave an example (see slides) of a scalar-valued function f(A(p)⁻¹b), in which left-to-right order allows us to compute ∇f using only two linear solves: one with A and a second "adjoint" solve involving Aᵀ. (We will talk more about such "adjoint methods" later.)
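To make the cost difference concrete, here is a small plain-Python sketch that only counts scalar multiplications for a chain of three Jacobians with one output and n inputs. The function names and the choice n = 1000 are assumptions made for illustration, and the count uses the naive m·k·n cost of a dense matrix product.

```python
def matmul_cost(shape_a, shape_b):
    """Scalar multiplications for an (m x k) times (k x n) product, plus the result shape."""
    m, k = shape_a
    k2, n = shape_b
    assert k == k2, "inner dimensions must agree"
    return m * k * n, (m, n)

def left_to_right_cost(shapes):
    """Cost of ((J3 J2) J1): 'reverse mode'."""
    total, acc = 0, shapes[0]
    for s in shapes[1:]:
        c, acc = matmul_cost(acc, s)
        total += c
    return total

def right_to_left_cost(shapes):
    """Cost of (J3 (J2 J1)): 'forward mode'."""
    total, acc = 0, shapes[-1]
    for s in reversed(shapes[:-1]):
        c, acc = matmul_cost(s, acc)
        total += c
    return total

n = 1000
# Jacobians in the order they appear in the product: (1 x n)(n x n)(n x n)
shapes = [(1, n), (n, n), (n, n)]
reverse_cost = left_to_right_cost(shapes)   # n^2 + n^2
forward_cost = right_to_left_cost(shapes)   # n^3 + n^2
print(reverse_cost, forward_cost)
```

With a single output, the left-to-right order only ever forms row-vector-times-matrix products, which is why its cost grows like n² rather than n³.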

Part 2: Another viewpoint on derivatives of matrix-valued functions via their relationship to the "Jacobian matrix" picture. Converting f′(A) to a conventional "Jacobian matrix" in such cases requires converting matrices A into column vectors vec(A), a process called "vectorization" of the matrix (by a common convention: simply stacking the matrix by columns). Linear operators like f′(A)[dA]=AdA+dAA can then be expressed as "ordinary" matrices via Kronecker products.
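A minimal plain-Python check of this statement for f(A)=A², assuming column-major vectorization and the standard identity vec(AXB) = (Bᵀ⊗A)vec(X), so that the Jacobian is I⊗A + Aᵀ⊗I. The helper names and the integer test matrices are chosen only for this example.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]

def vec(A):
    """Stack the columns of A into one list (column-major order)."""
    return [A[i][j] for j in range(len(A[0])) for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def madd(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

I2 = [[1, 0], [0, 1]]
A = [[1, 2], [3, 4]]
dA = [[5, 6], [7, 8]]   # the operator is linear, so dA need not be small for this check

jacobian = madd(kron(I2, A), kron(transpose(A), I2))   # 4x4 "ordinary" Jacobian matrix
lhs = vec(madd(matmul(A, dA), matmul(dA, A)))          # vec(A dA + dA A)
rhs = matvec(jacobian, vec(dA))                        # (I⊗A + Aᵀ⊗I) vec(dA)
print(lhs, rhs)
```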

Further reading (gradients): A fancy name for a row vector is a "covector" or linear form, and the fancy version of the relationship between row and column vectors is the Riesz representation theorem, but until you get to non-Euclidean geometry you may be happier thinking of a row vector as the transpose of a column vector.

Further reading (chain rule): The terms "forward-mode" and "reverse-mode" differentiation are most prevalent in automatic differentiation (AD), which we will cover later in this course. You can find many, many articles online about backpropagation in neural networks. There are many other versions of this, e.g. in differential geometry the derivative linear operator (multiplying Jacobians and perturbations dx right-to-left) is called a pushforward, whereas multiplying a gradient row vector (covector) by a Jacobian left-to-right is called a pullback. This video on the principles of AD in Julia by Dr. Mohamed Tarek also starts with a similar left-to-right (reverse) vs right-to-left (forward) viewpoint and goes into how it translates to Julia code, and how you define custom chain-rule steps for Julia AD. In other fields, "reverse mode" is sometimes called an "adjoint method": see the notes on adjoint methods and slides from 18.335 (video).

Further reading (vectorization): When you "flatten" a matrix A by stacking its columns into a single vector, the result is called vec(A), and many important linear operations on matrices can be expressed as Kronecker products.

Lecture 3 (Jan 22)

part 1: forward-mode automatic differentiation via dual numbers (notes)
part 2: finite-differences (notes)
video (MIT only)

Further reading (part 1): Course notes, section 10.1. Googling "automatic differentiation" will turn up many, many resources: this is a huge practical field these days. ForwardDiff.jl (described in detail by this paper) in Julia uses dual-number arithmetic similar to that in lecture to compute derivatives; see also this AMS review article, or google "dual number automatic differentiation" for many other reviews. Adrian Hill posted some nice lecture notes on automatic differentiation (Julia-based) for an ML course at TU Berlin (Summer 2023). TaylorDiff.jl extends this to higher-order derivatives.
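The dual-number idea can be sketched in a few lines of plain Python. This is only a toy illustration, not how ForwardDiff.jl is implemented: the `Dual` class, the `derivative` helper, and the Babylonian square-root test function are all names chosen for this example. Each number carries a value and a derivative, and the arithmetic rules are just the sum, product, and quotient rules.

```python
class Dual:
    """Number a + b*eps with eps**2 == 0; the eps coefficient carries the derivative."""
    def __init__(self, val, der=0.0):
        self.val, self.der = val, der

    @staticmethod
    def lift(x):
        return x if isinstance(x, Dual) else Dual(x)

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.val + other.val, self.der + other.der)
    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # product rule: (a + b eps)(c + d eps) = ac + (ad + bc) eps
        return Dual(self.val * other.val, self.val * other.der + self.der * other.val)
    __rmul__ = __mul__

    def __truediv__(self, other):
        other = Dual.lift(other)
        # quotient rule
        return Dual(self.val / other.val,
                    (self.der * other.val - self.val * other.der) / other.val ** 2)

def derivative(f, x):
    """Evaluate f at x + eps and read off the eps coefficient."""
    return f(Dual(x, 1.0)).der

def babylonian_sqrt(x, iters=10):
    """Iterative square root; works on floats and on Dual numbers alike."""
    t = (1 + x) / 2
    for _ in range(iters):
        t = (t + x / t) / 2
    return t

poly_derivative = derivative(lambda x: x * x * x + 2 * x, 3.0)   # 3*3^2 + 2
sqrt_derivative = derivative(babylonian_sqrt, 4.0)               # close to 1/(2*sqrt(4))
print(poly_derivative, sqrt_derivative)
```

The second example differentiates through a loop without any symbolic formula and without finite differences, which is the sense in which forward-mode AD operates on the program itself.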

Further reading (part 2): Course notes, section 6. There is a lot of information online on finite-difference approximations, e.g. these 18.303 notes or Section 5.7 of Numerical Recipes. The Julia FiniteDifferences.jl package provides lots of algorithms to compute finite-difference approximations; a particularly robust and powerful way to obtain high accuracy is to employ Richardson extrapolation to smaller and smaller δx. If you make δx too small, the finite precision (#digits) of floating-point arithmetic leads to catastrophic cancellation errors.
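The trade-off between truncation error and cancellation error is easy to see numerically. The sketch below, in plain Python with an arbitrarily chosen test function sin(x) at x = 1, prints the error of a simple forward difference for several step sizes; the helper name `forward_difference` is made up for this example.

```python
import math

def forward_difference(f, x, h):
    """First-order finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x)) / h

x0 = 1.0
exact = math.cos(x0)   # exact derivative of sin at x0
errors = {h: abs(forward_difference(math.sin, x0, h) - exact)
          for h in (1e-2, 1e-5, 1e-8, 1e-11, 1e-15)}

for h, err in errors.items():
    print(f"h = {h:.0e}   error = {err:.2e}")
```

In double precision the error typically shrinks as h decreases toward roughly 1e-8 and then grows again, because f(x+h) and f(x) agree in nearly all of their digits.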

Optional Julia Tutorial: Monday Jan 22 @ 5pm via Zoom

Virtually via Zoom. Recording (MIT only).

A basic overview of the Julia programming environment for numerical computations that we will use in this course for simple computational exploration. This (Zoom-based) tutorial will cover what Julia is and the basics of interaction, scalar/vector/matrix arithmetic, and plotting; we'll be using it as just a "fancy calculator" and no "real programming" will be required.

Tutorial materials (and links to other resources)

If possible, try to install Julia on your laptop beforehand using the instructions at the above link. Failing that, you can run Julia in the cloud (see instructions above).

Lecture 4 (Jan 24)

Handwritten notes
generalized gradients and inner products — see handwritten notes
norms and derivatives: why norms of the input and output vector spaces are needed to define a derivative - see handwritten notes
steepest-ascent: why the gradient is the steepest-ascent direction for any inner product, using the corresponding norm. example application: steepest-ascent optimization when x components have different units requires a weighted inner product and a corresponding norm - see handwritten notes
slides on nonlinear root-finding, optimization, and adjoint-method differentiation - talked about optimization only
video (MIT only)

Generalizing gradients to scalar functions f(x) for x in arbitrary vector spaces x ∈ V. The key thing is that we need not just a vector space, but an inner product x⋅y (a "dot product", also denoted ⟨x,y⟩ or ⟨x|y⟩); V is then formally called a Hilbert space. Then, for any scalar function, since df=f'(x)[dx] is a linear operator mapping dx∈V to scalars df∈ℝ (a "linear form"), it turns out that it must be a dot product of dx with "something", and we call that "something" the gradient! That is, once we define a dot product, then for any scalar function f(x) we can define ∇f by f'(x)[dx]=∇f⋅dx. So ∇f is always something with the same "shape" as x (the steepest-ascent direction).

Talked about the general requirements for an inner product: linearity, positivity, and (conjugate) symmetry (and also mentioned the Cauchy–Schwarz inequality, which follows from these properties). Gave some examples of inner products, such as the familiar Euclidean inner product xᵀy or a weighted inner product. Defined the most obvious inner product of m×n matrices: the Frobenius inner product A⋅B=sum(A .* B)=trace(AᵀB)=vec(A)ᵀvec(B), the sum of the products of the matrix entries. This also gives us the "Frobenius norm" ‖A‖=√(A⋅A)=√trace(AᵀA)=‖vec(A)‖, the square root of the sum of the squares of the entries. Using this, we can now take the derivatives of various scalar functions of matrices, e.g. we considered

f(A)=‖A‖ ⥰ ∇f = A/‖A‖
f(A)=xᵀAy ⥰ ∇f = xyᵀ (for constant x, y)
f(A)=det(A) ⥰ ∇f = det(A)(A⁻¹)ᵀ = adjugate(A)ᵀ: we will prove this later
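These three gradients can be sanity-checked numerically with the Frobenius inner product, using df ≈ ∇f⋅dA for a small dA. The following plain-Python sketch is restricted to 2×2 matrices so that the determinant and its cofactor matrix can be written out by hand; all function names and test values are invented for the illustration.

```python
import math

def frob_dot(A, B):
    """Frobenius inner product: sum of elementwise products."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def frob_norm(A):
    return math.sqrt(frob_dot(A, A))

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

def grad_norm(A):
    n = frob_norm(A)
    return [[a / n for a in row] for row in A]      # A / ‖A‖

def grad_det2(A):
    (a, b), (c, d) = A
    return [[d, -c], [-b, a]]                       # det(A) (A⁻¹)ᵀ written out for 2x2

def grad_bilinear(x, y):
    return [[xi * yj for yj in y] for xi in x]      # outer product x yᵀ

A = [[1.0, 2.0], [3.0, 4.0]]
dA = [[1e-6, -1e-6], [2e-6, 5e-7]]
Ap = [[A[i][j] + dA[i][j] for j in range(2)] for i in range(2)]

# f(A+dA) - f(A) should match ∇f ⋅ dA up to second-order terms
err_norm = abs((frob_norm(Ap) - frob_norm(A)) - frob_dot(grad_norm(A), dA))
err_det = abs((det2(Ap) - det2(A)) - frob_dot(grad_det2(A), dA))

# f(A) = xᵀAy is linear in A, so f(A) = (x yᵀ) ⋅ A holds exactly
x, y = [1.0, -2.0], [0.5, 3.0]
xAy = sum(x[i] * A[i][j] * y[j] for i in range(2) for j in range(2))
err_bilinear = abs(xAy - frob_dot(grad_bilinear(x, y), A))

print(err_norm, err_det, err_bilinear)
```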

Also talked about the definition of a norm (which can be obtained from an inner product if you have one, but can also be defined by itself), and why a norm is necessary to define a derivative: it is embedded in the definition of what a higher-order term o(δx) means. (Although there are many possible norms, in finite dimensions all norms are equivalent up to constant factors, so the definition of a derivative does not depend on the choice of norm.)

Made precise and derived (with the help of Cauchy–Schwarz) the well known fact that ∇f is the steepest-ascent direction, for any scalar-valued function on a vector space with an inner product (any Hilbert space), in the norm corresponding to that inner product. That is, if you take a step δx with a fixed length ‖δx‖=s, the greatest increase in f(x) to first order is found in a direction parallel to ∇f.

Further reading: Course notes, section 7 and section 8.2. SGJ gave another overview of optimization in 18.335 (video). There are many textbooks on nonlinear optimization algorithms of various sorts, including specialized books on convex optimization, derivative-free optimization, etcetera.

Lecture 5 (Jan 26)

More on handling units - a change of variables (to nondimensionalize the problem) is equivalent (for steepest descent) to a nondimensionalization of the inner-product/norm, but the former is typically easier for use with off-the-shelf optimization software. See end of handwritten notes from last lecture.
Adjoint/reverse-mode gradients for linear and nonlinear equations - see slides on nonlinear root-finding, optimization, and adjoint-method differentiation. Example from IAP 2023 pset 2.
Reverse-mode gradients for neural networks: handwritten backpropagation notes
The gradient of the determinant is ∇(det A) = det(A)A⁻ᵀ
video (MIT only)
pset 1 solutions and computational notebook
pset 2 (due midnight Feb 2)

Further reading on adjoint methods: Course notes, section 8.3. A useful review of topology-optimization methods in engineering (where "every pixel/voxel" of a structure is a degree of freedom) can be found in Sigmund and Maute (2013). See the notes on adjoint methods and slides from 18.335 (video).

Further reading on backpropagation for NNs: Strang (2019) section VII.3 and 18.065 OCW lecture 27. You can find many, many articles online about backpropagation in neural networks. Backpropagation for neural networks is closely related to backpropagation/adjoint methods for recurrence relations (course notes), and on computational graphs (blog post). We will return to computational graphs in a future lecture.

Further reading on d(det A): Course notes, section 9. There are lots of discussions of the derivative of a determinant online, involving the "adjugate" matrix det(A)A⁻¹. Not as well documented is that the gradient of the determinant is the cofactor matrix widely used for the Laplace expansion of a determinant. The formula for the derivative of log(det A) is also nice, and logs of determinants appear in surprisingly many applications (from statistics to quantum field theory). The Matrix Cookbook contains many of these formulas, but no derivations. A nice application of d(det(A)) is solving for eigenvalues λ by applying Newton's method to det(A-λI)=0, and more generally one can solve det(M(λ))=0 for any function M(λ); the resulting roots λ are called nonlinear eigenvalues (if M is nonlinear in λ), and one can apply Newton's method using the determinant-derivative formula here.
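As a sketch of the Newton idea for the ordinary eigenproblem: with M = A−λI and d(det M) = det(M) tr(M⁻¹ dM), we get d/dλ det(A−λI) = −det(M) tr(M⁻¹), so the Newton update simplifies to λ ← λ + 1/tr(M⁻¹). The plain-Python code below hard-codes the 2×2 case, where tr(M⁻¹) = (m11 + m22)/det(M); the function name, starting guess, and tolerances are assumptions for illustration, and the loop does not guard against tr(M⁻¹) = 0.

```python
def newton_eigenvalue(A, lam, max_iter=50, tol=1e-14):
    """Newton's method on g(λ) = det(A - λI) for a 2x2 matrix A (illustrative helper)."""
    for _ in range(max_iter):
        m11, m12 = A[0][0] - lam, A[0][1]
        m21, m22 = A[1][0], A[1][1] - lam
        det = m11 * m22 - m12 * m21
        if abs(det) < tol:          # M is numerically singular: lam is an eigenvalue
            break
        # trace of inv(M) for a 2x2 matrix M is (m11 + m22) / det(M)
        trace_inv = (m11 + m22) / det
        # g'(λ) = -det(M) * tr(inv(M)), so the Newton step -g/g' equals 1 / tr(inv(M))
        step = 1.0 / trace_inv
        lam += step
        if abs(step) < tol:
            break
    return lam

A = [[2.0, 1.0], [1.0, 3.0]]            # eigenvalues are (5 ± sqrt(5)) / 2
lam_max = newton_eigenvalue(A, 4.0)     # start above the larger eigenvalue
print(lam_max)
```

For larger matrices one would not form M⁻¹ explicitly; this toy version is meant only to show how the determinant-derivative formula turns into a Newton step.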

Lecture 6 (Jan 29)

part 1: second derivatives, bilinear forms, and Hessian matrices; applications of Hessians to optimization - course notes, section 14 + handwritten notes
part 2: calculus of variations - course notes, section 12 + handwritten notes
video (MIT only)

Further reading (part 1):

Bilinear forms are an important generalization of quadratic operations to arbitrary vector spaces, and we saw that the second derivative can be viewed as a symmetric bilinear form. This is closely related to a quadratic form, which is just what we get by plugging in the same vector twice, e.g. the f''(x)[δx,δx]/2 that appears in quadratic approximations for f(x+δx) is a quadratic form. The most familiar multivariate version of f''(x) is the Hessian matrix; Khan Academy has an elementary introduction to quadratic approximation.
Positive-definite Hessian matrices, or more generally definite quadratic forms f″, appear at extrema (f′=0) of scalar-valued functions f(x) that are local minima; there are a lot of more formal treatments of the same idea, and conversely Khan Academy has the simple 2-variable version where you can check the signs of the eigenvalues of the 2×2 Hessian just by looking at the determinant and a single entry (or the trace). There's a nice stackexchange discussion on why an ill-conditioned Hessian tends to make steepest descent converge slowly; some Toronto course notes on the topic may also be helpful.
See e.g. these Stanford notes on sequential quadratic optimization using trust regions (sec. 2.2). See 18.335 notes on BFGS quasi-Newton methods (also video). The fact that a quadratic optimization problem in a sphere has strong duality and hence is efficiently solvable is discussed in section 5.2.4 of the Convex Optimization book. There has been a lot of work on automatic Hessian computation, but for large-scale problems you can ultimately only compute Hessian–vector products efficiently in general, which are equivalent to a directional derivative of the gradient, and can be used e.g. for Newton–Krylov methods.
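The remark that a Hessian–vector product is a directional derivative of the gradient can be illustrated without ever forming the Hessian. The plain-Python sketch below uses a central difference of a hand-written gradient; the test function f(x) = x₀²x₁ + sin(x₁), the helper names, and the step size are all assumptions made for this example. The exact Hessian appears only to check the result.

```python
import math

def grad_f(x):
    """Gradient of the illustrative function f(x) = x0**2 * x1 + sin(x1)."""
    x0, x1 = x
    return [2.0 * x0 * x1, x0 ** 2 + math.cos(x1)]

def hessian_vector_product(grad, x, v, eps=1e-5):
    """Central-difference directional derivative of the gradient along v."""
    xp = [xi + eps * vi for xi, vi in zip(x, v)]
    xm = [xi - eps * vi for xi, vi in zip(x, v)]
    return [(gp - gm) / (2.0 * eps) for gp, gm in zip(grad(xp), grad(xm))]

x = [1.5, 0.7]
v = [0.3, -1.2]
hv_fd = hessian_vector_product(grad_f, x, v)

# Exact Hessian of this particular f, used for comparison only
H = [[2.0 * x[1], 2.0 * x[0]],
     [2.0 * x[0], -math.sin(x[1])]]
hv_exact = [H[0][0] * v[0] + H[0][1] * v[1],
            H[1][0] * v[0] + H[1][1] * v[1]]
hv_error = max(abs(a - b) for a, b in zip(hv_fd, hv_exact))
print(hv_fd, hv_exact, hv_error)
```

In practice the finite difference would usually be replaced by automatic differentiation of the gradient, but the cost structure is the same: a couple of gradient evaluations per product rather than a full n×n matrix.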

Further reading (part 2): There are many resources on the "calculus of variations", which refers to derivatives of f(u)=∫F(u,u′,…)dx for functions u(x), but we saw that it is essentially just a special case of our general rule df=f(u+du)-f(u)=f′(u)[du]=∇f⋅du when du lies in a vector space of functions. Setting ∇f=0 to find an extremum of f(u) yields an Euler–Lagrange equation, the most famous examples of which are probably Lagrangian mechanics and also the Brachistochrone problem, but it also shows up in many other contexts such as optimal control. A very readable textbook on the subject is Calculus of Variations by Gelfand and Fomin.

Lecture 7 (Jan 31)

part 1: derivatives with constraints, derivatives of eigenproblems (html) (julia source)
part 2: complex numbers and CR calculus - handwritten notes
video (MIT only)

Further reading (part 1): Computing derivatives on curved surfaces ("manifolds") is closely related to tangent spaces in differential geometry. The effect of constraints can also be expressed in terms of Lagrange multipliers, which are useful in expressing optimization problems with constraints (see also chapter 5 of Convex Optimization by Boyd and Vandenberghe). In physics, first and second derivatives of eigenvalues and first derivatives of eigenvectors are often presented as part of "time-independent" perturbation theory in quantum mechanics, or as the Hellmann–Feynman theorem for the case of dλ. The derivative of an eigenvector involves all of the other eigenvectors, but a much simpler "vector–Jacobian product" (involving only a single eigenvector and eigenvalue) can be obtained from left-to-right differentiation of a scalar function of an eigenvector, as reviewed in the 18.335 notes on adjoint methods.
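For a real-symmetric matrix with a simple eigenvalue, the first-order formula for dλ takes the form dλ = xᵀ dA x / xᵀx, where x is the corresponding eigenvector. The plain-Python sketch below checks this against a finite difference for a 2×2 symmetric matrix, using the closed-form largest eigenvalue; the helper `top_eigenpair` and the test matrices are invented for the illustration and assume a nonzero off-diagonal entry.

```python
import math

def top_eigenpair(A):
    """Largest eigenvalue and an unnormalized eigenvector of a symmetric 2x2 matrix (A[0][1] != 0)."""
    a, b, d = A[0][0], A[0][1], A[1][1]
    lam = 0.5 * (a + d) + math.sqrt((0.5 * (a - d)) ** 2 + b ** 2)
    return lam, [b, lam - a]

A = [[2.0, 1.0], [1.0, 3.0]]
dA = [[1e-6, -2e-6], [-2e-6, 3e-6]]   # small symmetric perturbation
A_pert = [[A[i][j] + dA[i][j] for j in range(2)] for i in range(2)]

lam, x = top_eigenpair(A)
lam_pert, _ = top_eigenpair(A_pert)

# First-order prediction: dλ = xᵀ dA x / xᵀx
x_dA_x = sum(x[i] * dA[i][j] * x[j] for i in range(2) for j in range(2))
dlam_pred = x_dA_x / sum(xi * xi for xi in x)
dlam_fd = lam_pert - lam
print(dlam_pred, dlam_fd)
```

The two numbers should agree up to terms of second order in dA; only the single eigenvector x was needed, not the full eigendecomposition.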

When differentiating eigenvalues λ of matrices A(x), a complication arises at eigenvalue crossings (multiplicity k > 1), where in general the eigenvalues (and eigenvectors) cease to be differentiable. (More generally, this problem arises for any implicit function with a repeated root.) In this case, one option is to use an expanded definition of sensitivity analysis called a generalized gradient (a k×k matrix-valued linear operator G(x)[dx] whose eigenvalues are the perturbations dλ); see e.g. Cox (1995), Seyranian et al. (1994), and Stechlinski (2022). (Physicists call this idea degenerate perturbation theory.) A recent formulation of similar ideas is called a lexicographic directional derivative; see Nesterov (2005) and Barton et al. (2017). Sometimes, optimization problems involving eigenvalues can be reformulated to avoid this difficulty by using SDP constraints (Men et al., 2014). For a defective matrix the situation is worse: even the generalized derivatives blow up, because dλ is proportional to the square root of the perturbation ‖dA‖.

Further reading (part 2): A well-known reference on CR calculus can be found in the UCSD notes The Complex Gradient Operator and the CR-Calculus by Ken Kreutz-Delgado (2009). These are also called Wirtinger derivatives. There is some support for this in automatic-differentiation packages, e.g. see the documentation on complex functions in ChainRules.jl or complex functions in JAX. The special case of "holomorphic" / "analytic" functions where we have an "ordinary" derivative (= linear operator on dz) is the main topic of complex analysis, for which there are many resources (textbooks, online tutorials, and classes like 18.04).

Lecture 8 (Feb 2)

part 1: derivatives of random functions (guest lecture by Gaurav Arya)
part 2: forward and reverse-mode automatic differentiation on computational graphs
video (MIT only): coming soon
pset 2 solutions and Julia notebook

Further reading (part 1): Course notes, chapter 13.

The idea of computing gradients of programs with a sampleable random output is called Monte-Carlo Gradient Estimation; the link leads to a nice survey.
This StackOverflow answer gives a concise, worked-out example of the reparameterization trick applied to a toy program.
The frontier of simulation-based inference gives an overview of stochastic simulations across many domains of science, and discusses attempts to deal with the fact that it is much easier to sample them than to exactly compute a "likelihood".
Auto-encoding variational bayes introduces the variational autoencoder, and introduces the reparameterization trick for use in training it.
Automatic differentiation of programs with discrete randomness describes an approach for generalizing derivatives of continuous random functions based on the "reparameterization trick" to discrete random functions. StochasticAD.jl is an associated code package that implements the stochastic triples we played with at the end of class.

Further reading (part 2): Course notes, section 10.2. See Prof. Edelman's poster about backpropagation on graphs, this blog post on calculus on computational graphs for a gentle review, and these Columbia course notes for a more formal approach. Implementing automatic reverse-mode AD is much more complicated than defining a new number type, unfortunately, and involves a lot more intricacies of compiler technology. See also Chris Rackauckas's blog post on tradeoffs in AD, and Chris's discussion post on AD limitations.
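To give a flavor of reverse mode on a computational graph, here is a deliberately tiny plain-Python sketch: each node records its parents and local partial derivatives, and a backward pass visits the nodes in reverse topological order and accumulates gradients. The `Var` class, the `sin` wrapper, and `backward` are toy names invented for this example; real reverse-mode AD systems are far more involved, as noted above.

```python
import math

class Var:
    """Node in a computational graph: value, parents, and local partial derivatives."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents      # tuple of (parent_node, d(self)/d(parent))
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(v):
    return Var(math.sin(v.value), ((v, math.cos(v.value)),))

def backward(output):
    """Accumulate d(output)/d(node) into node.grad for every node feeding into output."""
    order, seen = [], set()

    def visit(node):                # depth-first post-order: inputs before outputs
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):    # outputs before inputs
        for parent, local in node.parents:
            parent.grad += node.grad * local

x, y = Var(2.0), Var(3.0)
z = x * y + sin(x)                  # dz/dx = y + cos(x), dz/dy = x
backward(z)
print(x.grad, y.grad)
```

A single backward pass yields the derivative of the one output with respect to every input, which mirrors the left-to-right (vector-times-Jacobian) ordering of the chain rule.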
