dfdx / Espresso.jl

Expression transformation package


Differentiation rules for tensors

dfdx opened this issue · comments

One of the simplest and most popular examples of where we need tensors is the derivative of a matrix-matrix multiplication:

C = A*B

If all three are matrices, then the derivative of C w.r.t. A is actually a 4-dimensional tensor with elements defined as:

dC[i,j]/dA[p,q] = B[q, j] when i == p (and 0 otherwise)

Even for this simple case it's unclear how to define a symbolic expression corresponding to the whole derivative dC/dA, and in the more general case we may want to differentiate operations over tensors with arbitrarily many dimensions.
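
Just to make the component formula above concrete, here is a quick finite-difference sanity check (plain Julia, arbitrary sizes; nothing Espresso-specific):

A, B = rand(3, 4), rand(4, 2)
C = A * B                      # C is 3×2, so dC/dA has shape 3×2×3×4
eps_ = 1e-6
dC = zeros(3, 2, 3, 4)         # dC[i,j,p,q] ≈ d(C[i,j]) / d(A[p,q])
for p in 1:3, q in 1:4
    Ap = copy(A); Ap[p, q] += eps_
    dC[:, :, p, q] = (Ap * B .- C) ./ eps_
end
# Each component matches the rule: B[q,j] when i == p, zero otherwise.
for i in 1:3, j in 1:2, p in 1:3, q in 1:4
    @assert isapprox(dC[i, j, p, q], (i == p) ? B[q, j] : 0.0; atol=1e-4)
end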

Some libraries like Theano and ReverseDiffSource.jl solve it by restricting the output variable to be a scalar, but systems like autograd seem to be able to handle multidimensional tensor derivatives, so we may take them as a source of inspiration.

Hi dfdx,

I just stumbled across your work and am really happy with what's described in the readme. I'll try it as soon as I get back to symbolic manipulation. I'll also have time to help you out on this issue when I get back from holiday, if you'd like. Just to get this straight, we're talking about tensor components, right? Is there any expression simplification required? That would require some index permutation rules and the notion of dimension-wise symmetry or asymmetry.

Best,
Robert

@rleegates Thanks, help with this issue is especially appreciated, because my mind isn't yet fully compatible with higher-rank tensor operations :) Basically, I'm trying to come up with an idea of how differentiation of tensors might look in a symbolic expression. Here's an example. In diff_rules.jl I have a rule for differentiating x * y where x and y are numbers:

@diff_rule *(x::Number, y::Number) 1 y

So the derivative of a product of 2 numbers w.r.t. its first argument is just its second argument. This trick also works for the dot product of 2 vectors:

@diff_rule vecdot(x::AbstractVector, y::AbstractVector) 1 y

But even for matrix-vector multiplication the derivative is a rank-3 tensor that, to my knowledge, doesn't have a "nice" symbolic representation in Julia.

In theory, we could write something like:

@diff_rule *(x::Matrix, y::Vector) 1 begin
    # tensorindexes and component_derivative are hypothetical helpers (see note below)
    d = zeros(size(x)..., length(y))
    for (i, j, k) in tensorindexes(x, y)
        d[i, j, k] = component_derivative(x, y, i, k, j)
    end
    d
end

I don't have recipes for tensorindexes and component_derivative yet, and the whole code would be terribly inefficient, but let it be. However, at the next step we would need to symbolically multiply this huge expression by the next partial derivative, producing an even scarier expression. In addition, the current simplification framework (see simplify.jl) is mostly useless in this case.

My assumption is that there's some set of tensor operations that would allow us to write something like:

@diff_rule *(x::Matrix, y::Vector) 1 some_fancy_tensor_operation(x, y)

This looks much better and is easier to simplify at later stages. Note that simplification here is important not only because the final expression looks better, but also because it allows us to add tensor-specific optimizations. For example, if we have the expression:

y = sum(W*x)

where W is a matrix and x is a vector, we could rewrite the partial derivatives in a way that doesn't produce the huge rank-3 derivative dy/dW, replacing it with already-summed values. (I hope this explanation is not too messy.)

Anyway, I'm currently in the slow process of getting into tensor algebra, so if you can suggest good reading on the topic or possible solutions to the problems above, I'll be very glad to hear it.

This might be useful for us (noting it here so we don't forget): https://github.com/Jutho/TensorOperations.jl

No, your explanation isn't messy at all. Here's what we need to keep in mind: we will need to confine ourselves to expressions in index notation, where all components are with respect to the canonical orthonormal basis in R^n. So the tensor really isn't much of a tensor anymore; it's rather just a multidimensional array of basis coefficients. This simplifies much of what we need to do, as we don't need any information on the manifold (i.e. its tangent bundle) with respect to which the local form of the tensor field is defined. I think this is fine for the moment.

The really messy part is the simplification work that would need to be implemented. For example, one needs rules to catch things like symm(A) : skew(B) = 0 and to generalize these notions to higher-order tensors. This requires attaching certain properties to the tensorial objects, and therefore this topic lies outside of pure expression parsing. There will also be a need to canonicalize tensor expressions; there are some algorithms for this, which rely on directed acyclic graph structures. The question is whether this is really the way to go here. Usually, where it gets too complicated, there's a simpler solution which requires more theoretical insight. One very cheap solution, of course, is stochastic simplification, where we throw random inputs at subexpressions to test if they evaluate to the same output. However, this can be error-prone. If one sticks with actual symbolic simplification, this would first require expression-matching machinery which is able to compute the relative permutation of the "matched" tensor indices with respect to the "search" expression.

Another way, which might be simpler, is to actually compute the result of a tensor expression symbolically, and then differentiate each component of the result with respect to each component of the desired free variable. This results in a multidimensional array of symbolic expressions, which can then be simplified component-wise. There is no tensor-expression simplification involved, because all we'd be doing is looking at the components. The output is not really human-readable, as everything gets mangled into a single array instead of a resulting tensor expression; however, it may be more efficient, depending on how well the scalar simplification does its task.

Let me just illustrate the difference in two dimensions:

If we treat tensors symbolically:
diff(:(A[i,j,k,l]*B[k,l]), B[m,n]) results in :(A[i,j,k,l]*δ[k,m]*δ[l,n]) and is simplified to :(A[i,j,m,n])

If we treat only components symbolically:
diff(:(A[i,j,k,l]*B[k,l]), B[m,n]) is reparsed into diff(:(A[i,j,1,1]*B[1,1] + A[i,j,1,2]*B[1,2] + A[i,j,2,1]*B[2,1] + A[i,j,2,2]*B[2,2]), B[m,n]) which results in :(A[i,j,1,1]*1 + A[i,j,1,2]*0 + A[i,j,2,1]*0 + A[i,j,2,2]*0) for m, n = 1, 1. This is then simplified to :(A[i,j,1,1]). All other components (m,n) are computed analogously, giving an array of expressions [:(A[i,j,$m,$n]) for m=1:2,n=1:2] (or a variant thereof).
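
As a quick stand-alone check of the delta contraction in the first variant (plain Julia with an arbitrary 2×2×2×2 shape, not tied to any package): contracting with the two Kronecker deltas just renames the summed indices, so the result must equal A[i,j,m,n] component-wise.

using LinearAlgebra   # for the identity matrix I

d = 2
A = rand(d, d, d, d)
δ = Matrix{Float64}(I, d, d)   # the Kronecker delta in component form
D = zeros(d, d, d, d)
for i in 1:d, j in 1:d, m in 1:d, n in 1:d
    # contract A[i,j,k,l]*δ[k,m]*δ[l,n] over k and l
    D[i, j, m, n] = sum(A[i, j, k, l] * δ[k, m] * δ[l, n] for k in 1:d, l in 1:d)
end
@assert D ≈ A   # component-wise this is exactly :(A[i,j,m,n])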

This of course then results in a multidimensional array of symbolic expressions, which can then be simplified component-wise.

As far as I understand, it should be pretty straightforward to transform a single tensor expression into an array of component expressions and perform simplification on them. Though I'm not sure whether the reverse transformation - from an array of components to a single tensor expression - is also feasible.

diff(:(A[i,j,k,l]*B[k,l]), B[m,n]) results in :(A[i,j,k,l]*δ[k,m]*δ[l,n]) and is simplified to :(A[i,j,m,n])

Ah, I feel like an uneducated dilettante, but could you please give some references for the notation in use? I think I've got your idea, but I'd like to have a good understanding of the formal rules as well.

we will need to confine ourselves to expressions in index notation, where all components are with respect to the canonical orthonormal basis in R^n.

IIUC, this limits us to derivatives w.r.t. vector-valued variables, right? And what about derivatives w.r.t. matrices (e.g. d(A*B)/dB)?

As far as I understand, it should be pretty straightforward to transform a single tensor expression into an array of component expressions and perform simplification on them. Though I'm not sure whether the reverse transformation - from an array of components to a single tensor expression - is also feasible.

No, I don't think the reverse is possible in all cases.

Ah, I feel like an uneducated dilettante, but could you please give some references for the notation in use? I think I've got your idea, but I'd like to have a good understanding of the formal rules as well.

Don't! This is where we all started out at some point. The notation I am using is Einstein's summation convention, and δ is the Kronecker delta, i.e. the identity matrix. The two Kronecker deltas together form the fourth-order identity tensor which results when we compute the derivative of a second-order tensor w.r.t. itself: dA[i,j]/dA[k,l] = δ[i,k]*δ[j,l].

IIUC, this limits us to derivatives w.r.t. vector-valued variables, right? And what about derivatives w.r.t. matrices (e.g. d(A*B)/dB)?

No, this doesn't limit us at all to vector-valued variables. What it does limit is the treatment of tensors as coordinate-independent objects. When defined with respect to the canonical basis of R^n, tensors basically are just matrices of components; when defined with respect to a curvilinear coordinate system, however, the coefficients might stay constant for all points x on the manifold while the tangent basis vectors change w.r.t. a reference space as they follow the coordinate lines.

Just wanted to chime in and say that I am lurking around and think this is all great! I wish I could make more useful comments, but I am afraid this is a little above my current skill set.

@Evizero In fact, it's not as hard as it looks at first glance. Here's what I've found out about tensors and what I'm currently working on, just to give you an idea of the progress.

  1. In physics, tensors are particularly useful for keeping values (like scalars and vectors) invariant under transformations between different bases (e.g. when rotating axes), and to achieve this, complicated notions like covariance, contravariance, etc. are introduced. But for machine learning we normally don't change the basis (i.e. the set of features), or only change it between 2 orthogonal bases (as in PCA), and thus can indeed treat tensors just as multidimensional arrays (thanks @rleegates for clarifying this).
  2. One very convenient thing that tensor algebra uses is Einstein notation, which allows us to describe how components of an output tensor depend on components of the inputs. E.g. we could describe matrix-by-matrix multiplication as C[i, j] = A[i, k] * B[k, j] and infer that we sum over k (because it occurs twice) and repeat it for each i and j (since they occur only once; see 1 for a more detailed description of Einstein notation, and the sketch right after this list for a code version).
  3. Since every entry in Einstein notation is just a number, we can use standard differentiation rules. In the example above we know that dC[i,j] / dA[i, k] is just B[k, j].
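
To make point 2 concrete, here is what the same statement looks like with the @einsum macro from Einsum.jl (mentioned further down in this thread); a minimal sketch, assuming the macro's := allocating form:

using Einsum   # https://github.com/ahwillia/Einsum.jl

A, B = rand(3, 4), rand(4, 5)
# k is repeated on the right-hand side, so it is summed out;
# i and j occur once each, so the statement holds for every (i, j).
@einsum C[i, j] := A[i, k] * B[k, j]
@assert C ≈ A * B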

Of course, there are still many questions to figure out. For example, I don't think forcing people to write their functions in Einstein notation is a good idea (although we should definitely allow it). Instead, I believe we need to automatically convert expressions like:

y = W*x + b

to

y[i] = W[i, j] * x[j] + b[i]

differentiate, and convert back when possible. This way, if a user of the library has tensor inputs and outputs of rank up to 2 (i.e. scalar - rank 0, vector - rank 1, or matrix - rank 2), he or she should never even see index notation, regardless of the rank of intermediate tensors.
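
As a hand-written illustration (plain loops, arbitrary sizes; not generated code), the converted form computes exactly the same values as the vectorized one:

W, x, b = rand(3, 4), rand(4), rand(3)
# j is repeated inside the product W[i,j]*x[j], so it is summed out;
# i indexes the output, and b[i] is combined elementwise.
y = [sum(W[i, j] * x[j] for j in 1:4) + b[i] for i in 1:3]
@assert y ≈ W * x + b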

I'll post back from time to time to keep people informed (and to be corrected when I do stupid things). For those interested in tensors themselves (not just for machine learning purposes) I recommend these 2 papers:

  1. Introduction to Tensor Calculus
  2. A Gentle Introduction to Tensors

A couple of tasks that need to be implemented, just to keep track of the progress:

  • Update parse! in rdiff.jl to support index notation.
  • Update evaluate! in rdiff.jl to support index notation.
  • Implement differentiation for tensor components.
  • Transform matrix expressions to Einstein notation and back (when possible). Conversion to Einstein notation is mostly done (see einstein.jl in branch tensor-diff), but it's worth checking it once again. Backward conversion isn't implemented yet.
  • Replace expression expansion with expression composition, e.g. don't expand C = A + B; D = C + 1 into D = A + B + 1, because it becomes harder to handle using index notation.
  • Think through trace / inner contraction, e.g. how to handle A[i, j, i] while parsing and differentiating.

One thing that I try to keep in mind is the possibility of generating code for different systems, e.g. making up BLAS expressions (call it blassify?) or code for GPUs. Though I haven't really put much effort in this direction yet, so feel free to propose fresh ideas.

Concerning GPU: maybe one way to go would be to try to use functions that ArrayFire.jl understands. In such a case the GPU support would more or less come for free.

@ahwillia is working on some tensor stuff

Thanks for the ping. Here are some potentially helpful repos:

https://github.com/ahwillia/Einsum.jl
https://github.com/Jutho/TensorOperations.jl

I'd like to make more... not enough bandwidth at the moment.

@ahwillia Since you have worked a lot with Einstein notation, I think you can confirm or deny one of my hypotheses. I found 4 types of operations that may be used in this notation:

  1. Tensor product, e.g. C[i,j] = A[i,k] * B[k,j].
  2. "Inner" contraction, e.g. A[i, j, i].
  3. Binary operations other than product, e.g. c[i] = a[i] + b[i].
  4. Unary functions, e.g. exp(c[i]).

The general rule for Einstein summation that I've seen so far says that indices which appear twice in an expression should be summed out, and indices which occur only once mean "for all such elements". However, in the expression types above I see this applies only to the first 2 cases. E.g. in c[i] = a[i] + b[i] the index i occurs twice, yet intuitively it means "for all", not "sum out".

Does my hypothesis sound right to you? If not, can you show an example that refutes it?

I wouldn't say I'm an expert, but it seems to me that computer scientists and engineers have generalized Einstein notation to include case 3, even though strictly speaking you should sum over i since it is shown twice. See: http://docs.scipy.org/doc/numpy/reference/generated/numpy.einsum.html

It may be an abuse of notation, but I think about it like "whatever indices are missing on the left hand side, but appear on the right hand side, should be summed out."

Are you interested in writing specialized code that is optimized for each of these four cases? Have you seen this - https://github.com/dgasmith/opt_einsum?

I would be very grateful if you were to open PRs to Einsum.jl for anything you come up with. I can use all the help I can get!

I would add outer products to your list (I suppose you could say it is subsumed by the tensor product you have as case 1):

@einsum A[i,j] = B[i] * C[j]

There are other use cases, though they might not be of interest to you. One would be the CP-decomposition of tensors:

@einsum X[i,j,k] = A[r,i] * B[r,j] * C[r,k]

This approximates the tensor X as a sum (over r) of rank-one tensors (formed by outer products of the columns of the "factor matrices" A, B, C).

It may be an abuse of notation, but I think about it like "whatever indices are missing on the left hand side, but appear on the right hand side, should be summed out."

Unfortunately it doesn't work for me, since I don't always have a left-hand side :) In automatic differentiation (and I generally follow the AD architecture) we split expressions into primitive calls and assignments, so, for example, the expression D[i, j] = A[i, k] * B[k, j] + C[i, j] is transformed into:

W1[??] = A[i, k] * B[k, j]
W2[??] = W1[??] + C[i, j]
D[i, j] = W2[??]

where ?? is to be inferred. This is exactly why I'm trying to determine the possible types of expressions, and so far such inference looks pretty doable: I already have subroutines for the 4 types of expressions I described, and I have an idea of how to handle the outer product and the CP decomposition as well (although I think I'll postpone them for now).
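
Just to illustrate what such inference could look like, here is a hypothetical helper (not Espresso's actual implementation): in a product, indices seen twice are summed out and the remaining ones are free; in an elementwise operation, the operands' indices must coincide and all stay free.

function infer_indices(op::Symbol, arg_idxs::Vector{Vector{Symbol}})
    if op == :*
        all_idxs = vcat(arg_idxs...)
        # free indices appear exactly once; repeated indices are summed out
        return [i for i in unique(all_idxs) if count(==(i), all_idxs) == 1]
    else   # +, -, and other elementwise operations
        @assert all(idxs == arg_idxs[1] for idxs in arg_idxs) "indices must match"
        return arg_idxs[1]
    end
end

infer_indices(:*, [[:i, :k], [:k, :j]])   # --> [:i, :j]  (indices of W1 above)
infer_indices(:+, [[:i, :j], [:i, :j]])   # --> [:i, :j]  (indices of W2 above)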

Regarding PRs to Einsum.jl, the plan is:

  1. Implement tensor differentiation in whatever way is easier.
  2. Refactor and clean up to make it look "nice".
  3. Integrate with other packages.

I'm still at step 1, but I definitely have plans to move some code to Einsum.jl and/or TensorOperations.jl somewhere between steps 2 and 3.

Ah, I see. No way to pass the indices (i, j in your example) down the computational graph? If not, just carefully document what the conventions are. I'm not sure how you want to handle the elementwise operations... something like this?

c[??] = a[i] + b[i]        # elementwise
c[??] = a[i] - b[i]        # elementwise
c[??] = a[i]*b[i]          # dot product
c[??] = d[i] + a[i]*b[i]   # error?

Sending a ping to @Jutho to join this discussion. I think what you're doing may fit better with TensorOperations, and I know he was interested in further functionality along these lines. Keep playing around - I'm excited to see what you come up with!

IMHO, Einstein notation does not allow for element-wise operations (in the sense of Hadamard products). I also believe that these are not defined in tensor algebra and as such should not be supported. c[i] = a[i] + b[i], however, is supported, as the general rule of Einstein's summation convention is that an index which occurs (at most) twice in a multiplication (or more specifically, in a summand) implies summation over that index. This is not an extension of the notation but is purely its definition.

as the general rule of Einstein's summation convention is that an index which occurs (at most) twice in a multiplication

Thanks for clearing this up. I'm pretty sure numpy einsum also supports Hadamard products though.

these are not defined in tensor algebra and as such should not be supported

Why limit the scope to tensor algebra? Not saying you're wrong - just curious about your reasons for thinking this. Either way, I agree it probably makes sense to start there.

Why limit the scope to tensor algebra?

I think that mixing algebraic rules may lead to incompatibility and ambiguity; however, this should be cleared up by someone more familiar with abstract algebra.

The most capable tensor algebra suite that I am aware of is the software Cadabra, perhaps its documentation can be of use to this discussion.

I'm not sure how you want to handle the elementwise operations... something like this?

@ahwillia I learnt about Einstein notation from 1, and it clearly describes both elementwise operations and the dot product. So your first 3 examples expand, according to this paper, into:

c[i] = a[i] + b[i]
c[i] = a[i] - b[i]
c = c[] = a[i]*b[i]

The last of your examples - c[??] = d[i] + a[i]*b[i] - indeed doesn't have an interpretation and should probably produce an error. Note that TensorOperations.jl treats these examples the same way (with the dot product defined as @tensor c[] = a[i] * b[i] and producing a 1-element vector).

Guys, I'm curious: what's your primary use case for tensor algebra? I came from machine learning and mostly think about tensors as helpers for things like derivatives of a matrix by a matrix, but you seem to have a more general (and I believe more systematic) vision of tensor algebra.

Not me! I'm mostly in the machine learning camp too.

In my work, tensors arise in continuum mechanics, differential geometry, and the exterior calculus of differential forms.

After several iterations of refactoring and reconsideration, I think I'm now close to assembling all the parts needed to create a method for symbolic differentiation of tensors. Today I found one valuable detail that seems to come for free. Suppose we want to differentiate this expression:

y = sum(W*x + b)

where x and b are vectors and W is a matrix. We first split it into primitive expressions like this (this is called the "forward pass" in the parlance of automatic differentiation):

s = W*x
t = s + b
y = sum(t)

Let's say we want to find the derivative dy/dW, which is a matrix of the same size as W. We start with the derivative dy/dy, which is just 1. Going back, we find that dy/dt is a vector with elements:

dy/dt[i] = t[i]

Then, using the chain rule, we find out that:

dy/ds[i] = dy/dt[i] * dt[i]/ds[i] = t[i] * 1 = t[i]

Interesting stuff starts when we try to find dy/dW, since the chain rule now involves an array with 3 dimensions (i.e. 3 indices):

dy/dW[j,k] = dy/ds[i] * ds[i]/dW[j, k]

We can express ds[i]/dW[j, k] (or any derivative of a vector by a matrix) using an index comparison trick (or using repeated indices as suggested in ahwillia/Einsum.jl#16, but let's ignore that for now):

ds[i]/dW[j, k] = (i == j) * x[k]

I.e. all components of the derivative where i != j will be zero. Putting this into the chain rule, we get:

dy/dW[j,k] = dy/ds[i] * ds[i]/dW[j, k] = t[i] * (i == j) * x[k] = t[j] * x[k]

which is essentially an outer product of 2 vectors!

dy/dW = t * x'

So using only symbolic manipulations and no special simplification rules we've got fully vectorized output eligible for all kinds of optimizations (e.g. converting to BLAS).

Of course, this won't work that well for derivatives of higher order, e.g. dt/dW can still be expressed only using Einstein notation, but it's still better than what most libraries for automatic or symbolic differentiation propose :)


I am confused by dy/dt[i] = t[i] if y = sum(t). Shouldn't that be dy/dt[i] = 1? Or dy/dt[i] = t[i] if y = sum(t.^2)/2?

Ah, right! Thanks for the correction. Then, if I haven't missed anything again, we get:

dy/dt[i] = 1
dy/ds[i] = dy/dt[i] * dt[i]/ds[i] = 1 * 1 = 1
dy/dW[j,k] = dy/ds[i] * ds[i]/dW[j, k] = 1 * (i == j) * x[k] = (i == j) * x[k] = x[k]

i.e. for all j:

dy/dW[j,k] = x[k] 

which can be vectorized as:

dydW = repmat(x', 3)

(or something smarter)

I'll check it on a couple more examples, but I guess for most functions of the form R^n -> R (i.e. most loss functions) we can generate fully vectorized code quite easily.
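
For what it's worth, here is a quick finite-difference check of the result above (arbitrary 3×4 shape; repeat replaces repmat on newer Julia versions):

W, x, b = rand(3, 4), rand(4), rand(3)
f(W) = sum(W * x + b)
eps_ = 1e-6
dydW = zeros(3, 4)
for j in 1:3, k in 1:4
    Wp = copy(W); Wp[j, k] += eps_
    dydW[j, k] = (f(Wp) - f(W)) / eps_
end
@assert dydW ≈ repeat(x', 3, 1)   # every row of dy/dW is x'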

A note on the state of tensor differentiation: currently, multiplication of tensors up to rank 2 and elementwise operations are supported. Here's an example:

julia> ex = :(D[i,j] = A[i,k] * B[k,j] + log(C[i,j]))
:(D[i,j] = A[i,k] * B[k,j] + log(C[i,j]))

julia> using Espresso

julia> rdiff(ex, A=rand(3,2), B=rand(2, 3), C=rand(3,3))
3-element Array{Expr,1}:
 :(dD[i,j] / dA[k,l] = B[l,j] * (i == k))                 
 :(dD[i,j] / dB[k,l] = A[i,k] * (j == l))                 
 :(dD[i,j] / dC[k,l] = (1 / C[k,j]) * (j == l) * (i == k))

Here, statements like i == k (I call them "guards", a name taken from Haskell) encode (in)dependencies between elements: e.g. dD[i,j] / dA[k,l] = B[l,j] * (i == k) means that D[i,j] depends on A[k,l] only when i == k, while for all other elements there's no dependency and the derivative is zero.
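
To show how such a guard behaves downstream, here is a hand-written illustration (not Espresso output) of contracting the dD/dA rule above with a made-up upstream gradient G[i,j] = dL/dD[i,j]: the guard simply pins i to k, so the contraction collapses to the familiar matrix form G * B'.

A, B = rand(3, 2), rand(2, 3)   # so D = A*B + ... is 3×3
G = rand(3, 3)                  # hypothetical dL/dD
dLdA = zeros(3, 2)
for k in 1:3, l in 1:2
    # dL/dA[k,l] = G[i,j] * B[l,j] * (i == k), summed over i and j
    dLdA[k, l] = sum(G[i, j] * B[l, j] * (i == k) for i in 1:3, j in 1:3)
end
@assert dLdA ≈ G * B'           # i.e. dL/dA[k,l] = sum_j G[k,j] * B[l,j]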

Here's a new list of what needs to be done, mostly for me to keep track of these tasks:

  • conversion from vector to Einstein notation
  • conversion from Einstein to vector notation
  • support for constants; we should not search for derivatives of the output w.r.t. constants
  • support for scalars; scalars are different from normal variables in that they do not have an index, e.g. C[i,j] = A[i, j] + 1
  • support for sum(x), mean(x), etc.; summation is special in Einstein notation, and one way to handle it is to transform sum(x) into x[i] * I[i], where I is a pseudo-variable for which I[k] = 1.0 for any k
  • support for the inner product, c[] = a[i] * b[i]; the outer product, e.g. c[i,j] = a[i] * b[j], is already supported
  • inner contraction; might be trivial or deadly hard, haven't thought about it at all yet
  • clean up, tests, documentation

I suppose the inner product will also work if we can run eval(:(@einsum c[] := a[i] * b[i])), which currently produces an assertion error. @ahwillia, does this fit into Einsum.jl (sorry if I have already asked)?

Please, feel free to comment on existing items or add new ones.

Closing this since differentiation moved to XDiff.jl long ago.