coreylowman / dfdx

Deep learning in Rust, with shape checked tensors and neural networks

Is it possible to train models with multiple inputs?

emchristiansen opened this issue

The example at dfdx/examples/05-optim.rs shows how to train a model with a single input (in this case x), but the API doesn't appear to support training a model with multiple inputs, e.g. a multimodal model that takes both images and text.

In particular, in the given example you create grads like this:

let mut grads = mlp.alloc_grads();

You then consume the grads in this line:

let prediction = mlp.forward_mut(x.trace(grads));

So, what should we do if there are multiple inputs, each of which would consume the grads?

I think it depends on how your model works internally. What does the forward method look like? There are a bunch of ways you can do it. Probably the most straightforward is to just have one of the inputs have the tape, and then the forward method would have to make sure the tape is moved around properly.
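For example, something like the following (just a sketch, not tested; img_encoder, txt_encoder, head, and the shapes are all made up, and split_tape / put_tape are the tensor methods for taking the tape off one tensor and attaching it to another):

fn forward(
  &self,
  img: Tensor<Rank1<64>, f32, Cpu, OwnedTape<f32, Cpu>>, // this input holds the tape
  txt: Tensor<Rank1<32>, f32, Cpu>, // NoneTape
) -> Tensor<Rank1<10>, f32, Cpu, OwnedTape<f32, Cpu>> {
  // ops on img are recorded because img owns the tape
  let img_feat = self.img_encoder.forward(img);

  // take the tape off the image branch and hand it to the text branch,
  // so the text branch's ops get recorded on the same tape
  let (img_feat, tape) = img_feat.split_tape();
  let txt_feat = self.txt_encoder.forward(txt.put_tape(tape));

  // any binary op (here, add) merges the taped and untaped halves back together
  let fused = txt_feat + img_feat;
  self.head.forward(fused)
}

The key point is that there is only one tape; whichever tensor ends up holding it at the end of forward has the full record of both branches.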

Will close this for now - feel free to keep asking questions here!

Can you expand on what you mean by:

Probably the most straightforward is to just have one of the inputs have the tape, and then the forward method would have to make sure the tape is moved around properly.

E.g. assume the forward method looks like this:

fn try_forward(
  &self,
  x: Tensor<...>,
  y: Tensor<...>,
) -> Result<Self::Output, Self::Error>;

And assume I've already called let x = x.trace(grads).
So x has the gradient tape and y has NoneTape.
How would I "move the tape around properly", assuming this is what you're suggesting?

I think I understand at least one source of my confusion: the word "tape" in gradient tape made me assume the underlying data structure was linear (e.g. a stack accumulating intermediate partials in the style of a chain rule expansion)*.
So I was trying to understand how the backprop pass would be "linearized" backwards along the forward path the tape traversed.

But, the tape is just a DAG, right?
And the current owner of the tape object corresponds to the node in the DAG that I'll extend (add edges to) if I perform an op.
As a user, I just need to ensure that if I want to differentiate through a given op, one of the input tensors to the op must be the current owner of the tape object.

*For the record, that is extremely confusing nomenclature. It's like calling something a "foo queue" when it's actually a tree under the hood. Oooof.
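To check my understanding in code (a minimal sketch, assuming dev is a Cpu device, grads came from alloc_grads(), and the shapes are arbitrary):

let x = dev.zeros::<Rank1<4>>().trace(grads); // x now owns the tape
let y: Tensor<Rank1<4>, f32, Cpu> = dev.ones(); // NoneTape
let z = x * y.clone(); // recorded: one input (x) owns the tape, so z owns it now
let w = y.clone() * y; // not recorded: neither input owns a tape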

I wrote up an example demonstrating how difficult it is to deal with gradient tapes when your network has multiple inputs and outputs: https://gist.github.com/emchristiansen/8d84b3a36f1333526810e1c99a3a4335

Is there some technique I'm missing, or is it really this hard?

Also, I didn't like how I had to mentally keep track of which tensor had the tape, so I'm proposing this design: https://gist.github.com/emchristiansen/db80f5e85c791f6bb5bba5b78b750cd9

What do you think?

Yeah, at least one of the inputs must have the tape for an op to be recorded, and then tapes are merged together later.

But, the tape is just a DAG, right?
And the current owner of the tape object corresponds to the node in the DAG that I'll extend (add edges to) if I perform an op.
As a user, I just need to ensure that if I want to differentiate through a given op, one of the input tensors to the op must be the current owner of the tape object.

Yep, exactly correct. It is indeed a DAG. "Gradient tape" is pretty common nomenclature in the AD world; e.g. TensorFlow has a similar GradientTape object: https://www.tensorflow.org/api_docs/python/tf/GradientTape.

Is there some technique I'm missing, or is it really this hard?

Do the outputs stay completely separate forever? Normally you'd add them together at some point, in which case the tapes would be merged by the add method.
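E.g. roughly like this (a sketch, not tested; target_a / target_b and the two-input, two-output model are placeholders, and I'm assuming the forward method threaded the single tape through both branches and left it on out_a):

// out_a ends up holding the tape, out_b has NoneTape, but both branches
// were already recorded on the one tape while forward moved it around
let (out_a, out_b) = model.forward(x.trace(grads), y.clone());
let loss_a = mse_loss(out_a, target_a.clone());

// move the tape over so the ops computing loss_b are recorded too
let (loss_a, tape) = loss_a.split_tape();
let loss_b = mse_loss(out_b.put_tape(tape), target_b.clone());

// the add merges the taped and untaped halves into one scalar loss
let loss = loss_b + loss_a;
let grads = loss.backward();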

Also, I didn't like how I had to mentally keep track of which tensor had the tape, so I'm proposing this design:

Feel free to do that; it's a totally valid way to move the tape around!