coreylowman / dfdx

Deep learning in Rust, with shape checked tensors and neural networks

Why do gradient tapes need to start at the inputs?

emchristiansen opened this issue · comments

Kudos on putting together a system that works for all the examples you've given, plus all the people who are evidently using this code!

But, I'm having a hard time understanding why you require that inputs have gradient tapes before they go into the model, such as in this line from dfdx/examples/05-optim.rs:

let prediction = mlp.forward_mut(x.trace(grads));

IIUC, this should only be necessary if the user wants to compute gradients of the output wrt the input, while almost all the time the user only wants gradients wrt the model parameters.

In addition, I'm not sure how to even make this pattern work in my case: my input is variable-sized, lives inside nested structs, and the parts of it that are enums get turned into one-hot features inside the try_forward call.
FWIW tch easily handles even this weird use case, since a rank-zero tch Tensor is essentially a drop-in replacement for f32, but with gradients.
(Unfortunately tch has its own problems, mostly related to speed.)

It would be very helpful if you could provide some more examples / docs covering weird use cases such as mine, or maybe create a chat room where users of dfdx could help each other out 😁

On the theoretical side:

Tapes are part of the input/output of a model, not a part of the model itself.

I personally think it makes it a bit easier to reason about tapes in this way when the model splits or combines. You know whatever tape gets passed into the sub-model will be included in the output of that sub-model.

On the technical side:

It's required on the input instead of the model because the tape is a generic type parameter of the tensor, so it changes the tensor's type: a tensor with a tape is a different type than a tensor without one.

Model parameters are stored as tensors. If we stored the tape in the model parameters, the type of the parameters would change throughout the forward pass, since the tape would move from one parameter tensor to the next.

It would be very helpful if you could provide some more examples / docs covering weird use cases such as mine, or maybe create a chat room where users of dfdx could help each other out 😁

Check out the discord (link at top of readme)! I'm not as active there but there are tons of people who can answer questions.

Always open to more examples! Want to post a minimal example of something similar to your input?

Will close for now - feel free to open a PR with an example