pytorch / pytorch

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Home Page: https://pytorch.org

Efficient handling of special gradient values in autograd

albanD opened this issue

We now have a few places in the code where we special-case certain inputs and return specific subgradients.
For example, here for the norm function at 0.
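For context, a minimal illustration of the kind of special case being discussed (added here, not part of the original comment): the gradient of the Euclidean norm is undefined at 0, and autograd returns the subgradient 0 there instead of NaN.

import torch

# The 2-norm is not differentiable at 0: d/dx ||x|| = x / ||x|| divides by zero.
# Autograd special-cases this point and returns the subgradient 0 instead of NaN.
x = torch.zeros(3, requires_grad=True)
x.norm().backward()
print(x.grad)  # tensor([0., 0., 0.]), not tensor([nan, nan, nan])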

We now have a few functions that perform this same kind of special-casing.
It would be useful to give users who know they will not encounter these problematic points a way to disable this handling.

We could provide a general flag that disables this special handling during the backward pass.

  • Either the flag is read while backward runs (so in general it has to stay set for the whole backward pass, unless hooks are used to change it).
  • Or the flag is read during the forward pass and recorded as an argument to the backward function. (A rough sketch of both options follows below.)
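To make the two options concrete, here is a purely illustrative sketch; none of these names exist in PyTorch, and the flag, helper, and backward formula are all hypothetical.

import threading

import torch

# Hypothetical sketch only: a thread-local flag consulted by a backward formula.
_state = threading.local()

def skip_special_cases_enabled():
    return getattr(_state, "skip_special_cases", False)

class skip_special_cases:
    # Option 1: the flag is read while backward runs, so it has to stay set
    # for the whole backward pass (unless hooks flip it mid-way).
    def __enter__(self):
        self._prev = skip_special_cases_enabled()
        _state.skip_special_cases = True
    def __exit__(self, *exc):
        _state.skip_special_cases = self._prev

def norm_backward(grad_output, x, norm):
    # Option 1: consult the thread-local flag at backward time.  Option 2 would
    # instead read the flag during the forward pass and save its value alongside
    # x and norm when the backward node is recorded.
    if norm == 0 and not skip_special_cases_enabled():
        return torch.zeros_like(x)     # special-cased subgradient
    return grad_output * x / norm      # plain formula: divides by zero at x == 0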

cc @ezyang @ssnl @albanD @zou3519 @gqchen @VitalyFedyunin @ngimel

So, something like anomaly mode? It sounds plausible, just need a proposal!

Like anomaly mode but in reverse: It makes your code go faster, but can introduce extra NaNs if you're evaluating your functions at the wrong points.
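For comparison, the existing anomaly mode (torch.autograd.detect_anomaly) is the opposite trade-off: backward gets slower, but a NaN produced by a backward function raises an error pointing at the corresponding forward op.

import torch

# Existing API: slows backward down, but errors out with the forward traceback
# of the op whose backward produced a NaN instead of propagating it silently.
with torch.autograd.detect_anomaly():
    x = torch.zeros(3, requires_grad=True)
    x.norm().backward()   # fine today: the subgradient special case avoids the NaN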

One question, though: since this would actually be enabled during training (and not only for debugging), should we make this a global flag or a thread-local one? Is there special care that needs to be taken when adding new thread-local flags?
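One existing data point (added for reference, not part of the original comment): grad mode is already a thread-local autograd flag, so disabling it in one thread does not affect work recorded on another thread.

import threading

import torch

x = torch.ones(2, requires_grad=True)

def worker():
    # Grad mode is thread local: this thread still records the graph even
    # though the main thread is inside torch.no_grad().
    print("worker:", (x * 2).sum().requires_grad)   # True

with torch.no_grad():
    t = threading.Thread(target=worker)
    t.start()
    t.join()
    print("main:", (x * 2).sum().requires_grad)     # False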

Relevant discussion: #35666

Hello,
Just to leave my thoughts here, as I am dealing with a similar situation.

I think we need some kind of mechanism to identify compound operations inside the graph.
For example, handling a special gradient in an operation that is composed of multiple small steps and whose gradient code is automatically generated would require marking in the graph which nodes belong to that operation. (I don't know if it is OK to mark this in ATen somehow, but probably ...)
Then, in the autograd derivatives, we could define different execution paths for that code: check whether the input is a special case and return a special value, skipping the marked sub-graph; otherwise, just run the automatically generated backward chain.

The API could be something like:

ATen code

Tensor a = at::add(x, y).autograd_group("group1");
Tensor b = at::mul(a, y).autograd_group("group1");

Autograd specification

- name: group1(Tensor self)
  special_case: function_that_checks_for_special_inputs(.....)

I am still not very familiar with all of these mechanics, so sorry if what I am proposing here makes no sense 😥.

I think the way we do this today is to just write a custom backward for the whole set of ops and check for special inputs in the backward function we define (a minimal sketch of this pattern follows below).
While I think your idea is interesting, would it help performance-wise compared to the current solution? (It would make things cleaner, for sure :D.) We would still have to do the special-casing for some values, right?
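A minimal sketch of that pattern, using a hand-rolled norm as the composite op: the custom backward checks for the zero input and returns the subgradient explicitly, roughly what the built-in special case does.

import torch

class SafeNorm(torch.autograd.Function):
    # Custom backward for the whole op: check for the special input explicitly.

    @staticmethod
    def forward(ctx, x):
        out = x.norm()
        ctx.save_for_backward(x, out)
        return out

    @staticmethod
    def backward(ctx, grad_output):
        x, out = ctx.saved_tensors
        if out == 0:
            return torch.zeros_like(x)    # subgradient at the non-differentiable point
        return grad_output * x / out      # regular formula everywhere else

x = torch.zeros(3, requires_grad=True)
SafeNorm.apply(x).backward()
print(x.grad)  # tensor([0., 0., 0.])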

I think performance would not change at all.
In some cases, memoization strategies could easily be applied to some sub-graphs, resulting in performance benefits. We could even use nested sub-graphs inside a single compound operation and apply these kinds of optimizations there.

The special case should be controlled by a single function that can be defined either in ATen or in the Functions.cpp templates.
This would avoid rewriting complex gradient calculations for long chains in the graph, making the code more maintainable and simpler to understand.

Maybe I am just creating a problem where there wasn't one before 😅

I think the solution we have is "ok" for now, given that this is not a widespread issue. And for these complex functions we are usually happy to write a custom backward anyway, because it is much more memory-efficient and runs faster.

The question here would be how to provide an escape hatch for people who don't want to pay the price for these special cases when they know they won't hit them.

I agree with @emcastillo that we should avoid writing custom backwards for parts of composite functions where possible. Take cdist, which started this issue, for example:

// Pairwise Euclidean distances via the expansion ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2:
// the matmul of the padded matrices produces the squared distances in one shot.
Tensor euclidean_dist_out(const Tensor& x1, const Tensor& x2) {
  Tensor x1_norm = x1.pow(2).sum(-1, true);
  Tensor x1_pad = at::ones_like(x1_norm, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
  Tensor x2_norm = x2.pow(2).sum(-1, true);
  Tensor x2_pad = at::ones_like(x2_norm, LEGACY_CONTIGUOUS_MEMORY_FORMAT);
  Tensor x1_ = at::cat({x1.mul(-2), x1_norm, x1_pad}, -1);
  Tensor x2_ = at::cat({x2, x2_pad, x2_norm}, -1);
  Tensor result = x1_.matmul(x2_.transpose(-2, -1));
  // The non-differentiable part: clamp to 0 and take the square root.
  result.clamp_min_(0).sqrt_();
  return result;
}

Most of it is fine and can be left to autograd; the problematic part is the clamp and sqrt, and only the backward for that part should be customized (a sketch follows below).
And yes, as this issue proposes, there should be an escape hatch for people who want to live dangerously (but a customized backward can provide it).
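A hedged sketch of what "customize only that part" could mean, written in Python for brevity (the class name is made up): only the clamp + sqrt tail gets a hand-written backward that zeroes the gradient where the distance is 0, while the cat/matmul part above stays on regular autograd.

import torch

class ClampedSqrt(torch.autograd.Function):
    # sqrt(clamp_min(x, 0)) with a custom backward: where the output is 0 the
    # true derivative 1 / (2 * sqrt(x)) blows up, so return 0 there instead.

    @staticmethod
    def forward(ctx, x):
        out = x.clamp_min(0).sqrt()
        ctx.save_for_backward(out)
        return out

    @staticmethod
    def backward(ctx, grad_output):
        (out,) = ctx.saved_tensors
        grad = grad_output / (2 * out)                        # inf/NaN where out == 0
        return torch.where(out == 0, torch.zeros_like(grad), grad)

An escape hatch could then be a variant that skips the torch.where and returns the raw grad_output / (2 * out).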

More special cases: #31829 #37323 #40336 (comment)

And already mentioned request: #35666 (comment)