NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

`allocation order inference` should be a hint rather than a requirement.

jjsjann123 opened this issue · comments

proposal

AllocationDomainPass modifies the allocation domain in the fusion IR directly to reflect the optimized stride order for outputs, which turns the allocation domain into a requirement that schedulers have to follow.

It's important to distinguish which layouts are requirements of the user program that we must respect (i.e., a computation definition that explicitly asks for an output in a certain memory layout) from what is an optimization by the codegen system, like allocation order inference, which we could revert in a later phase.

The proposal here is to treat any allocation domain on output TVs as just an optimization hint. For computation-defined requirements, we put a separate stride order map in Fusion, similar to how we specify output aliases:

  // Returns the computation-defined stride order for an output, if any.
  std::optional<std::vector<int64_t>> getOutputStrideOrder(const TensorView* tv) const;
  // Records a computation-defined stride order for an output.
  void setOutputStrideOrder(const TensorView* tv, const std::vector<int64_t>& stride_order);

  // Ground truth: map from output TensorView to its required stride order.
  std::unordered_map<const TensorView*, std::vector<int64_t>> output_stride_order_;

Later transformations would then only need to ensure that any allocation domain mutated on outputs remains consistent across the segmented fusion and does not violate the ground truth in output_stride_order_.
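A minimal standalone sketch of what that consistency check could look like, assuming both orders are permutations of axis indices from outermost to innermost (the function and parameter names are illustrative, not existing nvFuser API):

  #include <cstdint>
  #include <optional>
  #include <vector>

  // Illustrative check mirroring the proposed output_stride_order_ map:
  // with no recorded requirement, the allocation domain is only a hint
  // and any later phase may rewrite it; with a recorded requirement,
  // the final allocation order must equal the ground truth.
  bool satisfiesRequiredOrder(
      const std::optional<std::vector<int64_t>>& required_stride_order,
      const std::vector<int64_t>& current_alloc_order) {
    if (!required_stride_order.has_value()) {
      return true;
    }
    return *required_stride_order == current_alloc_order;
  }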

issue

One issue we hit was that the pass tries to modify the output stride order for Linear, which is not yet supported by ExprEvaluator. More importantly, the allocation domain inference result isn't optimal either, since we naively map the order of IterDomains from the allocation domain of a reference tensor.

For linear, we have an input with shape [b, m, k] and a weight with shape [n, k], producing an output with shape [b, m, n].
The existing logic in the pass picks the higher-rank input's [b, m, k] as the reference, orders the mapped IDs on the output in the same order (giving b, m), and pushes unmapped IDs to the outside (giving n, b, m).

This means the pass will try to produce a permuted output from non-permuted inputs for linear, which might not be the optimal decision.
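As a concrete illustration of that mapping, here is a self-contained sketch of the naive ordering logic described above (axis labels stand in for mapped IterDomains; this is not the pass's actual implementation):

  #include <algorithm>
  #include <string>
  #include <vector>

  // Naive reference-based ordering: unmapped output IDs are pushed to
  // the outside (front), mapped IDs follow in the reference's order.
  std::vector<std::string> inferOutputOrder(
      const std::vector<std::string>& reference, // e.g. {"b", "m", "k"}
      const std::vector<std::string>& output) {  // e.g. {"b", "m", "n"}
    std::vector<std::string> result;
    for (const auto& axis : output) {
      // Axes with no counterpart in the reference go outermost, e.g. "n".
      if (std::find(reference.begin(), reference.end(), axis) ==
          reference.end()) {
        result.push_back(axis);
      }
    }
    for (const auto& axis : reference) {
      // Mapped axes keep the reference's relative order, e.g. "b", "m".
      if (std::find(output.begin(), output.end(), axis) != output.end()) {
        result.push_back(axis);
      }
    }
    return result;
  }

With reference {"b", "m", "k"} and output {"b", "m", "n"}, this returns {"n", "b", "m"}, i.e., a permuted output.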

The problem is that once we have modified the allocation domain on the output, it's impossible for the scheduler to tell whether it is just an optimization hint or a user requirement, so we cannot undo the change later.

The issue originated from the comment.

more context

We should remove all the old permutation support code in Fusion:

Fuser/csrc/fusion.h, lines 495 to 499 in bad998a:

  // See Note [ Permutation support in nvfuser ]
  // map from indices of input tensor to permutation
  PermutationMap permuted_input_map_;
  // map from indices of output tensor to permutation
  PermutationMap permuted_output_map_;

since those are obsolete TorchScript code.

I don’t really understand this issue. Could you please explain in more detail. I don’t think it’s good practice to reference a comment in an issue. Please summarize the problem, and include examples of current behavior and what behavior we’d like to see.

Sorry about the vague issue in the first place. I attempted a rewrite; hopefully it looks more concrete this time 🤞.

Sounds like what we are seeing here is that the allocation order pass, which is a preseg pass, does something that may then need to be changed by a scheduler because it's not optimal. I thought preseg passes were mostly just for straightforward optimizations that should always be applied, but it looks like it's getting more complex than that. Am I understanding correctly?

Am I understanding correctly?

Yeah you got it right.

I thought preseg passes were mostly just for straightforward optimization that should always be applied, but looks like it's getting more complex than that

That could be true. But then we just don't have a good place for the allocation order pass. Arguably, the allocation order pass should have done a better job and not produced a non-optimal allocation domain, but I don't know how realistic that expectation is.


I'm more interested in what would happen if this inference is done entirely by the schedulers. Presumably, that would solve the issue, right? Why should it be done as a preseg pass?


Re: this inference being done entirely by the schedulers. We had discussed that possibility before and agreed it's something worth exploring. Even going in that direction, I think we would still want something similar to the change proposed in this issue.

Scheduler-run allocation domain inference is going to need work at segmentation time: if a scheduler takes a segment and decides to change the allocation domain on boundary tensors (be it an input or an output), that change needs to be propagated and communicated to the corresponding consumer (or producer) segments. As in the existing problem, we'll need to be able to tell the difference between a scheduler-imposed allocation domain and a computation-defined one, so that when different schedulers make different decisions about an intermediate tensor, we know which TensorView's allocation domain can be altered.
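One possible way to record that distinction at segmentation time is sketched below; the LayoutSource tag and tryUpdateLayout helper are hypothetical bookkeeping, not existing nvFuser code:

  #include <cstdint>
  #include <unordered_map>
  #include <vector>

  // Hypothetical provenance tag for a segment-boundary tensor's layout.
  enum class LayoutSource {
    ComputationDefined, // ground truth from the user program; immutable
    SchedulerImposed,   // an optimization choice; may be renegotiated
  };

  struct BoundaryLayout {
    std::vector<int64_t> stride_order;
    LayoutSource source = LayoutSource::SchedulerImposed;
  };

  // A scheduler may overwrite hints, but never a computation-defined
  // layout; on success the caller still has to re-sync the consumer
  // (or producer) segments that share this boundary tensor.
  bool tryUpdateLayout(
      std::unordered_map<int64_t, BoundaryLayout>& boundary_layouts,
      int64_t tensor_id,
      const std::vector<int64_t>& new_order) {
    auto it = boundary_layouts.find(tensor_id);
    if (it != boundary_layouts.end() &&
        it->second.source == LayoutSource::ComputationDefined) {
      return it->second.stride_order == new_order;
    }
    boundary_layouts[tensor_id] = {new_order, LayoutSource::SchedulerImposed};
    return true;
  }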