NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

allocation order propagation for matmul/linear

jjsjann123 opened this issue · comments

This issue was raised by @jacobhinkle.

We would like allocation order inference to populate a proper allocation domain for the inputs to matmul/linear ops.

i.e.

```python
tv0 = fusion.define_tensor(...)
tv1 = fusion.define_tensor(...)
# magic operations that produce `tv0_derived` and `tv1_derived`

tv_out = fusion.ops.matmul(tv0_derived, tv1_derived)
# ...
```

With a vanilla fusion, tv0_derived and tv1_derived will have an empty allocation domain. This is not ideal, especially if tv0 and tv1 come in with a non-trivial allocation domain.

The ask here is:

  1. We would want allocation order inference to infer the allocation order of tv0_derived and tv1_derived and populate it properly from their producers.
  2. The targets of the propagation are recognized simply as the inputs to matmul/linear ops (or via whatever other pattern matching we want to apply).
  3. We do NOT need to populate the allocation order for tv_out; that is better left to the scheduler.
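As a rough sketch of what such a pass could do, here is a toy model in plain Python (all names here are hypothetical; the real nvFuser TensorView/Expr classes differ). An allocation order is modeled as a permutation of logical axes, and the pass copies a producer's allocation order onto any matmul/linear input whose allocation domain is empty, while deliberately leaving the matmul output untouched:

```python
# Toy IR, NOT the nvFuser classes: None stands for an empty allocation domain.

class TensorView:
    def __init__(self, name, alloc_order=None):
        self.name = name
        self.alloc_order = alloc_order  # permutation of logical axes, or None
        self.definition = None          # Op that produced this tv, if any

class Op:
    def __init__(self, kind, inputs, output):
        self.kind, self.inputs, self.output = kind, inputs, output
        output.definition = self

def propagate_alloc_order(ops):
    for op in ops:
        if op.kind not in ("matmul", "linear"):
            continue  # item 2: only matmul/linear inputs are targets
        for tv in op.inputs:
            # walk up the (unary, for simplicity) producer chain until a
            # set allocation order is found
            producer = tv.definition
            while tv.alloc_order is None and producer is not None:
                src = producer.inputs[0]
                tv.alloc_order = src.alloc_order
                producer = src.definition
        # item 3: op.output is intentionally not populated here

# fusion input with a non-trivial allocation order (e.g. axis 1 innermost)
tv0 = TensorView("tv0", alloc_order=(0, 2, 1))
tv1 = TensorView("tv1")  # input with an empty allocation domain

tv0_derived = TensorView("tv0_derived")
tv1_derived = TensorView("tv1_derived")
tv_out = TensorView("tv_out")

ops = [
    Op("cast", [tv0], tv0_derived),  # the "magic operations" above
    Op("cast", [tv1], tv1_derived),
    Op("matmul", [tv0_derived, tv1_derived], tv_out),
]
propagate_alloc_order(ops)
```

After the pass, tv0_derived picks up tv0's non-trivial order, tv1_derived stays empty because its producer chain has no order to offer, and tv_out is untouched per item 3.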

Since the scheduler is free to determine some output stride orders, does that mean we cannot really fully propagate it before segmentation? What if this was done during segmentation instead: when we compute heuristics, we could also query the output allocation domains. If we do that in topological order, the proper allocation domain would be available when computing heuristics and during scheduling.

> Since the scheduler is free to determine some output stride orders, does that mean we cannot really fully propagate it before segmentation?

The challenge here is to: 1. identify the boundary of each segment before segmentation has happened; 2. know how each segment's I/O tensors would be mutated into a different memory format by its scheduler.

> What if this was done during segmentation instead: when we compute heuristics, we could also query the output allocation domains. If we do that in topological order, the proper allocation domain would be available when computing heuristics and during scheduling.

IIUC, this is suggesting that each scheduler's canSchedule would also consider updating an empty allocation domain on its output TensorView and properly handing it to the next segment? Yes, that would be good to have as well.
With that said, having a global pass to coordinate across the fusion segments also seems reasonable.
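The segmentation-time alternative could look roughly like this. This is a hypothetical sketch in plain Python, not the real segmenter/scheduler interface: segments are visited in topological order, each segment's (mock) scheduler decides its output allocation order, and that decision is immediately visible to downstream segments' heuristics:

```python
# Toy names, NOT the nvFuser segmenter API: walk segments in topological
# order so each segment's heuristics can query the allocation orders
# already decided for its producers.

def schedule_in_topo_order(segments, input_orders):
    """segments: list of (name, input_names, output_name) in topo order.
    input_orders: allocation orders of fusion inputs, e.g. {"tv0": (1, 0)}."""
    known = dict(input_orders)  # allocation orders decided so far
    decisions = {}
    for name, ins, out in segments:
        producer_orders = [known.get(i) for i in ins]
        # mock scheduler decision: reuse the first known producer layout,
        # falling back to the default row-major layout (None) otherwise
        chosen = next((o for o in producer_orders if o is not None), None)
        known[out] = chosen  # now visible to downstream segments
        decisions[name] = chosen
    return decisions

# two chained segments: seg1's heuristics see the layout seg0 chose
decisions = schedule_in_topo_order(
    [("seg0", ["tv0"], "tv1"), ("seg1", ["tv1"], "tv2")],
    {"tv0": (1, 0)},
)
```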

Question for @jacobhinkle: is the ask above what you were expecting from allocation order inference for now?

> The ask here is:
>
>   1. We would want allocation order inference to infer the allocation order of tv0_derived and tv1_derived and populate it properly from their producers.
>   2. The targets of the propagation are recognized simply as the inputs to matmul/linear ops (or via whatever other pattern matching we want to apply).
>   3. We do NOT need to populate the allocation order for tv_out; that is better left to the scheduler.

> IIUC, this is suggesting that each scheduler's canSchedule would also consider updating an empty allocation domain on its output TensorView and properly handing it to the next segment?

Something like that, yes. For example in #2169 we might want to temporarily disallow matmul segments with a bias whose stride order does not match the output's. At minimum though, we'd want to have this available during proposeHeuristics and SchedulerEntry::makeEntry, which happens after segmentation is done and runtime order is determined. That way we'd be able to reliably infer the layout of matmuls based on input strides.
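A minimal sketch of that canSchedule-style rejection, in plain Python with hypothetical names (stride orders modeled as permutations of logical axes, None standing for a default row-major layout):

```python
# Hypothetical sketch of the temporary restriction discussed for #2169:
# refuse to schedule a matmul segment whose bias stride order does not
# match the output's.

def can_schedule_matmul_epilogue(bias_order, output_order):
    # an unset order (None) only matches another unset order, i.e. both
    # tensors falling back to the default row-major layout
    if bias_order is None or output_order is None:
        return bias_order == output_order
    return tuple(bias_order) == tuple(output_order)

# accepted: bias and output share the same stride order
ok = can_schedule_matmul_epilogue((0, 1), (0, 1))
# rejected: column-major bias against a row-major output
bad = can_schedule_matmul_epilogue((1, 0), (0, 1))
```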