NVIDIA / Fuser

A Fusion Code Generator for NVIDIA GPUs (commonly known as "nvFuser")

allocation order propagation for matmul/linear

jjsjann123 opened this issue · comments

This issue was raised by @jacobhinkle.

We would like allocation order inference to populate a proper allocation domain for the inputs to matmul/linear ops.

i.e.

```python
tv0 = fusion.define_tensor(...)
tv1 = fusion.define_tensor(...)
# magic operations that produce `tv0_derived` and `tv1_derived`

tv_out = fusion.ops.matmul(tv0_derived, tv1_derived)
# ...
```

With a vanilla fusion, tv0_derived and tv1_derived will have an empty allocation domain. This is not ideal, especially if tv0 and tv1 come in with a non-trivial allocation domain.

The ask here is:

  1. We would want allocation order inference to infer the allocation order of tv0_derived and tv1_derived and populate it properly from their producers.
  2. The targets of the propagation are recognized simply as the inputs to matmul/linear ops (or via whatever other pattern matching we want to apply).
  3. We do NOT need to populate the allocation order for tv_out; that is better left to the scheduler.
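As a rough sketch of what such a pass could do, here is a toy model in plain Python (all names here are hypothetical; the real nvFuser TensorView/Expr classes differ). An allocation order is modeled as a permutation of logical axes, and the pass copies a producer's allocation order onto any matmul/linear input whose allocation domain is empty, while deliberately leaving the matmul output untouched:

```python
# Toy IR, NOT the nvFuser classes: None stands for an empty allocation domain.

class TensorView:
    def __init__(self, name, alloc_order=None):
        self.name = name
        self.alloc_order = alloc_order  # permutation of logical axes, or None
        self.definition = None          # Op that produced this tv, if any

class Op:
    def __init__(self, kind, inputs, output):
        self.kind, self.inputs, self.output = kind, inputs, output
        output.definition = self

def propagate_alloc_order(ops):
    for op in ops:
        if op.kind not in ("matmul", "linear"):
            continue  # item 2: only matmul/linear inputs are targets
        for tv in op.inputs:
            # walk up the (unary, for simplicity) producer chain until a
            # set allocation order is found
            producer = tv.definition
            while tv.alloc_order is None and producer is not None:
                src = producer.inputs[0]
                tv.alloc_order = src.alloc_order
                producer = src.definition
        # item 3: op.output is intentionally not populated here

# fusion input with a non-trivial allocation order (e.g. axis 1 innermost)
tv0 = TensorView("tv0", alloc_order=(0, 2, 1))
tv1 = TensorView("tv1")  # input with an empty allocation domain

tv0_derived = TensorView("tv0_derived")
tv1_derived = TensorView("tv1_derived")
tv_out = TensorView("tv_out")

ops = [
    Op("cast", [tv0], tv0_derived),  # the "magic operations" above
    Op("cast", [tv1], tv1_derived),
    Op("matmul", [tv0_derived, tv1_derived], tv_out),
]
propagate_alloc_order(ops)
```

After the pass, tv0_derived picks up tv0's non-trivial order, tv1_derived stays empty because its producer chain has no order to offer, and tv_out is untouched per item 3.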

Since the scheduler is free to determine some output stride orders, does that mean we cannot really fully propagate it before segmentation? What if this was done during segmentation instead: when we compute heuristics, we could also query the output allocation domains. If we do that in topological order, the proper allocation domain would be available when computing heuristics and during scheduling.

> Since the scheduler is free to determine some output stride orders, does that mean we cannot really fully propagate it before segmentation?

The challenge here is to: 1. identify the boundary of each segment before segmentation has happened; 2. know how each segment's I/O tensors would be mutated into a different memory format by its scheduler.

> What if this was done during segmentation instead: when we compute heuristics, we could also query the output allocation domains. If we do that in topological order, the proper allocation domain would be available when computing heuristics and during scheduling.

IIUC, this is suggesting that each scheduler's canSchedule would also consider updating an empty allocation domain on its output TensorView and properly handing it to the next segment? Yes, that would be good to have as well.
With that said, having a global pass to coordinate across the fusion segments also seems reasonable.
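The segmentation-time alternative could look roughly like this. This is a hypothetical sketch in plain Python, not the real segmenter/scheduler interface: segments are visited in topological order, each segment's (mock) scheduler decides its output allocation order, and that decision is immediately visible to downstream segments' heuristics:

```python
# Toy names, NOT the nvFuser segmenter API: walk segments in topological
# order so each segment's heuristics can query the allocation orders
# already decided for its producers.

def schedule_in_topo_order(segments, input_orders):
    """segments: list of (name, input_names, output_name) in topo order.
    input_orders: allocation orders of fusion inputs, e.g. {"tv0": (1, 0)}."""
    known = dict(input_orders)  # allocation orders decided so far
    decisions = {}
    for name, ins, out in segments:
        producer_orders = [known.get(i) for i in ins]
        # mock scheduler decision: reuse the first known producer layout,
        # falling back to the default row-major layout (None) otherwise
        chosen = next((o for o in producer_orders if o is not None), None)
        known[out] = chosen  # now visible to downstream segments
        decisions[name] = chosen
    return decisions

# two chained segments: seg1's heuristics see the layout seg0 chose
decisions = schedule_in_topo_order(
    [("seg0", ["tv0"], "tv1"), ("seg1", ["tv1"], "tv2")],
    {"tv0": (1, 0)},
)
```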

Question for @jacobhinkle: is the ask above what you were expecting from allocation order inference for now?

> The ask here is:
>
>   1. We would want allocation order inference to infer the allocation order of tv0_derived and tv1_derived and populate it properly from their producers.
>   2. The targets of the propagation are recognized simply as the inputs to matmul/linear ops (or via whatever other pattern matching we want to apply).
>   3. We do NOT need to populate the allocation order for tv_out; that is better left to the scheduler.

> IIUC, this is suggesting that each scheduler's canSchedule would also consider updating an empty allocation domain on its output TensorView and properly handing it to the next segment?

Something like that, yes. For example in #2169 we might want to temporarily disallow matmul segments with a bias whose stride order does not match the output's. At minimum though, we'd want to have this available during proposeHeuristics and SchedulerEntry::makeEntry, which happens after segmentation is done and runtime order is determined. That way we'd be able to reliably infer the layout of matmuls based on input strides.
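A minimal sketch of that canSchedule-style rejection, in plain Python with hypothetical names (stride orders modeled as permutations of logical axes, None standing for a default row-major layout):

```python
# Hypothetical sketch of the temporary restriction discussed for #2169:
# refuse to schedule a matmul segment whose bias stride order does not
# match the output's.

def can_schedule_matmul_epilogue(bias_order, output_order):
    # an unset order (None) only matches another unset order, i.e. both
    # tensors falling back to the default row-major layout
    if bias_order is None or output_order is None:
        return bias_order == output_order
    return tuple(bias_order) == tuple(output_order)

# accepted: bias and output share the same stride order
ok = can_schedule_matmul_epilogue((0, 1), (0, 1))
# rejected: column-major bias against a row-major output
bad = can_schedule_matmul_epilogue((1, 0), (0, 1))
```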