Transpose op doesn't support dynamic shape.
vanbasten23 opened this issue · comments
🐛 Bug
I am running a backward-pass NN model test with dynamic input on my TPU VM docker container (backend: TPU), and it fails with:
======================================================================
ERROR: test_backward_pass_with_dynamic_input (__main__.TestDynamicShapeModels)
----------------------------------------------------------------------
Traceback (most recent call last):
File "pytorch/xla/test/test_dynamic_shape_models.py", line 103, in test_backward_pass_with_dynamic_input
loss.backward()
File "/home/ptxla/.local/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/home/ptxla/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: /workspaces/work/pytorch/xla/torch_xla/csrc/helpers.cpp:278 : Check failed: out_size <= size_at_dyndim / input_shape.dimensions( input_dynamic_dimension) (2 vs. 1)
*** Begin stack trace ***
tsl::CurrentStackTrace[abi:cxx11]()
torch_xla::XlaHelpers::GetDynamicReshapeInfo(xla::Shape const&, absl::lts_20220623::Span<long const>)
torch_xla::XlaHelpers::GetDynamicReshape(xla::Shape const&, absl::lts_20220623::Span<long const>)
torch_xla::Permute::MakePermuteShape(xla::Shape const&, absl::lts_20220623::Span<long const>)
torch_xla::ViewInfo::ViewInfo(torch_xla::ViewInfo::Type, xla::Shape, std::vector<long, std::allocator<long> >)
torch_xla::tensor_methods::transpose(c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > const&, long, long)
torch_xla::XLANativeFunctions::t(at::Tensor const&)
at::_ops::t::redispatch(c10::DispatchKeySet, at::Tensor const&)
at::_ops::t::redispatch(c10::DispatchKeySet, at::Tensor const&)
at::_ops::t::call(at::Tensor const&)
torch::autograd::generated::AddmmBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&)
torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&)
torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&)
torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)
torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)
clone
*** End stack trace ***
Unable to map dynamic dimension of shape f32[<=10,2]{1,0} to output sizes (2, 10)
Investigation
I don't think `XLANativeFunctions::t` propagates dynamism properly. When we compute the output shape via `XlaHelpers::Permute(permutation, source_shape.dimensions())`, `source_shape.dimensions()` only returns the upper bounds, and `XlaHelpers::Permute` fails to take the dynamic dimensions (available via `source_shape.dynamic_dimensions()`) into account. The end result: my input `source_shape` prints as `f32[<=10,2]{1,0}`, but `XlaHelpers::Permute(permutation, source_shape.dimensions())` returns an output shape of `[2, 10]`, which loses the dynamism.
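To make the failure mode concrete, here is a minimal Python sketch of the behavior described above. The function and variable names are illustrative only, not the real torch_xla C++ API; it just shows that permuting the upper-bound sizes alone drops the per-dimension dynamism flags, and that the flags would need to be permuted alongside the sizes.

```python
# Sketch of the bug: permuting only the upper-bound dimensions drops the
# per-dimension dynamism flags. Names are illustrative, not the real
# torch_xla C++ API.

def permute(permutation, values):
    # Mirrors what XlaHelpers::Permute does: reorder values by permutation.
    return [values[p] for p in permutation]

# Input shape f32[<=10,2]: dim 0 is dynamic (upper bound 10), dim 1 is static.
dims = [10, 2]
dynamic = [True, False]

# Buggy path: only the sizes are permuted, so the result [2, 10] carries
# no record that the size-10 dim was dynamic.
out_dims = permute([1, 0], dims)        # [2, 10]

# Fix idea: permute the dynamic flags alongside the sizes so the output
# shape can be reconstructed as f32[2,<=10] instead of a static [2, 10].
out_dynamic = permute([1, 0], dynamic)  # [False, True]
```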
That said, I may need to change `static std::vector<typename Container::value_type> Permute(absl::Span<const int64_t> permutation, const Container& input)` to return `at::SymIntArrayRef` instead of `absl::Span<const int64_t>`, and to take an extra argument for the dynamic shape. Some other places may also need to be updated: in `xla/torch_xla/csrc/helpers.cpp` (at commit 10528bf), the `absl::Span<const int64_t> output_sizes` parameters at line 300 and at line 255 need to be of type `at::SymIntArrayRef`.
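As a rough illustration of the direction proposed above, the sketch below passes size/dynamism pairs (a stand-in for `at::SymIntArrayRef`) through the permute-shape helper instead of plain integer sizes (the `absl::Span<const int64_t>` analog), so the output shape keeps its bound annotations. All names here are hypothetical.

```python
# Hypothetical Python analog of carrying symbolic size info through the
# permute-shape helper, instead of plain ints that lose dynamism.
from dataclasses import dataclass

@dataclass
class BoundedDim:
    size: int      # upper bound of the dimension
    dynamic: bool  # whether the runtime size can be smaller than the bound

def make_permute_shape(dims, permutation):
    # dims: list[BoundedDim]; each permuted dim keeps its dynamism flag.
    return [dims[p] for p in permutation]

def shape_str(dims):
    # Render dims in XLA's bounded-shape notation, e.g. [<=10,2].
    return "[" + ",".join(
        f"<={d.size}" if d.dynamic else str(d.size) for d in dims) + "]"

src = [BoundedDim(10, True), BoundedDim(2, False)]  # f32[<=10,2]
out = make_permute_shape(src, [1, 0])
print(shape_str(out))  # [2,<=10] -- dynamism preserved through the permute
```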
To Reproduce
On TPU VM, run
export XLA_EXPERIMENTAL="nonzero:masked_select"
export XRT_TPU_CONFIG="localservice;0;localhost:51011"
python3 pytorch/xla/test/test_dynamic_shape_models.py TestDynamicShapeModels.test_backward_pass_with_dynamic_input
Expected behavior
It should not fail.
Environment
- Reproducible on XLA backend [CPU/TPU]: TPU
- torch_xla version: nightly.
Additional context
Hi @JackCaoG, if the above investigation makes sense, I wonder where functionalization comes into the picture and helps with this transpose op.
Have you tried to run a unit test for the `t` op to confirm?
The analysis above seems coherent to me too, but I'm not sure if/why `t` works in the forward pass. If you can write a test that works/fails and post more detail, that would be helpful.
Well, `t` doesn't have an explicit derivatives.yaml definition, so it could be that the backward pass hits a different operator that doesn't support dynamic shapes. I can't easily check this right now, but with logging you should be able to tell.
Btw, for enablement, I highly recommend making it an error to ask for a non-symbolic size of a symbolic tensor. That makes it very easy to diagnose these problems.
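The recommendation above can be sketched as a guarded size accessor: asking for a concrete size on a dimension known to be symbolic/dynamic raises immediately instead of silently baking in the upper bound. This is a hypothetical wrapper for illustration, not a real PyTorch or torch_xla API.

```python
# Hypothetical sketch: fail fast when a concrete size is requested for a
# symbolic/dynamic dimension, rather than silently returning the bound.
class GuardedShape:
    def __init__(self, sizes, dynamic):
        self._sizes = sizes      # per-dim upper bounds
        self._dynamic = dynamic  # per-dim dynamism flags

    def concrete_size(self, dim):
        if self._dynamic[dim]:
            raise RuntimeError(
                f"dim {dim} is symbolic; asking for its concrete size "
                "would silently bake in the upper bound")
        return self._sizes[dim]

s = GuardedShape([10, 2], [True, False])  # f32[<=10,2]
s.concrete_size(1)    # fine: the static dim returns 2
# s.concrete_size(0)  # would raise RuntimeError: dim 0 is symbolic
```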