pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)

Home Page: https://pytorch.org/xla


Transpose op doesn't support dynamic shape.

vanbasten23 opened this issue · comments

πŸ› Bug

I am running this backward-pass NN model test with dynamic input on my TPU VM docker (backend is TPU), and it fails with:

======================================================================
ERROR: test_backward_pass_with_dynamic_input (__main__.TestDynamicShapeModels)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "pytorch/xla/test/test_dynamic_shape_models.py", line 103, in test_backward_pass_with_dynamic_input
    loss.backward()
  File "/home/ptxla/.local/lib/python3.8/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/ptxla/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: /workspaces/work/pytorch/xla/torch_xla/csrc/helpers.cpp:278 : Check failed: out_size <= size_at_dyndim / input_shape.dimensions( input_dynamic_dimension) (2 vs. 1)
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        torch_xla::XlaHelpers::GetDynamicReshapeInfo(xla::Shape const&, absl::lts_20220623::Span<long const>)
        torch_xla::XlaHelpers::GetDynamicReshape(xla::Shape const&, absl::lts_20220623::Span<long const>)
        torch_xla::Permute::MakePermuteShape(xla::Shape const&, absl::lts_20220623::Span<long const>)
        torch_xla::ViewInfo::ViewInfo(torch_xla::ViewInfo::Type, xla::Shape, std::vector<long, std::allocator<long> >)
        torch_xla::tensor_methods::transpose(c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > const&, long, long)
        torch_xla::XLANativeFunctions::t(at::Tensor const&)

        at::_ops::t::redispatch(c10::DispatchKeySet, at::Tensor const&)

        at::_ops::t::redispatch(c10::DispatchKeySet, at::Tensor const&)

        at::_ops::t::call(at::Tensor const&)

        torch::autograd::generated::AddmmBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&)

        torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&)
        torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&)
        torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)
        torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)


        clone
*** End stack trace ***
Unable to map dynamic dimension of shape f32[<=10,2]{1,0} to output sizes (2, 10)

Investigation

I don't think XLANativeFunctions::t propagates dynamism properly. When we compute the output shape here

XlaHelpers::Permute(permutation, source_shape.dimensions())

source_shape.dimensions() only returns the upper bounds, and XlaHelpers::Permute fails to consider the dynamic dimensions exposed via source_shape.dynamic_dimensions(). The end result: my input source_shape prints as f32[<=10,2]{1,0}, but XlaHelpers::Permute(permutation, source_shape.dimensions()) returns the output shape [2, 10], which loses the dynamism.
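To illustrate the failure mode, here is a minimal Python sketch (the names `permute`, `dims`, and `dynamic` are illustrative stand-ins; the real code is C++ in torch_xla/csrc/helpers.cpp). If the permutation is applied only to the upper-bound sizes and the dynamic-dimension flags are dropped, the output shape no longer records which dimension was bounded:

```python
def permute(permutation, values):
    # Models XlaHelpers::Permute: reorder a list by a permutation.
    return [values[p] for p in permutation]

# Input shape f32[<=10,2]: dim 0 is dynamic with upper bound 10.
dims = [10, 2]
dynamic = [True, False]

permutation = [1, 0]  # transpose

# Current behavior: only the upper bounds are permuted, so the
# result [2, 10] carries no record of the dynamic dimension.
out_dims = permute(permutation, dims)
assert out_dims == [2, 10]

# What the fix needs: permute the dynamic flags with the same
# permutation, so dim 1 of the output is known to be <=10.
out_dynamic = permute(permutation, dynamic)
assert out_dynamic == [False, True]
```

This matches the error message above: the dynamic dimension of f32[<=10,2]{1,0} cannot be mapped to the static output sizes (2, 10).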

That said, I may need to change static std::vector<typename Container::value_type> Permute(absl::Span<const int64_t> permutation, const Container& input) to take at::SymIntArrayRef instead of absl::Span<const int64_t>, plus an extra argument for the dynamic shape.
Some other places may also need to be updated, such as:

  1. xla::XlaOp XlaHelpers::DynamicReshape(xla::XlaOp input, …): the argument absl::Span<const int64_t> output_sizes needs to be of type at::SymIntArrayRef.
  2. XlaHelpers::GetDynamicReshapeInfo(const xla::Shape& input_shape, …): the argument absl::Span<const int64_t> output_sizes needs to be of type at::SymIntArrayRef.
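A rough Python analogue of the proposed change (the class `SymSize` and function `permute_sym` are hypothetical stand-ins for at::SymInt and the revised Permute; the actual change would be in C++): carrying a symbolic size lets the dynamic flag travel with its dimension through the permutation instead of being lost.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SymSize:
    # Hypothetical stand-in for at::SymInt: a size that may be an
    # upper bound rather than a concrete value.
    upper_bound: int
    is_dynamic: bool = False

def permute_sym(permutation, sizes):
    # Proposed shape of Permute: operate on symbolic sizes so each
    # dynamic flag stays attached to its dimension.
    return [sizes[p] for p in permutation]

# f32[<=10,2] --transpose--> f32[2,<=10]
in_sizes = [SymSize(10, is_dynamic=True), SymSize(2)]
out_sizes = permute_sym([1, 0], in_sizes)
assert out_sizes == [SymSize(2), SymSize(10, is_dynamic=True)]
```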

To Reproduce

On TPU VM, run

export XLA_EXPERIMENTAL="nonzero:masked_select"
export XRT_TPU_CONFIG="localservice;0;localhost:51011"
python3 pytorch/xla/test/test_dynamic_shape_models.py TestDynamicShapeModels.test_backward_pass_with_dynamic_input

Expected behavior

It should not fail.

Environment

  • Reproducible on XLA backend [CPU/TPU]: TPU
  • torch_xla version: nightly.

Additional context

cc @miladm @JackCaoG

hi @JackCaoG, if the above investigation makes sense, I wonder where functionalization comes into the picture to help this transpose op.

Have you tried running a unit test for the t op to confirm?

Second question: I know we call t in the forward pass and everything works as expected. Why does the t op not work in the backward pass? Is there anything different in the upstream? I suggest we write a unit test to clarify.

CC @wconstab @ezyang

The analysis above seems coherent to me too, but I'm not sure if/why t works in forward. If you can write a test that works/fails and post more detail, that'd be helpful.

Well, t doesn't have an explicit derivatives.yaml definition, so the backward could be hitting a different operator that doesn't support dynamic shapes. I can't easily check this right now, but with logging you should be able to tell.

Btw, for enablement, I highly highly recommend making it an error when you ask for a non-symbolic size of a symbolic tensor. It makes it very easy to diagnose these problems.

Thanks Ed for the suggestion. It makes sense to me. I've created a GitHub issue to track that work separately: #4466.

#4606 fixed it.