pytorch / xla

Enabling PyTorch on XLA Devices (e.g. Google TPU)

Home Page: https://pytorch.org/xla

BWD nn test with dynamic input without sigmoid results in a new error

vanbasten23 opened this issue · comments

πŸ› Bug

BWD nn test with dynamic input without sigmoid results in a new error.
A similar model, the BWD nn test with dynamic input with sigmoid, results in an error in autograd: #4322. So I replaced the sigmoid with relu, and the new model failed with a new error (a sketch of the failing pattern follows the trace below):

Traceback (most recent call last):
  File "pytorch/xla/test/test_dynamic_shape_backward_models.py", line 82, in <module>
    train(model, loss_fn=criterion, optimizer=optimizer)
  File "pytorch/xla/test/test_dynamic_shape_backward_models.py", line 69, in train
    loss.backward()
  File "/home/ptxla/.local/lib/python3.8/site-packages/torch/_tensor.py", line 484, in backward
    torch.autograd.backward(
  File "/home/ptxla/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: torch_xla/csrc/helpers.cpp:273 : Check failed: out_size <= size_at_dyndim / input_shape.dimensions( input_dynamic_dimension) (10 vs. 1)
*** Begin stack trace ***
        tsl::CurrentStackTrace[abi:cxx11]()
        torch_xla::XlaHelpers::GetDynamicReshapeInfo(xla::Shape const&, absl::lts_20220623::Span<long const>)
        torch_xla::XlaHelpers::GetDynamicReshape(xla::Shape const&, absl::lts_20220623::Span<long const>)
        torch_xla::Permute::MakePermuteShape(xla::Shape const&, absl::lts_20220623::Span<long const>)
        torch_xla::ViewInfo::ViewInfo(torch_xla::ViewInfo::Type, xla::Shape, std::vector<long, std::allocator<long> >)
        torch_xla::tensor_methods::transpose(c10::intrusive_ptr<torch_xla::XLATensor, c10::detail::intrusive_target_default_null_type<torch_xla::XLATensor> > const&, long, long)
        torch_xla::XLANativeFunctions::t(at::Tensor const&)


        at::_ops::t::redispatch(c10::DispatchKeySet, at::Tensor const&)

        at::_ops::t::redispatch(c10::DispatchKeySet, at::Tensor const&)

        at::_ops::t::call(at::Tensor const&)

        torch::autograd::generated::AddmmBackward0::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&)

        torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&, std::shared_ptr<torch::autograd::ReadyQueue> const&)
        torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&)
        torch::autograd::Engine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)
        torch::autograd::python::PythonEngine::thread_init(int, std::shared_ptr<torch::autograd::ReadyQueue> const&, bool)


        clone
*** End stack trace ***
Unable to map dynamic dimension of shape f32[<=80,10]{1,0} to output sizes (10, 80)

Full error with print statement.
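For context, here is a minimal sketch of the failing pattern. This is not the actual test_dynamic_shape_backward_models.py; the masked_select/nonzero-based dynamic input and the sizes (80, 10) are assumptions chosen to match the shapes in the trace (f32[<=80,10] transposed to (10, 80)):

# Hypothetical sketch, NOT the real test file. Assumes dynamic shapes come
# from nonzero (enabled via XLA_EXPERIMENTAL="nonzero:masked_select" below)
# and that the sizes mirror the shapes in the error.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()

x = torch.rand(80, 10, device=device)
mask = x[:, 0] > 0.1
# Under the experimental flag, nonzero yields a tensor with an upper-bounded
# dynamic leading dimension, e.g. s64[<=80, 1].
idx = torch.nonzero(mask).squeeze(1)
x_dyn = torch.index_select(x, 0, idx)  # dynamic rows: f32[<=80, 10]

model = torch.nn.Sequential(
    torch.nn.Linear(10, 1),
    torch.nn.ReLU(),  # the sigmoid from #4322, replaced as described above
).to(device)

loss = model(x_dyn).sum()
loss.backward()  # AddmmBackward0 calls t() on the dynamic input and crashes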

To Reproduce

Run the script from the PR on a TPU VM:

export XRT_TPU_CONFIG="localservice;0;localhost:51011"
export XLA_EXPERIMENTAL="nonzero:masked_select"
python3 pytorch/xla/test/test_dynamic_shape_backward_models.py

Expected behavior

It shouldn't crash.

Environment

  • Reproducible on XLA backend [CPU/TPU]: TPU
  • torch_xla version: HEAD

Additional context

I suspect this may have something to do with the view op, but I'm not sure; it may need more digging.
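If the view-op guess is right, the trigger reduces to a single transpose. A hypothetical standalone reduction, assuming x_dyn carries the dynamic XLA shape f32[<=80, 10] as in the sketch above:

# AddmmBackward0 computes the weight gradient as mat1.t() @ grad_output, so
# it transposes the dynamic input. MakePermuteShape then calls
# GetDynamicReshapeInfo, which fails to map the dynamic dimension of
# f32[<=80,10] onto output sizes (10, 80) -- the exact check in the trace.
xt = x_dyn.t()  # same torch_xla::XLANativeFunctions::t path as above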

If I replace the activation function torch.nn.ReLU() with torch.nn.Tanh(), I get the same error as when I used Sigmoid:

Traceback (most recent call last):
  File "pytorch/xla/test/test_dynamic_shape_backward_models_tanh.py", line 80, in <module>
    train(model, loss_fn=criterion, optimizer=optimizer)
  File "pytorch/xla/test/test_dynamic_shape_backward_models_tanh.py", line 67, in train
    loss.backward()
  File "/home/ptxla/.local/lib/python3.8/site-packages/torch/_tensor.py", line 484, in backward
    torch.autograd.backward(
  File "/home/ptxla/.local/lib/python3.8/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: Function TanhBackward0 returned an invalid gradient at index 0 - got [80, 1] but expected shape compatible with [<=80, 1]
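One way to see where the bound is dropped is to compare the XLA shapes before and after the activation. The debug helper below exists in torch_xla, though the exact output format, and the dynamic tensor y_dyn of shape [<=80, 1], are assumptions:

# Hypothetical check: if Tanh does not propagate dynamism, the input should
# report f32[<=80, 1] while the output reports a static f32[80, 1] --
# matching the [80, 1] vs [<=80, 1] mismatch in the TanhBackward0 error.
import torch
import torch_xla
print(torch_xla._XLAC._get_xla_tensor_debug_info(y_dyn))
print(torch_xla._XLAC._get_xla_tensor_debug_info(torch.tanh(y_dyn)))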

What's interesting is that if I add a print statement at https://github.com/pytorch/pytorch/blob/2a37ba8e81604fda5ba78fe5ee8c8662ce0c25f3/torch/csrc/autograd/engine.cpp#L887:

std::cerr << "xw32, file=" << __FILE__ << ", line=" << __LINE__ << ", function=" << __FUNCTION__ << ": inputs=" << inputs << std::endl;

and print out the inputs, the script gives me a completely different XLA error: https://paste.googleplex.com/6313688683249664. We've seen this error in some other places. It's off topic here, but good to know about.

commented

Is it OK to assign this to you, @vanbasten23?

Update: I have modified the transpose op to support dynamism.
The remaining work for this issue is to change torch.nn.Tanh to propagate dynamism properly. The change should be similar to what I did for Sigmoid.
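A hypothetical acceptance check for that remaining work, assuming y_dyn is a dynamic tensor of shape [<=80, 1] as above (this is not a test from the repo):

# Once Tanh propagates dynamism like the fixed Sigmoid, backward should keep
# the <=80 bound and the "invalid gradient at index 0" error should go away.
y_leaf = y_dyn.detach().requires_grad_(True)
out = torch.tanh(y_leaf)
out.sum().backward()       # used to fail: got [80, 1], expected [<=80, 1]
print(y_leaf.grad.shape)   # gradient should match the dynamic input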