graphcore / poptorch

PyTorch interface for the IPU

Home page: https://docs.graphcore.ai/projects/poptorch-user-guide/en/latest/


trainingModel compile failed

Lime-Cakes opened this issue · comments

I tried porting a PyTorch model to run and train on the IPU, but I ran into the following problem. Is there a solution? It seems to be an issue with model size.

[13:23:22.126] [poptorch::python] [warning] Input tensor has requires_grad=True set. This tensor will be detached because backward pass via inputs is not supported.
/usr/local/lib/python3.8/dist-packages/diffusers/models/unet_2d_condition.py:273: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if any(s % default_overall_up_factor != 0 for s in sample.shape[-2:]):
/usr/local/lib/python3.8/dist-packages/torch/nn/functional.py:2359: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  _verify_batch_size([input.size(0) * input.size(1) // num_groups, num_groups] + list(input.size()[2:]))
/usr/local/lib/python3.8/dist-packages/diffusers/models/attention.py:465: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  tensor = tensor.reshape(batch_size, seq_len, head_size, dim // head_size)
/usr/local/lib/python3.8/dist-packages/diffusers/models/attention.py:466: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  tensor = tensor.permute(0, 2, 1, 3).reshape(batch_size * head_size, seq_len, dim // head_size)
/usr/local/lib/python3.8/dist-packages/diffusers/models/attention.py:472: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  tensor = tensor.reshape(batch_size // head_size, head_size, seq_len, dim)
/usr/local/lib/python3.8/dist-packages/diffusers/models/attention.py:473: UserWarning: __floordiv__ is deprecated, and its behavior will change in a future version of pytorch. It currently rounds toward 0 (like the 'trunc' function NOT 'floor'). This results in incorrect rounding for negative values. To keep the current behavior, use torch.div(a, b, rounding_mode='trunc'), or for actual floor division, use torch.div(a, b, rounding_mode='floor').
  tensor = tensor.permute(0, 2, 1, 3).reshape(batch_size // head_size, seq_len, dim * head_size)
/usr/local/lib/python3.8/dist-packages/diffusers/models/resnet.py:111: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.8/dist-packages/diffusers/models/resnet.py:116: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.8/dist-packages/diffusers/models/resnet.py:39: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  assert hidden_states.shape[1] == self.channels
/usr/local/lib/python3.8/dist-packages/diffusers/models/resnet.py:52: TracerWarning: Converting a tensor to a Python boolean might cause the trace to be incorrect. We can't record the data flow of Python values, so this value will be treated as a constant in the future. This means that the trace might not generalize to other inputs!
  if hidden_states.shape[0] >= 64:
[libprotobuf ERROR google/protobuf/message_lite.cc:447] onnx.ModelProto exceeded maximum protobuf size of 2GB: 3222639541
Graph compilation:   0%|                                       | 0/100 [00:00<?]2022-11-22T13:24:44.978173Z popart:devicex 449.449 W: The `debug.retainDebugInformation` engine option was implicitly set to `true`. The default will change to `false` in a future release. Set it to `true` explicitly if you want to query debug information (for example, by calling `Session::getReport`).
2022-11-22T13:24:44.983117Z popart:popart 449.449 E: Could not find loss tensor 'IdentityLoss:0' in main graph tensors

[0] popart::Ir::prepareImpl(popart::IrBundle const&, std::map<unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&, unsigned long)
[1] popart::Ir::prepare(popart::IrBundle const&, std::map<unsigned long, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::less<unsigned long>, std::allocator<std::pair<unsigned long const, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > > const&, unsigned long)
[2] popart::Session::configureFromOnnx(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, popart::DataFlow const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, popart::Optimizer const*, popart::InputShapeInfo const&, std::shared_ptr<popart::DeviceInfo>, popart::SessionOptions const&, popart::Patterns const&)
[3] popart::TrainingSession::createFromOnnxModel(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, popart::DataFlow const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, popart::Optimizer const&, std::shared_ptr<popart::DeviceInfo>, popart::InputShapeInfo const&, popart::SessionOptions const&, popart::Patterns const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)
[4] poptorch::Compiler::initSession(std::vector<poptorch::Optimizer, std::allocator<poptorch::Optimizer> > const&, char const*)
[5] poptorch::detail::LowerToPopartImpl::compile()
[6] poptorch::LowerToPopart::compile()



[13:24:45.056] [poptorch::python] [critical] poptorch.poptorch_core.Error: In poptorch/poptorch_err/include/poptorch_err/ExceptionHandling.hpp:76: 'popart_exception': Could not find loss tensor 'IdentityLoss:0' in main graph tensors
Error raised in:
  [0] popart::TrainingSession::createFromOnnxModel
  [1] Compiler::initSession
  [2] LowerToPopart::compile
  [3] compileWithTrace


Traceback (most recent call last):
  File "train-ipu.py", line 452, in <module>
    main()
  File "train-ipu.py", line 412, in main
    trainModel.compile(*datum)
  File "/usr/local/lib/python3.8/dist-packages/poptorch/_poplar_executor.py", line 752, in compile
    self._compile(in_tensors)
  File "/usr/local/lib/python3.8/dist-packages/poptorch/_impl.py", line 288, in wrapper
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/poptorch/_poplar_executor.py", line 663, in _compile
    self._executable = poptorch_core.compileWithTrace(*trace_args)
poptorch.poptorch_core.Error: In poptorch/poptorch_err/include/poptorch_err/ExceptionHandling.hpp:76: 'popart_exception': Could not find loss tensor 'IdentityLoss:0' in main graph tensors
Error raised in:
  [0] popart::TrainingSession::createFromOnnxModel
  [1] Compiler::initSession
  [2] LowerToPopart::compile
  [3] compileWithTrace

I have the loss defined as follows:

loss = F.mse_loss(noise_pred, noise, reduction="mean")
pop_loss = poptorch.identity_loss(loss, reduction="none")
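Both lines run inside the wrapped model's forward, roughly like this (a simplified sketch of my setup, not the full training script; the wrapper class and argument names are just illustrative):

```python
import torch
import torch.nn.functional as F
import poptorch


class TrainingWrapper(torch.nn.Module):
    """Computes the loss inside forward so it ends up in the traced graph."""

    def __init__(self, unet):
        super().__init__()
        self.unet = unet

    def forward(self, sample, timestep, encoder_hidden_states, noise):
        noise_pred = self.unet(sample, timestep, encoder_hidden_states).sample
        loss = F.mse_loss(noise_pred, noise, reduction="mean")
        # identity_loss marks this tensor as the loss PopTorch should optimise
        pop_loss = poptorch.identity_loss(loss, reduction="none")
        return noise_pred, pop_loss
```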

Based on this error message:

[libprotobuf ERROR google/protobuf/message_lite.cc:447] onnx.ModelProto exceeded maximum protobuf size of 2GB: 3222639541

I think the problem is that your model doesn't fit inside the ONNX protobuf representation and as a result gets truncated.
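For scale, the size in that error is roughly 3 GiB, well past protobuf's hard 2 GiB per-message cap:

```python
# Size reported by the libprotobuf error, versus protobuf's hard limit.
reported_bytes = 3222639541
protobuf_limit = 2 * 1024**3  # 2 GiB per serialized message

print(reported_bytes / 1024**3)        # ≈ 3.0 GiB
print(reported_bytes > protobuf_limit)  # True: the model cannot be serialized intact
```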

You could try to keep the weights outside the model to see if it helps:

opts = poptorch.Options()
opts._Popart.set("saveInitializersToFile", "my_weights.onnx")
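and then pass those options when you construct the training model, e.g. (the filename and the `model`/`optimizer` names here are placeholders for your own objects):

```python
import poptorch

opts = poptorch.Options()
# Store initializer (weight) data in a separate file so the ONNX proto
# itself stays under the 2 GB protobuf limit.
opts._Popart.set("saveInitializersToFile", "my_weights.onnx")

trainModel = poptorch.trainingModel(model, options=opts, optimizer=optimizer)
trainModel.compile(*datum)
```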

Thanks, that helps somewhat, but compilation still fails.