runtime error on pytorch 1.10

Question

saintazunya opened this issue 3 years ago

saintazunya commented 3 years ago

Describe the bug
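Running the Bagua benchmark example on PyTorch 1.10 fails with a RuntimeError when model.with_bagua(...) is called. The full traceback is under Additional context.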

Environment

Your operating system and version: Ubuntu 20.04 (Focal)
Your python version: 3.8
Your PyTorch version: 1.10
How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?: conda
Have you tried using latest bagua master (python3 -m pip install --pre bagua)?: yes

Reproducing

Please provide a minimal working example. This means the runnable code.

Please also write what exact commands are required to reproduce your results.

Just run Bagua example's benchmark script.

Additional context

Traceback (most recent call last):
  File "/io/bagua/bagua/examples/benchmark/synthetic_benchmark.py", line 154, in <module>
    model = model.with_bagua([optimizer], algorithm)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/distributed.py", line 396, in with_bagua
    self._bagua_init_algorithm()
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/distributed.py", line 441, in _bagua_init_algorithm
    self._bagua_broadcast_parameters()
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/distributed.py", line 213, in _bagua_broadcast_parameters
    broadcast(state, src=0)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/communication.py", line 523, in broadcast
    comm.broadcast(tensor.to_bagua_tensor().bagua_backend_tensor(), src)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/tensor.py", line 79, in to_bagua_tensor
    new_tensor = torch.Tensor(cdata=self._cdata)
RuntimeError: Creating a new Tensor subclass Tensor but the raw Tensor object is already associated to a python object of type Parameter
Killing subprocess 764

Shawn · Answer 1 · Thu Oct 28 2021 10:47:44 GMT+0800 (China Standard Time)

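It seems that PyTorch 1.10 refuses to create a new tensor from a cdata pointer when the underlying tensor is already associated with a Python object (here, a Parameter). That is why the torch.Tensor(cdata=self._cdata) call in to_bagua_tensor raises the RuntimeError above.

This will be fixed in the next release.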
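Until a fixed release is installed, a script could check the PyTorch version before calling with_bagua. The following is only a minimal sketch: the helper names are hypothetical and not part of Bagua or PyTorch, and the assumption that versions newer than 1.10 behave the same way is a guess, since only 1.10 is confirmed in this thread. The version is compared as a tuple of integers because a plain string comparison would rank "1.10" below "1.9".

```python
import re


def parse_major_minor(version: str) -> tuple:
    """Extract (major, minor) from a version string such as '1.10.0+cu113'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def may_hit_cdata_error(torch_version: str) -> bool:
    """Hypothetical helper: True for PyTorch 1.10 or newer.

    Assumption: only 1.10 is confirmed as affected in this thread;
    treating later versions as affected is a conservative guess.
    """
    return parse_major_minor(torch_version) >= (1, 10)
```

In a real script the argument would typically be torch.__version__, and a True result could be used to print a warning or to pin an older PyTorch version before the benchmark starts.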