BaguaSys / bagua

Bagua Speeds up PyTorch

Home Page: https://tutorials.baguasys.com/

Runtime error on PyTorch 1.10

saintazunya opened this issue

Describe the bug
A clear and concise description of what the bug is.

Environment

  • Your operating system and version: ubuntu focal
  • Your python version: 3.8
  • Your PyTorch version: 1.10
  • How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?: conda
  • Have you tried using latest bagua master (python3 -m pip install --pre bagua)?: yes
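The environment list above reports Python 3.8, while the paths in the traceback point at a python3.9 site-packages directory, so it may help to collect the versions from the exact interpreter that runs the benchmark. The following is a minimal sketch using only the standard library; `collect_environment` is a hypothetical helper name, and the package names `torch` and `bagua` are assumed to be the installed distribution names.

```python
import platform
import sys
from importlib import metadata


def collect_environment(packages=("torch", "bagua")):
    """Return OS, interpreter, and installed package versions as a dict."""
    info = {
        "os": platform.platform(),
        "python": platform.python_version(),
        "executable": sys.executable,
    }
    for name in packages:
        try:
            # Looks up the version recorded in the installed package metadata.
            info[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            info[name] = "not installed"
    return info


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```

Running this with the same interpreter that launches the benchmark script should show which Python and which package versions are actually in use.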

Reproducing

Please provide a minimal working example. This means the runnable code.

Please also write what exact commands are required to reproduce your results.

Just run the benchmark script from the Bagua examples (examples/benchmark/synthetic_benchmark.py).

Additional context
Add any other context about the problem here.

Traceback (most recent call last):
  File "/io/bagua/bagua/examples/benchmark/synthetic_benchmark.py", line 154, in <module>
    model = model.with_bagua([optimizer], algorithm)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/distributed.py", line 396, in with_bagua
    self._bagua_init_algorithm()
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/distributed.py", line 441, in _bagua_init_algorithm
    self._bagua_broadcast_parameters()
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/distributed.py", line 213, in _bagua_broadcast_parameters
    broadcast(state, src=0)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/communication.py", line 523, in broadcast
    comm.broadcast(tensor.to_bagua_tensor().bagua_backend_tensor(), src)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/tensor.py", line 79, in to_bagua_tensor
    new_tensor = torch.Tensor(cdata=self._cdata)
RuntimeError: Creating a new Tensor subclass Tensor but the raw Tensor object is already associated to a python object of type Parameter
Killing subprocess 764
commented

It seems that PyTorch 1.10 refuses to create a new tensor from a cdata pointer when the underlying tensor is already associated with a Python object (here, a Parameter).

This will be fixed in the next release.
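To make the error message easier to read, here is a toy model in plain Python of the rule it describes: a raw object that allows only one Python wrapper. This is only an illustration under stated assumptions, not PyTorch's or Bagua's implementation. The classes `RawTensor`, `Tensor`, and `Parameter` below are hypothetical stand-ins, not the real `torch` classes.

```python
class RawTensor:
    """Hypothetical stand-in for the underlying raw tensor object."""

    def __init__(self):
        # At most one Python wrapper may own this raw object.
        self.pyobj = None


class Tensor:
    """Hypothetical wrapper that registers itself on the raw object."""

    def __init__(self, raw):
        if raw.pyobj is not None:
            # Mirrors the wording of the error in the traceback above.
            raise RuntimeError(
                f"Creating a new Tensor subclass {type(self).__name__} "
                "but the raw Tensor object is already associated to a "
                f"python object of type {type(raw.pyobj).__name__}"
            )
        raw.pyobj = self
        self.raw = raw


class Parameter(Tensor):
    """Hypothetical subclass, playing the role of a model parameter."""


raw = RawTensor()
param = Parameter(raw)  # the first wrapper claims the raw object

try:
    Tensor(raw)  # a second wrapper for the same raw object is refused
except RuntimeError as err:
    message = str(err)
    print(message)
```

In this model the second construction fails because the raw object already has an owner, which is the situation the traceback reports for `to_bagua_tensor`.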