Runtime error on PyTorch 1.10
saintazunya opened this issue
Describe the bug
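Running the Bagua benchmark example with PyTorch 1.10 fails with a RuntimeError raised from to_bagua_tensor. The full traceback is under Additional context.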
Environment
- Your operating system and version: ubuntu focal
- Your python version: 3.8
- Your PyTorch version: 1.10
- How did you install python (e.g. apt or pyenv)? Did you use a virtualenv?: conda
- Have you tried using latest bagua master (python3 -m pip install --pre bagua)?: yes
Reproducing
Just run the benchmark script from the Bagua examples (examples/benchmark/synthetic_benchmark.py).
Additional context
Traceback (most recent call last):
  File "/io/bagua/bagua/examples/benchmark/synthetic_benchmark.py", line 154, in <module>
    model = model.with_bagua([optimizer], algorithm)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/distributed.py", line 396, in with_bagua
    self._bagua_init_algorithm()
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/distributed.py", line 441, in _bagua_init_algorithm
    self._bagua_broadcast_parameters()
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/distributed.py", line 213, in _bagua_broadcast_parameters
    broadcast(state, src=0)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/communication.py", line 523, in broadcast
    comm.broadcast(tensor.to_bagua_tensor().bagua_backend_tensor(), src)
  File "/opt/conda/envs/pytorch/lib/python3.9/site-packages/bagua/torch_api/tensor.py", line 79, in to_bagua_tensor
    new_tensor = torch.Tensor(cdata=self._cdata)
RuntimeError: Creating a new Tensor subclass Tensor but the raw Tensor object is already associated to a python object of type Parameter
Killing subprocess 764
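It seems that PyTorch 1.10 refuses to create a new tensor from a cdata pointer when the raw tensor is already associated with a Python object of another type (here, a Parameter).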
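To make the error message easier to read, here is a toy model in plain Python of the rule it describes: a raw tensor object can be associated with only one Python wrapper, and asking for a second wrapper of a different type is rejected. This is only a sketch of the behaviour reported in the traceback, not PyTorch's actual implementation. The names `RawTensor`, `wrap`, and `owner` are hypothetical, and the `Tensor` and `Parameter` classes are stand-ins, not the real `torch` classes.

```python
class RawTensor:
    """Hypothetical stand-in for the underlying raw tensor object."""

    def __init__(self):
        # The single Python object currently associated with this raw tensor.
        self.owner = None


class Tensor:
    """Stand-in for a Python-side tensor wrapper (not torch.Tensor)."""

    def __init__(self, raw):
        self.raw = raw


class Parameter(Tensor):
    """Stand-in for a Tensor subclass (not torch.nn.Parameter)."""


def wrap(cls, raw):
    """Return the wrapper of type `cls` for `raw`, refusing a second wrapper type."""
    if raw.owner is not None and type(raw.owner) is not cls:
        raise RuntimeError(
            f"Creating a new Tensor subclass {cls.__name__} but the raw Tensor "
            f"object is already associated to a python object of type "
            f"{type(raw.owner).__name__}"
        )
    if raw.owner is None:
        raw.owner = cls(raw)
    return raw.owner


raw = RawTensor()
param = wrap(Parameter, raw)  # first association: allowed

try:
    wrap(Tensor, raw)  # second wrapper of a different type: rejected
    message = None
except RuntimeError as err:
    message = str(err)

print(message)
```

In this model, `to_bagua_tensor` plays the role of the second `wrap` call: the model's parameters are already `Parameter` objects, so requesting a plain `Tensor` over the same raw object fails.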
This will be fixed in the next release.
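Until a fixed release is installed, a script could check the PyTorch version before taking the cdata path. The sketch below uses only the standard library and takes the version as a string (such as the value of `torch.__version__`). The helper names `parse_major_minor` and `rejects_cdata_rewrap` are hypothetical. Treating every version from 1.10 onward as affected is an assumption, since this report only confirms 1.10.

```python
import re


def parse_major_minor(version):
    """Extract (major, minor) as integers from a version string like '1.10.0+cu113'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def rejects_cdata_rewrap(version):
    """Guess whether this PyTorch version raises the error from this issue.

    Assumption: 1.10 is affected (per this report) and later versions
    behave the same way.
    """
    return parse_major_minor(version) >= (1, 10)


for candidate in ("1.9.1", "1.10.0+cu113"):
    print(candidate, rejects_cdata_rewrap(candidate))
```

Comparing integer tuples matters here: as plain strings, "1.10" sorts before "1.9", so a string comparison would classify 1.10 as the older release.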