lr_find() no longer works with DDP: RuntimeError: No backend type associated with device type cpu
asusdisciple opened this issue
Bug description
Calling the lr_find() method of the Tuner with the "ddp" strategy now raises the error below.
What version are you seeing the problem on?
v2.2
How to reproduce the bug
import lightning as L
from lightning.pytorch.tuner import Tuner

# call Trainer
trainer = L.Trainer(strategy="ddp", devices=[0, 1, 2, 3])  # ...
tuner = Tuner(trainer)
tuner.lr_find(model, data)
trainer.fit(model, data)
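Until this is fixed, one possible workaround (a hypothetical, untested sketch; `model` and `data` stand in for your own LightningModule and DataModule, as in the repro above) is to run lr_find with a separate single-device Trainer, and only use DDP for the actual fit:

```python
def tune_then_fit(model, data, devices=(0, 1, 2, 3)):
    """Hypothetical workaround sketch: tune on one device, fit with DDP."""
    import lightning as L
    from lightning.pytorch.tuner import Tuner

    # lr_find on a single-device Trainer avoids the DDP all_reduce path
    tune_trainer = L.Trainer(accelerator="auto", devices=1)
    Tuner(tune_trainer).lr_find(model, data)  # updates model.lr / model.learning_rate

    # a fresh Trainer with DDP for the actual training run
    fit_trainer = L.Trainer(strategy="ddp", devices=list(devices))
    fit_trainer.fit(model, data)
```

The tradeoff is that the learning-rate sweep runs on one GPU with the single-device batch size, so the suggested rate may not match the effective DDP batch size exactly.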
Error messages and logs
Traceback (most recent call last):
  File "train.py", line 115, in <module>
    trainer.fit(model, mydata)
  File "trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "trainer.py", line 987, in _run
    results = self._run_stage()
  File "trainer.py", line 1031, in _run_stage
    self._run_sanity_check()
  File "trainer.py", line 1060, in _run_sanity_check
    val_loop.run()
  File "utilities.py", line 182, in _decorator
    return loop_run(self, *args, **kwargs)
  File "evaluation_loop.py", line 142, in run
    return self.on_run_end()
  File "evaluation_loop.py", line 254, in on_run_end
    self._on_evaluation_epoch_end()
  File "evaluation_loop.py", line 336, in _on_evaluation_epoch_end
    trainer._logger_connector.on_epoch_end()
  File "logger_connector.py", line 195, in on_epoch_end
    metrics = self.metrics
  File "logger_connector.py", line 234, in metrics
    return self.trainer._results.metrics(on_step)
  File "result.py", line 483, in metrics
    value = self._get_cache(result_metric, on_step)
  File "result.py", line 447, in _get_cache
    result_metric.compute()
  File "result.py", line 289, in wrapped_func
    self._computed = compute(*args, **kwargs)
  File "result.py", line 251, in compute
    cumulated_batch_size = self.meta.sync(self.cumulated_batch_size)
  File "ddp.py", line 342, in reduce
    return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op)
  File "distributed.py", line 172, in _sync_ddp_if_available
    return _sync_ddp(result, group=group, reduce_op=reduce_op)
  File "distributed.py", line 222, in _sync_ddp
    torch.distributed.all_reduce(result, op=op, group=group, async_op=False)
  File "c10d_logger.py", line 75, in wrapper
    return func(*args, **kwargs)
  File "distributed_c10d.py", line 2219, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: No backend type associated with device type cpu
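For context, the all_reduce at the bottom of the traceback is called on a CPU tensor (cumulated_batch_size), which only works if the process group has a backend registered for CPU tensors; a NCCL-only group has none, hence the error. A minimal single-process sketch (the Gloo backend, address, and world size here are illustrative assumptions, not Lightning's actual setup) showing the same call succeeding with a CPU-capable backend:

```python
import os
import torch
import torch.distributed as dist

# Gloo handles CPU tensors, so this single-process group can reduce them;
# an NCCL-only group would raise "No backend type associated with device type cpu".
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29511")
dist.init_process_group("gloo", rank=0, world_size=1)

t = torch.tensor([4.0])  # a CPU tensor, like the batch-size counter above
dist.all_reduce(t, op=dist.ReduceOp.SUM)  # succeeds under Gloo
print(t.item())  # with world_size 1, the sum equals the input
dist.destroy_process_group()
```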
Environment
Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):
More info
No response