Lightning-AI / pytorch-lightning

Pretrain, finetune and deploy AI models on multiple GPUs, TPUs with zero code changes.

Home Page: https://lightning.ai


lr_find() does not work in DDP anymore, RuntimeError: No backend type associated with device type cpu

asusdisciple opened this issue

Bug description

Calling the lr_find() method in a DDP setting now raises the error below.

What version are you seeing the problem on?

v2.2

How to reproduce the bug

import lightning as L
from lightning.pytorch.tuner import Tuner

# call Trainer (model and data are your LightningModule and DataModule)
trainer = L.Trainer(strategy="ddp", devices=[0, 1, 2, 3])  # ...
tuner = Tuner(trainer)
tuner.lr_find(model, data)
trainer.fit(model, data)
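
A possible workaround (not from this issue; a hedged sketch that assumes your LightningModule reads a learning_rate attribute) is to run lr_find with a separate single-device Trainer first and reuse the suggested rate for the DDP run:

import lightning as L
from lightning.pytorch.tuner import Tuner

# Tune on a single device, where lr_find still works.
tune_trainer = L.Trainer(accelerator="gpu", devices=1)
tuner = Tuner(tune_trainer)
lr_finder = tuner.lr_find(model, data)
model.learning_rate = lr_finder.suggestion()  # hypothetical attribute; adapt to your module

# Then fit with the full DDP setup.
trainer = L.Trainer(strategy="ddp", devices=[0, 1, 2, 3])
trainer.fit(model, data)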

Error messages and logs

train.py 115 <module>
trainer.fit(model, mydata)

trainer.py 544 fit
call._call_and_handle_interrupt(

call.py 43 _call_and_handle_interrupt
return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)

subprocess_script.py 105 launch
return function(*args, **kwargs)

trainer.py 580 _fit_impl
self._run(model, ckpt_path=ckpt_path)

trainer.py 987 _run
results = self._run_stage()

trainer.py 1031 _run_stage
self._run_sanity_check()

trainer.py 1060 _run_sanity_check
val_loop.run()

utilities.py 182 _decorator
return loop_run(self, *args, **kwargs)

evaluation_loop.py 142 run
return self.on_run_end()

evaluation_loop.py 254 on_run_end
self._on_evaluation_epoch_end()

evaluation_loop.py 336 _on_evaluation_epoch_end
trainer._logger_connector.on_epoch_end()

logger_connector.py 195 on_epoch_end
metrics = self.metrics

logger_connector.py 234 metrics
return self.trainer._results.metrics(on_step)

result.py 483 metrics
value = self._get_cache(result_metric, on_step)

result.py 447 _get_cache
result_metric.compute()

result.py 289 wrapped_func
self._computed = compute(*args, **kwargs)

result.py 251 compute
cumulated_batch_size = self.meta.sync(self.cumulated_batch_size)

ddp.py 342 reduce
return _sync_ddp_if_available(tensor, group, reduce_op=reduce_op)

distributed.py 172 _sync_ddp_if_available
return _sync_ddp(result, group=group, reduce_op=reduce_op)

distributed.py 222 _sync_ddp
torch.distributed.all_reduce(result, op=op, group=group, async_op=False)

c10d_logger.py 75 wrapper
return func(*args, **kwargs)

distributed_c10d.py 2219 all_reduce
work = group.allreduce([tensor], opts)

RuntimeError: No backend type associated with device type cpu
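
For context, the failing call in the traceback is a torch.distributed.all_reduce on a CPU tensor (the cumulated_batch_size kept by the logger connector) inside a process group whose backend has no CPU support. A minimal sketch of that failure mode, independent of Lightning (the launch command and NCCL-only backend are assumptions):

import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=1 repro.py
dist.init_process_group(backend="nccl")  # NCCL supports CUDA tensors only

t = torch.ones(1)  # CPU tensor
# Raises "No backend type associated with device type cpu" because
# the NCCL-only group cannot reduce a CPU tensor.
dist.all_reduce(t)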

Environment

Current environment
#- Lightning Component (e.g. Trainer, LightningModule, LightningApp, LightningWork, LightningFlow):
#- PyTorch Lightning Version (e.g., 1.5.0):
#- Lightning App Version (e.g., 0.5.2):
#- PyTorch Version (e.g., 2.0):
#- Python version (e.g., 3.9):
#- OS (e.g., Linux):
#- CUDA/cuDNN version:
#- GPU models and configuration:
#- How you installed Lightning(`conda`, `pip`, source):
#- Running environment of LightningApp (e.g. local, cloud):

More info

No response

This was fixed by #19814.
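
If you are still hitting this on v2.2, upgrading to a Lightning release that includes that patch (for example via pip install -U lightning) should let the snippet above run without the error.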