TylerYep / torchinfo

View model summaries in PyTorch!


Error with model parallelism

clessig opened this issue

Hi,

I recently implemented model parallelism. With it, however, the code fails in

torchinfo.summary( self.model, input_data=[batch_data])

with "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! ". Running the model with test() and train() works without problem.

From the behavior, it seems that torchinfo moves the network back to a single device behind the scenes? Is this the case? Is there a way to use torchinfo with model parallelism?

Thanks!

Yes, currently the input_tensor (or auto-generated tensor from input_shape) as well as the model are moved to whatever device the model is on, unless a device is given. However, this works by finding the device of the first parameter and then moving everything to it; fixing this issue will require a bit of a redesign.
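For context, the device inference described above roughly works like this (a simplified sketch for illustration, not the actual torchinfo source):

```python
import torch
import torch.nn as nn

def infer_device(model: nn.Module) -> torch.device:
    # Simplified sketch: take the device of the first parameter found.
    # With model parallelism, later layers can live on other devices,
    # so moving the input (and model) to this single device causes the
    # cross-device mismatch reported above.
    return next(model.parameters()).device
```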

I'll look into it. Could you provide a test-case? I'm not very familiar with doing model-parallelism on PyTorch, so that would really help :)

Hi Sebastian,

Thanks for the quick reply. To begin with, a clearer error message would be nice; it took me an hour to track down that torchinfo might not support model parallelism.

I am a bit time-strapped at the moment, but any test case I could produce wouldn't look different from the official PyTorch model parallelism toy example: https://pytorch.org/tutorials/intermediate/model_parallel_tutorial.html. So if you take that example and add torchinfo, you should be able to reproduce the issue (see the sketch below).
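A minimal sketch of such a repro, adapted from the linked tutorial (assumes two CUDA devices are available; the exact model is illustrative):

```python
import torch
import torch.nn as nn
import torchinfo

# Toy model split across two GPUs, as in the PyTorch model-parallel tutorial.
class ToyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = nn.Linear(10, 10).to('cuda:0')
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5).to('cuda:1')

    def forward(self, x):
        x = self.relu(self.net1(x.to('cuda:0')))
        return self.net2(x.to('cuda:1'))

model = ToyModel()
batch = torch.randn(20, 10)

# A plain forward pass works fine across devices...
out = model(batch)

# ...but summarizing the model raises the device-mismatch RuntimeError
# described above (prior to the fix).
torchinfo.summary(model, input_data=[batch])
```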

Best,
Christian

This has been fixed and should go out in v1.8.0.