pytorch / examples

A set of examples around pytorch in Vision, Text, Reinforcement Learning, etc.

Home Page: https://pytorch.org/examples

torch.multiprocessing subprocess receives tensor with zeros rather than actual data

dfarhi opened this issue

Context

torch.multiprocessing does not appear to send tensor data to spawned processes on my setup.

  • PyTorch version:
torch==1.11.0+cu113
torchaudio==0.11.0+cu113
torchvision==0.12.0+cu113
  • Operating System and version: Windows 10 version 21H1
  • CUDA version: 11.7

Your Environment

  • Installed using source? [yes/no]: no
  • Are you planning to deploy it using docker container? [yes/no]: no
  • Is it a CPU or GPU environment?: GPU
  • Which example are you using: mnist_hogwild
  • Link to code or data to repro [if any]:

Expected Behavior

Insert a print at the start of train.train to check that the parameter has been copied to the subprocess correctly:

    print(f"Norm was: {model.fc1.weight.norm().item()}")

This should print a nonzero, random-looking number. When I run without CUDA, it does:

>python main.py
Norm was: 4.082266807556152
Norm was: 4.081115245819092
... [training begins]

Current Behavior

When I run with CUDA, the tensor is all zeros:

>python main.py --cuda
Norm was: 0.0
Norm was: 0.0
... [training begins]

Repro

I think this is not a problem with the example but a problem with the underlying torch.multiprocessing, or with my installation. The issue seems to be that any CUDA tensor sent to a subprocess has its data replaced with zeros.

The steps above reproduce the issue in the mnist_hogwild example (they amount to "run it with --cuda on my device").

As an even more minimal repro, this also fails for me:

import torch as th
import torch.multiprocessing as mp

if __name__ == "__main__":
    parameter = th.randn(1, device='cuda:0')

    print(parameter)  # in the parent, parameter is a 1-element tensor holding a random number
    mp.set_start_method("spawn")

    p = mp.Process(target=print, args=(parameter,))  # the subprocess prints a 1-element zero tensor instead
    p.start()
    p.join()

[Edited to simplify repro code]
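
One way to narrow this down is to send a CPU copy of the tensor to the subprocess and move it to the GPU there, instead of passing the CUDA tensor itself. As far as I know, the PyTorch Windows FAQ states that CUDA IPC operations are not supported on Windows and suggests sharing CPU tensors instead, which may be related to the zeros seen above. The following is a minimal sketch under that assumption, not a confirmed fix; the helper names `report_norm` and `norm_seen_by_child` are hypothetical and not part of the mnist_hogwild example.

```python
import torch
import torch.multiprocessing as mp


def report_norm(cpu_tensor, device, queue):
    """Runs in the child: move the received CPU tensor to `device` and report its norm."""
    local = cpu_tensor.to(device)
    queue.put(local.norm().item())


def norm_seen_by_child(tensor, device):
    """Send a CPU copy of `tensor` to a spawned child and return the norm the child computes."""
    ctx = mp.get_context("spawn")
    queue = ctx.Queue()
    cpu_copy = tensor.detach().cpu()  # share CPU data rather than a CUDA handle
    p = ctx.Process(target=report_norm, args=(cpu_copy, device, queue))
    p.start()
    result = queue.get()  # read the result before joining the child
    p.join()
    return result


if __name__ == "__main__":
    device = "cuda:0" if torch.cuda.is_available() else "cpu"
    parameter = torch.randn(4, device=device)
    print(f"Parent norm: {parameter.norm().item()}")
    print(f"Child norm:  {norm_seen_by_child(parameter, device)}")
```

If the two printed norms agree here while the original repro still prints zeros, that would point at the cross-process CUDA tensor sharing rather than at the installation as a whole. Note that this sketch copies the data, so the child no longer updates the parent's GPU memory in place as Hogwild-style training would expect.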

Hi @dfarhi, I'm not able to reproduce it with torch==1.12.1+cu102 on Ubuntu 22.04 LTS. Is it still reproducible on your side?