pytorch / examples

A set of examples around PyTorch in Vision, Text, Reinforcement Learning, etc.

Home Page: https://pytorch.org/examples

MNIST Hogwild on Apple Silicon

jeffreykthomas opened this issue · comments

Any help would be appreciated! I'm unable to run multiprocessing with the mps device.

Context

  • PyTorch version: 2.0.0.dev20221220
  • Operating System and version: macOS 13.1

Your Environment

  • Installed using source? [yes/no]: no
  • Are you planning to deploy it using docker container? [yes/no]: no
  • Is it a CPU or GPU environment?: Trying to use GPU
  • Which example are you using: MNIST Hogwild
  • Link to code or data to repro [if any]: https://github.com/pytorch/examples/tree/main/mnist_hogwild

Expected Behavior

Adding the --mps argument should result in training on the GPU.
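
For context, a flag like --mps typically just switches the target device, roughly like this (a sketch; the exact flag handling in mnist_hogwild may differ):

    import torch

    use_mps = torch.backends.mps.is_available()
    device = torch.device("mps" if use_mps else "cpu")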

Current Behavior

RuntimeError: _share_filename_: only available on CPU

Traceback (most recent call last):
  File "/Volumes/Main/pytorch/main.py", line 87, in <module>
    model.share_memory()  # gradients are allocated lazily, so they are not shared here
  File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2340, in share_memory
    return self._apply(lambda t: t.share_memory_())
  File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 784, in _apply
    module._apply(fn)
  File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 807, in _apply
    param_applied = fn(param)
  File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2340, in <lambda>
    return self._apply(lambda t: t.share_memory_())
  File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/_tensor.py", line 616, in share_memory_
    self._typed_storage()._share_memory_()
  File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/storage.py", line 701, in _share_memory_
    self._untyped_storage.share_memory_()
  File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/storage.py", line 209, in share_memory_
    self._share_filename_cpu_()
RuntimeError: _share_filename_: only available on CPU

Possible Solution

Steps to Reproduce

  1. Clone repo
  2. Run with --mps on an Apple M1 Ultra (a minimal sketch of the failing call is below)
    ...
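
For reference, the failure can be reproduced outside the example with a minimal sketch like this (nn.Linear stands in for the example's Net):

    import torch
    import torch.nn as nn

    device = torch.device("mps")          # Apple Silicon GPU
    model = nn.Linear(10, 10).to(device)  # stand-in for the example's Net
    model.share_memory()  # RuntimeError: _share_filename_: only available on CPU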
commented

Please check the allowed arguments:

mnist % python main.py -h
usage: main.py [-h] [--batch-size N] [--test-batch-size N] [--epochs N]
               [--lr LR] [--gamma M] [--no-cuda] [--no-mps] [--dry-run]
               [--seed S] [--log-interval N] [--save-model]

PyTorch MNIST Example

optional arguments:
  -h, --help           show this help message and exit
  --batch-size N       input batch size for training (default: 64)
  --test-batch-size N  input batch size for testing (default: 1000)
  --epochs N           number of epochs to train (default: 14)
  --lr LR              learning rate (default: 1.0)
  --gamma M            Learning rate step gamma (default: 0.7)
  --no-cuda            disables CUDA training
  --no-mps             disables macOS GPU training
  --dry-run            quickly check a single pass
  --seed S             random seed (default: 1)
  --log-interval N     how many batches to wait before logging training status
  --save-model         For Saving the current Model

Appreciate the response @ewtang, but I was trying the MNIST Hogwild example, which I think is different from the basic PyTorch MNIST example and thus has different arguments... I changed the title of the issue to be more specific.

commented

Hi @jeffreykthomas, please check this: pytorch/pytorch#87688.

I had the same problem and solved it by:

  1. keeping all multiprocessing setup (processes, shared objects, parameters passed between processes) on the CPU
  2. moving the model/params/data to MPS inside the multiprocessing job/function:
    model.to(torch.device("mps"))
    in_tensor = in_tensor.to(torch.device("mps"))
    (note that for plain tensors .to() is not in-place, so the result has to be reassigned)

In general, inter-process communication needs to be handled through other means to make sure parameter updates propagate correctly.
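
A rough sketch of that pattern, keeping the shared model on the CPU and moving a per-process copy to MPS inside the worker (train and the nn.Linear stand-in are illustrative, not the example's actual code):

    import torch
    import torch.multiprocessing as mp
    import torch.nn as nn

    def train(shared_model):
        device = torch.device("mps")
        # per-process copy on the GPU; the shared copy stays on the CPU
        local_model = nn.Linear(10, 10)
        local_model.load_state_dict(shared_model.state_dict())
        local_model.to(device)
        # ... train on `device`; updated parameters then have to be copied
        # back to `shared_model` (CPU) explicitly, since CPU shared memory
        # is the only storage the processes actually share

    if __name__ == "__main__":
        shared_model = nn.Linear(10, 10)  # stays on CPU
        shared_model.share_memory()       # fine: shared memory is CPU-only
        processes = []
        for _ in range(2):
            p = mp.Process(target=train, args=(shared_model,))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()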