MNIST Hogwild on Apple Silicon
jeffreykthomas opened this issue · comments
Any help would be appreciated! I'm unable to run multiprocessing with the mps device.
Context
- PyTorch version: 2.0.0.dev20221220
- Operating System and version: macOS 13.1
Your Environment
- Installed using source? [yes/no]: no
- Are you planning to deploy it using docker container? [yes/no]: no
- Is it a CPU or GPU environment?: Trying to use GPU
- Which example are you using: MNIST Hogwild
- Link to code or data to repro [if any]: https://github.com/pytorch/examples/tree/main/mnist_hogwild
Expected Behavior
Adding argument --mps should result in training with GPU
Current Behavior
RuntimeError: _share_filename_: only available on CPU
Traceback (most recent call last):
File "/Volumes/Main/pytorch/main.py", line 87, in <module>
model.share_memory() # gradients are allocated lazily, so they are not shared here
File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2340, in share_memory
return self._apply(lambda t: t.share_memory_())
File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 784, in _apply
module._apply(fn)
File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 807, in _apply
param_applied = fn(param)
File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2340, in <lambda>
return self._apply(lambda t: t.share_memory_())
File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/_tensor.py", line 616, in share_memory_
self._typed_storage()._share_memory_()
File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/storage.py", line 701, in _share_memory_
self._untyped_storage.share_memory_()
File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/storage.py", line 209, in share_memory_
self._share_filename_cpu_()
RuntimeError: _share_filename_: only available on CPU
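The failure can be reproduced outside the example in a few lines (a minimal sketch; the MPS branch only runs on a machine where MPS is actually available):

```python
import torch

# share_memory_() moves a tensor's storage into OS shared memory so that
# other processes can observe in-place updates. That machinery is
# implemented only for CPU storage, which is why the Hogwild example's
# model.share_memory() call fails for a model on the "mps" device.
t_cpu = torch.zeros(3)
t_cpu.share_memory_()      # fine: CPU storage supports file-backed sharing
assert t_cpu.is_shared()

if torch.backends.mps.is_available():
    t_mps = torch.zeros(3, device="mps")
    try:
        t_mps.share_memory_()
    except RuntimeError as err:
        print(err)         # _share_filename_: only available on CPU
```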
Possible Solution
Steps to Reproduce
- Clone repo
- Run with --mps on Apple M1 Ultra
...
Please check the allowed arguments:
minst % python main.py -h
usage: main.py [-h] [--batch-size N] [--test-batch-size N] [--epochs N]
[--lr LR] [--gamma M] [--no-cuda] [--no-mps] [--dry-run]
[--seed S] [--log-interval N] [--save-model]
PyTorch MNIST Example
optional arguments:
-h, --help show this help message and exit
--batch-size N input batch size for training (default: 64)
--test-batch-size N input batch size for testing (default: 1000)
--epochs N number of epochs to train (default: 14)
--lr LR learning rate (default: 1.0)
--gamma M Learning rate step gamma (default: 0.7)
--no-cuda disables CUDA training
--no-mps disables macOS GPU training
--dry-run quickly check a single pass
--seed S random seed (default: 1)
--log-interval N how many batches to wait before logging training status
--save-model For Saving the current Model
Appreciate the response @ewtang, but I was trying the MNIST Hogwild example, which I think is different from the plain PyTorch MNIST example and thus has different arguments... I changed the title of the issue to be more specific.
Hi @jeffreykthomas, please check this: pytorch/pytorch#87688.
I had the same problem and solved it by:
- keeping everything involved in the multi-process setup (jobs/functions/params) on the CPU
- moving the model/params/data to MPS only inside each worker process's job/function:

model = model.to(torch.device("mps"))
in_tensor = in_tensor.to(torch.device("mps"))

(Note that tensor .to() is not in-place, so the result must be assigned back.) In general, cross-process parameter updates then have to be implemented by other means, since MPS tensors cannot be placed in shared memory.
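The pattern above can be sketched as follows. This is an illustrative sketch, not the example's actual code: `train_worker`, the tiny `nn.Linear` model, and the explicit copy-back of weights are all assumptions standing in for Hogwild's in-place gradient sharing, which is unavailable for MPS tensors.

```python
import torch
import torch.multiprocessing as mp
import torch.nn as nn


def train_worker(rank, shared_model):
    # The shared model stays on CPU; each worker keeps a private copy on
    # MPS (falling back to CPU where MPS is unavailable).
    device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
    local_model = nn.Linear(4, 2).to(device)        # must match shared_model
    local_model.load_state_dict(shared_model.state_dict())  # CPU -> device copy

    opt = torch.optim.SGD(local_model.parameters(), lr=0.1)
    x = torch.randn(8, 4, device=device)            # data moved inside the worker
    loss = local_model(x).sum()
    loss.backward()
    opt.step()

    # Push updated weights back to the shared CPU model so other processes
    # can observe them (replaces Hogwild's in-place shared gradients).
    with torch.no_grad():
        for shared_p, local_p in zip(shared_model.parameters(),
                                     local_model.parameters()):
            shared_p.copy_(local_p.to("cpu"))


if __name__ == "__main__":
    shared_model = nn.Linear(4, 2)   # created on CPU ...
    shared_model.share_memory()      # ... so share_memory() succeeds
    procs = [mp.Process(target=train_worker, args=(rank, shared_model))
             for rank in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

With this structure only CPU storage crosses process boundaries, so `share_memory()` never touches an MPS tensor; the cost is that workers synchronize through explicit device-to-CPU copies rather than true lock-free shared updates.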