pytorch / examples

A set of examples around PyTorch in Vision, Text, Reinforcement Learning, etc.

Home Page: https://pytorch.org/examples

MNIST Hogwild on Apple Silicon

jeffreykthomas opened this issue · comments

Any help would be appreciated! I'm unable to run multiprocessing with the mps device.

Context

  • PyTorch version: 2.0.0.dev20221220
  • Operating System and version: macOS 13.1

Your Environment

  • Installed using source? [yes/no]: no
  • Are you planning to deploy it using docker container? [yes/no]: no
  • Is it a CPU or GPU environment?: Trying to use GPU
  • Which example are you using: MNIST Hogwild
  • Link to code or data to repro [if any]: https://github.com/pytorch/examples/tree/main/mnist_hogwild

Expected Behavior

Adding the --mps argument should result in training on the GPU.
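
For context, a flag like --mps typically just switches the target device, roughly like this (a sketch; the exact flag handling in mnist_hogwild may differ):

    import torch

    use_mps = torch.backends.mps.is_available()
    device = torch.device("mps" if use_mps else "cpu")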

Current Behavior

RuntimeError: _share_filename_: only available on CPU

Traceback (most recent call last):
  File "/Volumes/Main/pytorch/main.py", line 87, in <module>
    model.share_memory()  # gradients are allocated lazily, so they are not shared here
  File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2340, in share_memory
    return self._apply(lambda t: t.share_memory_())
  File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 784, in _apply
    module._apply(fn)
  File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 807, in _apply
    param_applied = fn(param)
  File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/nn/modules/module.py", line 2340, in <lambda>
    return self._apply(lambda t: t.share_memory_())
  File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/_tensor.py", line 616, in share_memory_
    self._typed_storage()._share_memory_()
  File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/storage.py", line 701, in _share_memory_
    self._untyped_storage.share_memory_()
  File "/Users/jeffreythomas/opt/anaconda3/envs/pytorch/lib/python3.9/site-packages/torch/storage.py", line 209, in share_memory_
    self._share_filename_cpu_()
RuntimeError: _share_filename_: only available on CPU

Possible Solution

Steps to Reproduce

  1. Clone repo
  2. Run with --mps on an Apple M1 Ultra (a minimal sketch of the failing call is below)
    ...
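
For reference, the failure can be reproduced outside the example with a minimal sketch like this (nn.Linear stands in for the example's Net):

    import torch
    import torch.nn as nn

    device = torch.device("mps")          # Apple Silicon GPU
    model = nn.Linear(10, 10).to(device)  # stand-in for the example's Net
    model.share_memory()  # RuntimeError: _share_filename_: only available on CPU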
commented

Please check the allowed arguments:

mnist % python main.py -h
usage: main.py [-h] [--batch-size N] [--test-batch-size N] [--epochs N]
               [--lr LR] [--gamma M] [--no-cuda] [--no-mps] [--dry-run]
               [--seed S] [--log-interval N] [--save-model]

PyTorch MNIST Example

optional arguments:
  -h, --help           show this help message and exit
  --batch-size N       input batch size for training (default: 64)
  --test-batch-size N  input batch size for testing (default: 1000)
  --epochs N           number of epochs to train (default: 14)
  --lr LR              learning rate (default: 1.0)
  --gamma M            Learning rate step gamma (default: 0.7)
  --no-cuda            disables CUDA training
  --no-mps             disables macOS GPU training
  --dry-run            quickly check a single pass
  --seed S             random seed (default: 1)
  --log-interval N     how many batches to wait before logging training status
  --save-model         For Saving the current Model

Appreciate the response @ewtang, but I was trying the MNIST Hogwild example, which I think is different from the basic PyTorch MNIST example and thus has different arguments... I changed the title of the issue to be more specific.

commented

Hi @jeffreykthomas, please check this: pytorch/pytorch#87688.

I had the same problem and solved it by:

  1. keeping all multiprocessing setup (processes, shared objects, parameters passed between processes) on the CPU
  2. moving the model/params/data to MPS inside the multiprocessing job/function:
    model.to(torch.device("mps"))
    in_tensor = in_tensor.to(torch.device("mps"))
    (note that for plain tensors .to() is not in-place, so the result has to be reassigned)

In general, inter-process communication needs to be handled through other means to make sure parameter updates propagate correctly.
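
A rough sketch of that pattern, keeping the shared model on the CPU and moving a per-process copy to MPS inside the worker (train and the nn.Linear stand-in are illustrative, not the example's actual code):

    import torch
    import torch.multiprocessing as mp
    import torch.nn as nn

    def train(shared_model):
        device = torch.device("mps")
        # per-process copy on the GPU; the shared copy stays on the CPU
        local_model = nn.Linear(10, 10)
        local_model.load_state_dict(shared_model.state_dict())
        local_model.to(device)
        # ... train on `device`; updated parameters then have to be copied
        # back to `shared_model` (CPU) explicitly, since CPU shared memory
        # is the only storage the processes actually share

    if __name__ == "__main__":
        shared_model = nn.Linear(10, 10)  # stays on CPU
        shared_model.share_memory()       # fine: shared memory is CPU-only
        processes = []
        for _ in range(2):
            p = mp.Process(target=train, args=(shared_model,))
            p.start()
            processes.append(p)
        for p in processes:
            p.join()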