apple / tensorflow_macos

TensorFlow for macOS 11.0+ accelerated using Apple's ML Compute framework.

I can't use the GPU on my M1 MacBook Pro

ElaheHkh opened this issue

I am trying to use TensorFlow on the new MacBook Pro M1, but I can't find the GPU. I tried downloading and installing https://github.com/apple/tensorflow_macos/releases both manually and with the install script, but it didn't work for me.
I'm confused 😔
Screen Shot 1400-01-26 at 04 34 32

Hey! Try to disable eager execution:

from tensorflow.python.framework.ops import disable_eager_execution
disable_eager_execution()

Then set the device to GPU.
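For reference, the "set the device" step uses the mlcompute helper that ships with this fork. This is a minimal sketch; the same calls appear in the longer examples further down this thread:

# Minimal sketch: graph mode plus ML Compute device selection.
import tensorflow as tf
from tensorflow.python.compiler.mlcompute import mlcompute

tf.compat.v1.disable_eager_execution()        # ML Compute acceleration needs graph mode
mlcompute.set_mlc_device(device_name='gpu')   # accepts 'cpu', 'gpu', or 'any'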

I tried that, but it doesn't work for me.

Hello, have you solved this issue? I have the same problem, the GPU is not working on my Mac M1.
Screen Shot 2021-04-15 at 8 17 08 PM

Here I have compared Google Colab and Mac M1 running times per epoch, both on CPU:

  • Screen Shot 2021-04-15 at 8 55 13 PM
  • Screen Shot 2021-04-15 at 9 00 34 PM

I can run Keras on the GPU, but not torch.
You can check GPU usage with Activity Monitor.
Screen Shot 1400-01-26 at 21 26 53

  • Before running the code:
    Screen Shot 2021-04-15 at 9 56 47 PM
  • During training:
    Screen Shot 2021-04-15 at 9 57 47 PM

As you can see, only the CPU is being loaded. @ElaheHkh, how have you enabled the GPU?

I downloaded tensorflow_macos from https://github.com/apple/tensorflow_macos/releases, moved it to /user/, and then ran these instructions:

% tar xvzf tensorflow_macos-${VERSION}.tar
% cd tensorflow_macos
% ./install_venv.sh --prompt

cd
cd tensorflow_macos
bash install_venv.sh --prompt

conda install -c conda-forge -y absl-py
conda install -c conda-forge -y astunparse
conda install -c conda-forge -y gast
conda install -c conda-forge -y opt_einsum
conda install -c conda-forge -y termcolor
conda install -c conda-forge -y typing_extensions
conda install -c conda-forge -y wheel
conda install -c conda-forge -y typeguard

pip install --upgrade --no-dependencies --force grpcio-1.33.2-cp38-cp38-macosx_11_0_arm64.whl

pip install --upgrade --no-dependencies --force h5py-2.10.0-cp38-cp38-macosx_11_0_arm64.whl

pip install --upgrade --no-dependencies --force numpy-1.18.5-cp38-cp38-macosx_11_0_arm64.whl

pip install --upgrade --no-dependencies --force tensorflow_addons_macos-0.1a3-cp38-cp38-macosx_11_0_arm64.whl

pip install --upgrade --no-dependencies --force tensorflow_macos-0.1a3-cp38-cp38-macosx_11_0_arm64.whl

pip install --upgrade --no-dependencies --force tensorflow_addons-0.11.2+mlcompute-cp38-cp38-macosx_11_0_arm64.whl

pip install pyopencl
pip install --upgrade google-api-python-client
pip install absl-py
pip install wrapt
pip install monotonic
pip install netifaces
pip install astunparse
pip install flatbuffers
pip install gast
pip install google_pasta

pip install keras_preprocessing
pip install opt_einsum
pip install protobuf
pip install tensorflow_estimator
pip install termcolor
pip install typing_extensions
pip install wheel
pip install tensorboard
pip install typeguard
pip install tqdm
conda install torchvision -c pytorch
pip install tensorflow_datasets
pip3 install git+https://github.com/geohot/tinygrad.git --upgrade

pip install tensorboard
pip install cython
git clone https://github.com/pandas-dev/pandas.git
cd pandas
python3 setup.py install
pip install ipywidgets
conda update -n base conda
conda install pytorch torchvision -c pytorch
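After all of that, a quick way to confirm that the environment actually picked up the Apple fork is a short check like the one below. This is a minimal sketch using only functions that appear later in this thread:

# Sanity check: does this environment have the ML Compute-enabled TensorFlow?
import tensorflow as tf
from tensorflow.python.compiler.mlcompute import mlcompute

print("TensorFlow version:", tf.__version__)
print("is_apple_mlc_enabled:", mlcompute.is_apple_mlc_enabled())
print("is_tf_compiled_with_apple_mlc:", mlcompute.is_tf_compiled_with_apple_mlc())
print("Logical devices:", tf.config.list_logical_devices())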

I can run Keras on the GPU, but not torch

So you got tf.keras using the GPU working? Can you please run this one and take a screenshot of the GPU load?

import os
#os.environ["TF_DISABLE_MLC"] = "1"
#os.environ["TF_MLC_LOGGING"] = "1"

from tensorflow.python.compiler.mlcompute import mlcompute
#mlcompute.set_mlc_device(device_name='gpu')
print("is_apple_mlc_enabled %s" % mlcompute.is_apple_mlc_enabled())
print("is_tf_compiled_with_apple_mlc %s" % mlcompute.is_tf_compiled_with_apple_mlc())

import tensorflow as tf
from tensorflow.keras import datasets, layers, models
tf.compat.v1.disable_eager_execution()

print(f"eagerly? {tf.executing_eagerly()}")
print(tf.config.list_logical_devices())

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels))

Regarding the device list, I found this post from an Apple contributor.

I'm troubleshooting the same issue and just ran this for fun. On a fresh install of this fork of TF2, my M1 Mac mini uses 21% of CPU and 11% GPU.

is_tf_compiled_with_apple_mlc True
eagerly? False
[LogicalDevice(name='/device:CPU:0', device_type='CPU')]
Downloading data from https://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
170500096/170498071 [==============================] - 132s 1us/step
Train on 50000 samples, validate on 10000 samples
Epoch 1/10
49952/50000 [============================>.] - ETA: 0s - loss: 1.5267 - accuracy: 0.4439
50000/50000 [==============================] - 11s 223us/sample - loss: 1.5263 - accuracy: 0.4440 - val_loss: 1.2340 - val_accuracy: 0.5560
Epoch 2/10
50000/50000 [==============================] - 11s 220us/sample - loss: 1.1626 - accuracy: 0.5901 - val_loss: 1.0764 - val_accuracy: 0.6136
Epoch 3/10
50000/50000 [==============================] - 11s 222us/sample - loss: 1.0211 - accuracy: 0.6416 - val_loss: 1.0145 - val_accuracy: 0.6461
Epoch 4/10
50000/50000 [==============================] - 11s 221us/sample - loss: 0.9212 - accuracy: 0.6781 - val_loss: 0.9443 - val_accuracy: 0.6719
Epoch 5/10
50000/50000 [==============================] - 11s 222us/sample - loss: 0.8490 - accuracy: 0.7023 - val_loss: 0.9732 - val_accuracy: 0.6604
Epoch 6/10
50000/50000 [==============================] - 11s 222us/sample - loss: 0.7897 - accuracy: 0.7226 - val_loss: 0.9129 - val_accuracy: 0.6858
Epoch 7/10
50000/50000 [==============================] - 11s 221us/sample - loss: 0.7419 - accuracy: 0.7397 - val_loss: 0.9174 - val_accuracy: 0.6886
Epoch 8/10
50000/50000 [==============================] - 11s 221us/sample - loss: 0.7016 - accuracy: 0.7530 - val_loss: 0.8932 - val_accuracy: 0.6997
Epoch 9/10
50000/50000 [==============================] - 11s 221us/sample - loss: 0.6640 - accuracy: 0.7677 - val_loss: 0.8965 - val_accuracy: 0.6969
Epoch 10/10
50000/50000 [==============================] - 11s 223us/sample - loss: 0.6317 - accuracy: 0.7792 - val_loss: 0.8599 - val_accuracy: 0.7099

@ManuelSchneid3r Ran your code on my M1 MBA 8/512 and here are the CPU/GPU usages:
Screenshot 2021-05-09 at 9 34 15 PM

@ManuelSchneid3r Edited your code to enable GPU, here are the charts:
Screenshot 2021-05-09 at 9 43 36 PM

CPU per epoch time = 13s, GPU per epoch time = 10s

Can you share that edited code, or any other pointers you used to actually get the GPU to run quickly? It seems like you've accomplished what a lot of us have been having trouble with: we either see a very low amount of CPU and almost no GPU, or the GPU is fully used but at speeds a few orders of magnitude slower than just using the CPU in eager mode.

When I disable eager mode and set the device to GPU, it would probably take a week to run that code. I'm very curious about what you did differently.

Really appreciate the help, this is huge!

Here's the code I use to run with GPU:

#import os
#os.environ["TF_DISABLE_MLC"] = "1"
#os.environ["TF_MLC_LOGGING"] = "1"
import tensorflow as tf
from tensorflow.python.compiler.mlcompute import mlcompute

tf.compat.v1.disable_eager_execution()
mlcompute.set_mlc_device(device_name='gpu')
print("is_apple_mlc_enabled %s" % mlcompute.is_apple_mlc_enabled())
print("is_tf_compiled_with_apple_mlc %s" % mlcompute.is_tf_compiled_with_apple_mlc())
print(f"eagerly? {tf.executing_eagerly()}")
print(tf.config.list_logical_devices())

from tensorflow.keras import datasets, layers, models

(train_images, train_labels), (test_images, test_labels) = datasets.cifar10.load_data()
train_images, test_images = train_images / 255.0, test_images / 255.0
class_names = ['airplane', 'automobile', 'bird', 'cat', 'deer',
               'dog', 'frog', 'horse', 'ship', 'truck']
model = models.Sequential()
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(32, 32, 3)))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10))
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
history = model.fit(train_images, train_labels, epochs=10,
                    validation_data=(test_images, test_labels))

The only thing I did was to uncomment the mlcompute set GPU statement and reorder some lines for readability.
The output is as shown:

(m1) $ python tf_m1_test.py 
is_apple_mlc_enabled True
is_tf_compiled_with_apple_mlc True
eagerly? False
[LogicalDevice(name='/device:CPU:0', device_type='CPU')]
Train on 50000 samples, validate on 10000 samples
2021-05-09 21:48:18.948717: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:196] None of the MLIR optimization passes are enabled (registered 0 passes)
2021-05-09 21:48:18.953797: W tensorflow/core/platform/profile_utils/cpu_utils.cc:126] Failed to get CPU frequency: 0 Hz
Epoch 1/10
49856/50000 [============================>.] - ETA: 0s - loss: 1.5445 - accuracy: 0.4369/Users/dotw/miniforge3/envs/m1/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:2325: UserWarning: `Model.state_updates` will be removed in a future version. This property should not be used in TensorFlow 2.0, as `updates` are applied automatically.
  warnings.warn('`Model.state_updates` will be removed in a future version. '
50000/50000 [==============================] - 12s 247us/sample - loss: 1.5441 - accuracy: 0.4371 - val_loss: 1.3115 - val_accuracy: 0.5319
Epoch 2/10
50000/50000 [==============================] - 10s 197us/sample - loss: 1.1641 - accuracy: 0.5869 - val_loss: 1.0943 - val_accuracy: 0.6138
Epoch 3/10
50000/50000 [==============================] - 10s 198us/sample - loss: 1.0016 - accuracy: 0.6446 - val_loss: 0.9661 - val_accuracy: 0.6611
Epoch 4/10
50000/50000 [==============================] - 10s 196us/sample - loss: 0.8969 - accuracy: 0.6850 - val_loss: 0.9667 - val_accuracy: 0.6652
Epoch 5/10
50000/50000 [==============================] - 10s 194us/sample - loss: 0.8259 - accuracy: 0.7120 - val_loss: 0.9193 - val_accuracy: 0.6825
Epoch 6/10
50000/50000 [==============================] - 10s 194us/sample - loss: 0.7671 - accuracy: 0.7302 - val_loss: 0.8992 - val_accuracy: 0.6850
Epoch 7/10
50000/50000 [==============================] - 10s 198us/sample - loss: 0.7197 - accuracy: 0.7490 - val_loss: 0.9441 - val_accuracy: 0.6805
Epoch 8/10
50000/50000 [==============================] - 10s 197us/sample - loss: 0.6788 - accuracy: 0.7626 - val_loss: 0.8432 - val_accuracy: 0.7115
Epoch 9/10
50000/50000 [==============================] - 10s 198us/sample - loss: 0.6453 - accuracy: 0.7743 - val_loss: 0.8417 - val_accuracy: 0.7177
Epoch 10/10
50000/50000 [==============================] - 10s 204us/sample - loss: 0.6058 - accuracy: 0.7866 - val_loss: 0.8771 - val_accuracy: 0.7122

I am experiencing similar results with ML Compute, although on an Intel-based MacBook Pro. (See the now-closed issue #256 for background.) After being pointed to this issue (#235), I decided to run ongtw's code above.

The function tf.config.list_logical_devices() reports that the code is running on the CPU ([LogicalDevice(name='/device:CPU:0', device_type='CPU')]), as does the debugger. However, the macOS Activity Monitor suggests otherwise.

These two images are of a "resting state," prior to running ongtw's script.

Screen Shot 2021-05-10 at 11 56 12 AM

Screen Shot 2021-05-10 at 11 25 33 AM

The following two images were taken immediately after completing execution of ongtw's script:
Screen Shot 2021-05-10 at 11 31 18 AM

Screen Shot 2021-05-10 at 11 31 25 AM

The images show full utilization of the AMD device and elevated usage of Cores 1 and 3 compared to the baseline state. Total execution time was 275 seconds.

I ran a second test under eager execution, which, per the documentation, requires and automatically selects CPU processing. The two images below display the history of this run. GPU usage is similar, but the CPU load is higher. Total execution time was 300 seconds.

Screen Shot 2021-05-10 at 12 18 35 PM

Screen Shot 2021-05-10 at 12 18 30 PM

My preliminary conclusions are 1) the GPU is being used in both cases, regardless of the reported device, and 2) selecting the CPU, as in the second run, seems to increase CPU load.

Are my conclusions valid and, more importantly, is the observed GPU/CPU usage the intended behavior?

Thanks.

@DLWCMD The GPU missing from the device list is a known issue, already mentioned above. When I referenced this issue, I was not aware that you are on Intel. You said that you "Cannot Set Device" and that "regardless of settings (either 'gpu' or 'any'), the code is run on my CPU". That is why I linked you here. Well, now your GPU seems to work, and you are using Intel…

First, thanks very much for your attention to this issue and your quick responses. Also, I see from your link above that the CPU is shown as the device, even if the GPU is selected and being used. I experience the same behavior on my Intel-based system.

However, as shown in my comment above, when I enable eager execution, which forces the CPU to be selected, the GPU is also engaged. Is this the desired behavior?

I ran the CNN test script in my (non-ML Compute) Conda environment (TF 2.4.1 / Python 3.8.8) with an execution time comparable to ML Compute (roughly 280 seconds). Activity Monitor confirmed heavy use of all eight cores and no GPU activity.

By contrast, in my ML Compute environment, as shown above, the GPU is fully engaged, but supported by only four cores.

So, in this scenario at least, GPU + four cores is roughly equivalent to eight cores without ML Compute. Would you expect this on my system?

Thanks.

The plot thickens on this issue. When I run the code in #235 (comment) on my M1 Mac mini, it runs as expected, with full GPU activity in Activity Monitor. But when I use the same settings of disabling eager execution and specifying the GPU on a simpler model (a large tabular dataset with fewer layers, all densely connected), I get the behavior many others have noticed: my models train very slowly, using only a small amount of CPU and no GPU. It seems like TF is ignoring the request to use the GPU and is just running on the CPU without eager execution, which is very slow. I don't understand why, since the CNN above runs as expected but my own model doesn't. I would post my code, but the data it uses is protected (more on that below).

@arge-7 Interesting. If you don't mind sharing your code, I would be curious to run it on my M1 MBA and see if I get the same effect as you do.

The data I'm working with is protected health information, but here's a link to a Jupyter notebook that generates a similar synthetic dataset (10,000 rows, 1,000 binary categorical columns) and a continuous target variable to predict as a regression problem, like the charge for a hospitalization, for example. The notebook also contains code to create and train TF models on this synthetic data. In making this notebook to post here, I found something interesting. There are two models built and trained in the notebook. The first has hidden Dense layers of size 1,000, and the second has hidden layers of size 10,000; the only difference is these two layers being an order of magnitude apart. When I run the first one, the GPU kicks in and it runs as expected. However, with the 10,000-unit layers, the GPU stays quiet and the CPU tries to handle it while using only 20% of its capacity. Weird. (A rough sketch of this setup follows the link below.)

https://github.com/arge-7/NIS/blob/main/make_dummy_data.ipynb
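For readers who do not want to open the notebook, here is a rough sketch of the setup described above. The data shape (10,000 rows, 1,000 binary columns, continuous target) and the two hidden-layer widths (1,000 vs. 10,000) come from the description; everything else (layer count, optimizer, epochs) is illustrative, not the notebook's exact code:

# Rough sketch of the experiment described above (not the actual notebook).
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(10_000, 1_000)).astype("float32")   # binary categorical features
y = rng.normal(20_000, 5_000, size=(10_000,)).astype("float32")  # continuous target, e.g. a charge

def make_model(hidden_units):
    # Identical architecture; only the hidden-layer width differs (1,000 vs. 10,000).
    model = models.Sequential([
        layers.Dense(hidden_units, activation="relu", input_shape=(1_000,)),
        layers.Dense(hidden_units, activation="relu"),
        layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model

small = make_model(1_000)    # reportedly runs on the GPU
large = make_model(10_000)   # reportedly falls back to slow, low-utilization CPU behavior
small.fit(X, y, epochs=2, batch_size=256)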

Edit: I played around with this more and may have gained a little more insight. In my original code, which I can't share, I had done some feature engineering with sklearn's StandardScaler and PolynomialFeatures. I realized that the order I had them in didn't make sense: I was scaling and then adding the polynomial feature transformation. With only one change to my code, swapping the order so that I added the polynomial features first and then scaled the data, it switched from a wimpy CPU run to full-power GPU. So it seems like models that are particularly large or complicated, or that have more variable input data, are prone to running on the CPU. It almost seems like a memory issue, although my memory use isn't remarkable in either case. This is with 16 GB of RAM in my Mac mini.
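To make the ordering change concrete, it amounts to something like the following. This is a sketch with hypothetical names: degree=2 is an assumption, and X_train stands in for the unshareable training matrix.

# Sketch of the preprocessing-order change described above.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Original order: scale first, then expand -- this variant reportedly stayed on the CPU.
slow_prep = make_pipeline(StandardScaler(), PolynomialFeatures(degree=2, include_bias=False))

# Revised order: expand first, then scale -- this variant reportedly engaged the GPU.
fast_prep = make_pipeline(PolynomialFeatures(degree=2, include_bias=False), StandardScaler())

X_train_prepared = fast_prep.fit_transform(X_train)  # X_train: the (not shared) feature matrix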

@arge-7 I ran your Jupyter notebook code. Indeed, the second model ran very slowly. Here's why: it is actually using the GPU, but its huge size causes a lot of swapping, which kills the performance. See my CPU/GPU charts below:

cpu_gpu

@ongtw Yep, you're right, I can reproduce that. As I decreased the size of the hidden layers in the large model, I got to a point where the GPU was starting to show more activity. When I continued to gradually decrease the layer sizes, the GPU activity increased while the CPU activity decreased. My Activity Monitor memory stats don't seem to reflect this, though. Even with a massive model slowed down by the swapping, my monitor shows minimal memory pressure, around 11 GB used, 5 GB free, and around 750 MB of swap used.

I just noticed that this issue is addressed at the bottom of the readme, where it is described as paging. I also experimented with the TF_MLC_LOGGING environment variable from the readme to compare the outputs of the reasonable vs. huge models, but I didn't see any errors or even any big differences between the outputs in the terminal. Both confirm that ML Compute is using the GPU, even when it doesn't appear to because of the memory paging.

I see that the official TF repo has ways to try to limit this. I was going to try using the Apple implementation with TF_MLC_ALLOCATOR_INIT_VALUE and report my results.
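For anyone following along, both variables need to be set before TensorFlow is imported. A minimal sketch (the logging variable appears in the readme; the allocator variable is the one named above, and the value used here is just a placeholder):

# Sketch: set the ML Compute environment variables before importing TensorFlow.
import os
os.environ["TF_MLC_LOGGING"] = "1"               # verbose ML Compute logging, per the readme
os.environ["TF_MLC_ALLOCATOR_INIT_VALUE"] = "1"  # allocator setting mentioned above; value unverified

import tensorflow as tf  # import only after the variables are set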

Hi, just a quick word to let people know, even though it's a bit off topic, that on my MacBook Pro 15" 2018, TF + ML Compute does seem to actually use the GPU. Benchmarks I've run show that the speed gain is between 2x and 3x (depending on the code) compared to the CPU alone. It's really noticeable in Activity Monitor.

mnist bench on GPU
Training set contained 60000 images
Testing set contained 10000 images
Model achieved 0.88 testing accuracy
Training and testing took 48.41 seconds

mnist bench on CPU
Training set contained 60000 images
Testing set contained 10000 images
Model achieved 0.88 testing accuracy
Training and testing took 143.81 seconds

I'm really looking forward to getting a 16" M1 MacBook Pro when they're out :)

Here's the script that I used (with export TF_XLA_FLAGS=--tf_xla_enable_xla_devices):
http://pauillac.inria.fr/~seddah/fashin_mnist.py
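(The linked script is not reproduced here; below is only a rough sketch of a Fashion-MNIST benchmark that would print results in the same shape as above, with illustrative model and epoch choices.)

# Rough sketch of a Fashion-MNIST benchmark in the spirit of the results above.
import time
import tensorflow as tf
from tensorflow.python.compiler.mlcompute import mlcompute
from tensorflow.keras import datasets, layers, models

tf.compat.v1.disable_eager_execution()
mlcompute.set_mlc_device(device_name='gpu')   # switch to 'cpu' for the CPU run

(x_train, y_train), (x_test, y_test) = datasets.fashion_mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

model = models.Sequential([
    layers.Flatten(input_shape=(28, 28)),
    layers.Dense(128, activation='relu'),
    layers.Dense(10),
])
model.compile(optimizer='adam',
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

start = time.time()
model.fit(x_train, y_train, epochs=5, batch_size=128, verbose=0)
_, test_acc = model.evaluate(x_test, y_test, verbose=0)
elapsed = time.time() - start

print(f"Training set contained {len(x_train)} images")
print(f"Testing set contained {len(x_test)} images")
print(f"Model achieved {test_acc:.2f} testing accuracy")
print(f"Training and testing took {elapsed:.2f} seconds")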

Djamé

Update: once the Mac gets too hot, it seems to revert to the CPU, whose cores are of course then running at 23% of their frequency, and the Mac is barely responding. Can someone tell me if the M1 gets hot when using the GPU?

Djamé

@dseddah No, it doesn't. Even with near-100% GPU utilization, my temperature stays below 130 °F. The fans usually don't even turn on. That's the beauty of the M1, though.

Thanks, that's going to be the main reason I get one. I can't stand those fans anymore. Weirdly, it wasn't as annoying with Mojave, but since I installed Big Sur everything got weird: inexplicable slowdowns, constant overheating, etc.

@dseddah yeah, I had been using a maxed out 16 inch MacBook Pro (obviously Intel) and I just couldn’t wait to try the M1 chips any longer so I got a Mac mini to play with. I can’t even use the MacBook Pro anymore just for psychological reasons, because it’s such a bad feeling using a machine that’s four times the cost of the mini, a quarter of the power, hot to the touch, and fans at full blast. You can’t go back to using an Intel machine after using Apple silicon.

There are lots of rumors going around that the next iteration of the chip, for the next generation of MacBook Pros, is right around the corner.

@dseddah My M1 MacBook Air 8/512 does not even have a fan. 🙂
My advice is to max out the RAM since there is no way around this if you want to run large models, not even with Apple's Unified Memory Architecture.