mlverse / torch

R Interface to Torch

Home Page: https://torch.mlverse.org


MPS tensor fails in nn_embedding with wrong error code

deppemj opened this issue · comments

First GitHub issue ever; sorry if this is posted incorrectly.

I'm attempting to recreate Karpathy's transformer network and can get it working with torch in R:
link to video

When I attempt to speed it up with the MPS device on my M1 MBP, I can see that all my tensors are of the MPS float type:
[ MPSFloatType{256,256} ][ requires_grad = TRUE ]

I also get torch::backends_mps_is_available() == TRUE, so I know that MPS should work on my MBP.
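
For reference, this is roughly the check I ran (a minimal sketch, assuming a fresh R session):

library(torch)
# TRUE here means the MPS backend is available on this machine
backends_mps_is_available()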

I have set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 for my model, but I continue to get this same error. The link in the error takes me to the main MPS page, which lists the item as complete. It shows that this op was implemented by Daniel himself, and his Mac file path also shows in the error, but I don't know GitHub well enough to trace those errors all the way up the tree.
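
For completeness, this is roughly how I set it (my assumption being that it needs to be set before torch is loaded so libtorch picks it up):

# set the fallback flag before torch/libtorch initializes
Sys.setenv(PYTORCH_ENABLE_MPS_FALLBACK = "1")
library(torch)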

If this is posted incorrectly or I made an error, I apologize again; this is a new hobby for me and I'm not a dev by trade. It's also possible this just isn't implemented yet and I'm reporting an error on code that's still in the works. I'm very willing to help with this issue if given direction, but I don't know much about the codebase, so I'm afraid I would get in over my head.

Error message (the attachment has the full traceback):
Error in (function (weight, indices, padding_idx, scale_grad_by_freq, :
The operator 'aten::signbit.out' is not currently implemented for the MPS device. If you want this op to be added in priority during the prototype phase of this feature, please comment on pytorch/pytorch#77764. As a temporary fix, you can set the environment variable PYTORCH_ENABLE_MPS_FALLBACK=1 to use the CPU as a fallback for this op. WARNING: this will be slower than running natively on MPS.
Exception raised from mps_error_fallback at /Users/dfalbel/Documents/actions-runner/mlverse-m1/_work/libtorch-mac-m1/libtorch-mac-m1/pytorch/aten/src/ATen/mps/MPSFallback.mm:22 (most recent call first):
frame #0: at::mps_error_fallback(c10::OperatorHandle const&, std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue>>) + 212 (0x17aba2fbc in libtorch_cpu.dylib)
frame #1: void c10::BoxedKernel::make_boxed_function<&at::mps_fallback(c10::OperatorHandle const&, std::__1::vector<c10::IValue, std::__1::allocator<c10::IValue>>)>(c10::OperatorKernel*, c10:

full_error.txt

Hi @deppemj,

Thanks for reporting. Could you provide a little code snippet that raises this error?
This is indeed not expected, as we definitely use nn_embedding layers in other contexts and have run them on MPS devices.
For instance:

> embedding <- nn_embedding(10, 32)$to(device="mps")
> y <- torch_randint(1, 10, size = 10)$to(dtype='int', device="mps")
> embedding(y)
[W MPSFallback.mm:11] Warning: The operator 'aten::signbit.out' is not currently supported on the MPS backend and will fall back to run on the CPU. This may have performance implications. (function operator())
torch_tensor
Columns 1 to 10
-0.1047  0.5381  0.9465  0.5106 -0.2772  1.3855  0.4091  0.3425  0.1102  0.4185
-0.3127 -1.5091 -1.0605 -0.8537 -0.2804 -0.4280  0.5683 -1.8864  0.0713  0.3933
-1.5547  1.0484 -1.6178 -1.8120 -0.9372 -0.3467  1.0830  1.2754 -2.2991 -2.9057
 0.0588 -0.5669  0.3621 -1.9154  2.1674 -0.1544  0.6144  0.1823 -0.0868 -1.5684
 2.7622  1.6805 -0.4312 -0.4746  0.5691 -0.0334  0.0998 -1.4412 -0.6447 -0.0889
-0.2622  0.1385  0.0755  0.2418 -0.7958 -1.5408  0.0668 -0.5806  0.8378  0.6069
 0.4019 -0.8076 -1.5268 -0.2339 -0.1726 -0.5274  0.5157  1.4383 -1.6282 -0.2371
-0.6276 -0.7738  0.1915  0.2061  0.3176 -1.4719 -2.1347  0.2900 -1.2652 -0.1229
-0.1047  0.5381  0.9465  0.5106 -0.2772  1.3855  0.4091  0.3425  0.1102  0.4185
-0.6276 -0.7738  0.1915  0.2061  0.3176 -1.4719 -2.1347  0.2900 -1.2652 -0.1229

Columns 11 to 20
 0.2461 -0.9786  2.0182  1.5225 -0.5319  0.4745  0.6939  1.7658 -1.2444  0.6902
-0.1907  0.3994  1.9738 -1.0829  0.9233 -0.8986  0.4413  0.6635  1.4770 -0.9239
-0.3905 -0.2486 -0.0924  0.6239 -0.0630 -0.5861 -0.1037  0.9312  0.1882  1.3699
 0.5544 -0.2131  2.1383 -0.2043  0.6309 -1.1926  2.3989  0.7244 -0.0018  0.3362
 0.7716  0.8150 -0.6627 -1.3166 -0.7178 -1.0793  0.5535 -0.2834 -0.2037 -0.2363
 0.3713 -0.2135  2.2387  0.3831  1.0139 -0.9339 -0.1627 -0.1378 -0.0914  1.0594
-3.8547  1.1019  0.7977 -0.2849  0.2517  2.0009  0.7167  2.4329 -0.6236 -1.3862
 1.7182 -0.5215  0.3491 -1.3711 -0.9323  0.3547 -0.0533  0.5846 -0.6284 -2.0010
 0.2461 -0.9786  2.0182  1.5225 -0.5319  0.4745  0.6939  1.7658 -1.2444  0.6902
 1.7182 -0.5215  0.3491 -1.3711 -0.9323  0.3547 -0.0533  0.5846 -0.6284 -2.0010

Columns 21 to 30
 1.6911  0.2855  1.6804  0.3766 -0.8043 -0.1542 -0.3814 -1.1207 -1.5666  0.8512
 1.3296 -0.5811  1.5949  0.4076 -0.0072  1.8948 -0.2623 -0.8181 -1.2143 -0.8192
 0.3720  1.1429 -0.6983 -0.0714  0.4232  0.2405  1.7843 -0.4306  0.2451 -0.5087
 0.4996 -1.0870 -0.6207  0.3139  0.0044 -1.0243 -0.7959 -0.5902 -0.0128  0.7566
 1.3716  2.4713 -0.4472 -0.7671 -0.1087 -1.7167  0.1790  0.8206 -1.2912 -0.1824
-1.2127 -0.5383 -0.2472 -1.3087  0.8679 -0.5583  0.5441  0.3513 -0.1868  0.5586
 0.6093  1.6963 -2.8325 -1.0384 -1.0545  0.7049  1.0245 -0.3509  1.1126 -0.9042
 0.1676  0.3383 -1.6388 -0.4299 -0.6024  1.0387 -0.9491 -0.3108  0.2988 -0.1164
... [the output was truncated (use n=-1 to disable)]
[ MPSFloatType{10,32} ][ grad_fn = <EmbeddingBackward0> ]

This code recreates the issue. The dataset has just been replaced with a simple sequence; hopefully that isn't a problem.

train_data_mps <- torch_tensor(seq(1:182771), device = "mps", dtype = torch_long())

ix <- torch_randint(1, length(train_data_mps) - 256, c(32), device = "mps")

xb <- torch_stack(sapply(as.integer(ix), function(i) train_data_mps[i:(i + 256 - 1)]))

sample_embedding <- nn_embedding(num_embeddings = 256, embedding_dim = 256)
tok_emb <- sample_embedding(torch_tensor(xb, device = "mps"))

I am not sure I follow what exactly the dataset is, but you must move nn_modules to the MPS device before using them with MPS tensors (such as xb), e.g.:

sample_embedding <- nn_embedding(num_embeddings = 256,embedding_dim = 256)$to(device="mps")
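
Putting that together with your snippet, something along these lines should work. This is just a sketch (not run end to end); I'm assuming a vocabulary of 256 tokens and substituting random token ids for your dataset so the indices stay within num_embeddings:

library(torch)

num_embeddings <- 256   # assumed vocabulary size for this sketch
block_size <- 256
batch_size <- 32

# random token ids stand in for the real dataset; they must lie within [1, num_embeddings]
train_data <- torch_randint(1, num_embeddings, 182771, dtype = torch_long(), device = "mps")

ix <- torch_randint(1, length(train_data) - block_size, batch_size, device = "mps")
xb <- torch_stack(lapply(as.integer(ix), function(i) train_data[i:(i + block_size - 1)]))

# key point: move the module itself to the MPS device before calling it on MPS tensors
sample_embedding <- nn_embedding(num_embeddings = num_embeddings, embedding_dim = 256)$to(device = "mps")
tok_emb <- sample_embedding(xb)  # no need to wrap xb in torch_tensor() again
tok_emb$device                   # should report the mps device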