beautiful_mnist.py: RuntimeError: Error Domain=MTLCommandBufferErrorDomain Code=1 "Discarded (victim of GPU error/recovery)"
evilsocket opened this issue
Related to #3857. I'm getting this consistently, even after rebooting, despite what's reported in #3857.
Using 6ec7dbc@master (it also happens with the latest packaged version), macOS 14.5, Apple M1 Max on a MacBook Pro 2021:
cd tinygrad && python3.11 -m pip install -e . && python3.11 examples/beautiful_mnist.py
...
...
loss: 1.14 test_accuracy: nan%: 4%|█████ | 3/70 [00:01<00:42, 1.57it/s]
Traceback (most recent call last):
File "/Users/evilsocket/tinygrad/examples/beautiful_mnist.py", line 44, in <module>
t.set_description(f"loss: {loss.item():6.2f} test_accuracy: {test_acc:5.2f}%")
^^^^^^^^^^^
File "/Users/evilsocket/tinygrad/tinygrad/tensor.py", line 3122, in _wrapper
ret = fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/Users/evilsocket/tinygrad/tinygrad/tensor.py", line 273, in item
return self._data().cast(self.dtype.fmt)[0]
^^^^^^^^^^^^
File "/Users/evilsocket/tinygrad/tinygrad/tensor.py", line 3101, in _wrapper
if _METADATA.get() is not None: return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/Users/evilsocket/tinygrad/tinygrad/tensor.py", line 244, in _data
cpu = self.cast(self.dtype.scalar()).contiguous().to("CLANG").realize()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/evilsocket/tinygrad/tinygrad/tensor.py", line 3101, in _wrapper
if _METADATA.get() is not None: return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/Users/evilsocket/tinygrad/tinygrad/tensor.py", line 203, in realize
run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
File "/Users/evilsocket/tinygrad/tinygrad/engine/realize.py", line 202, in run_schedule
ei.run(var_vals, do_update_stats=do_update_stats)
File "/Users/evilsocket/tinygrad/tinygrad/engine/realize.py", line 154, in run
et = self.prg(bufs, var_vals if var_vals is not None else {}, wait=wait or DEBUG >= 2)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/evilsocket/tinygrad/tinygrad/engine/realize.py", line 116, in __call__
self.copy(dest, src)
File "/Users/evilsocket/tinygrad/tinygrad/engine/realize.py", line 111, in copy
dest.copyin(src.as_buffer(allow_zero_copy=True)) # may allocate a CPU buffer depending on allow_zero_copy
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/evilsocket/tinygrad/tinygrad/device.py", line 110, in as_buffer
if (force_zero_copy or allow_zero_copy) and hasattr(self.allocator, 'as_buffer'): return self.allocator.as_buffer(self._buf)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/evilsocket/tinygrad/tinygrad/runtime/ops_metal.py", line 91, in as_buffer
self.device.synchronize()
File "/Users/evilsocket/tinygrad/tinygrad/runtime/ops_metal.py", line 108, in synchronize
for cbuf in self.mtl_buffers_in_flight: wait_check(cbuf)
^^^^^^^^^^^^^^^^
File "/Users/evilsocket/tinygrad/tinygrad/runtime/ops_metal.py", line 12, in wait_check
raise RuntimeError(error)
RuntimeError: Error Domain=MTLCommandBufferErrorDomain Code=1 "Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim)" UserInfo={NSLocalizedDescription=Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim), NSUnderlyingError=0x600000c211d0 {Error Domain=IOGPUCommandQueueErrorDomain Code=5 "(null)"}}
LAZYCACHE=0 does not fix it.
JIT=0 does.
JIT=2 should work. It only happens on M1 and M2, not M3, and we are not entirely sure why.
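For anyone who wants to compare the JIT settings mentioned in this thread side by side, here is a minimal sketch of a launcher that sets the JIT environment variable for a child process. The helper names jit_env and run_with_jit are hypothetical and not part of tinygrad; the only thing taken from this thread is that the example script reads JIT from the environment.

```python
import os
import subprocess
import sys


def jit_env(jit_value, base_env=None):
    """Return a copy of the environment with JIT set to jit_value.

    base_env defaults to the current process environment; it is never mutated.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["JIT"] = str(jit_value)
    return env


def run_with_jit(script, jit_value, python=sys.executable):
    """Run `script` in a child process with the given JIT value; return its exit code."""
    return subprocess.run([python, script], env=jit_env(jit_value)).returncode


# Hypothetical usage from the root of a tinygrad checkout:
#   run_with_jit("examples/beautiful_mnist.py", 0)  # reported in this thread to avoid the error
#   run_with_jit("examples/beautiful_mnist.py", 2)  # the setting suggested as a workaround
```

This should be equivalent to prefixing the command with JIT=0 or JIT=2 in the shell. Wrapping it in a function just makes it easier to script several runs.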
@chenyuxyz That works, but if my understanding is correct, it skips apply_graph_to_jit in engine/jit.py, doesn't it? Is that equivalent to disabling JIT? I'm trying to RTFM, but the JIT env var doesn't seem to be documented much in /doc/, unless I've missed something.
I'll gladly help debug and possibly fix this once I understand how that env var influences the execution flow.
@evilsocket How often do you hit this on your machine?
@nimlgen on every run
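To put a number on "how often", a small harness along these lines could run the example repeatedly and count the crashes. This is only a sketch: count_failures and example_cmd are hypothetical helpers, and matching on the string MTLCommandBufferErrorDomain assumes the error text looks like the traceback above.

```python
import subprocess
import sys

# Substring taken from the RuntimeError message in the traceback above.
ERROR_MARKER = "MTLCommandBufferErrorDomain"


def example_cmd(script="examples/beautiful_mnist.py"):
    """Command line for the example script (path assumes a tinygrad checkout)."""
    return [sys.executable, script]


def count_failures(cmd, runs, env=None, marker=ERROR_MARKER):
    """Run cmd `runs` times.

    Returns (crashes, marked): how many runs exited non-zero, and how many
    of those also had `marker` in their stderr.
    """
    crashes = 0
    marked = 0
    for _ in range(runs):
        proc = subprocess.run(cmd, env=env, capture_output=True, text=True)
        if proc.returncode != 0:
            crashes += 1
            if marker in proc.stderr:
                marked += 1
    return crashes, marked


# Hypothetical usage:
#   crashes, marked = count_failures(example_cmd(), runs=5)
#   print(f"{crashes}/5 runs crashed, {marked} with the Metal command buffer error")
```

Passing a modified environment through env (for example with JIT set to a different value) would let the same harness compare failure counts across settings.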