tinygrad / tinygrad

You like pytorch? You like micrograd? You love tinygrad! ❤️

beautiful_mnist.py: RuntimeError: Error Domain=MTLCommandBufferErrorDomain Code=1 "Discarded (victim of GPU error/recovery)"

evilsocket opened this issue

Related to #3857 ... I'm getting this even after rebooting, quite systematically despite what's reported in #3857.

Using 6ec7dbc@master (but it happens also with the latest packaged version), macOS 14.5, Apple M1 Max on a MacBook Pro 2021:

cd tinygrad && python3.11 -m pip install -e . && python3.11 examples/beautiful_mnist.py

...
...

loss:   1.14 test_accuracy:   nan%:   4%|█████                                                                                                                 | 3/70 [00:01<00:42,  1.57it/s]
Traceback (most recent call last):
  File "/Users/evilsocket/tinygrad/examples/beautiful_mnist.py", line 44, in <module>
    t.set_description(f"loss: {loss.item():6.2f} test_accuracy: {test_acc:5.2f}%")
                               ^^^^^^^^^^^
  File "/Users/evilsocket/tinygrad/tinygrad/tensor.py", line 3122, in _wrapper
    ret = fn(*args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^
  File "/Users/evilsocket/tinygrad/tinygrad/tensor.py", line 273, in item
    return self._data().cast(self.dtype.fmt)[0]
           ^^^^^^^^^^^^
  File "/Users/evilsocket/tinygrad/tinygrad/tensor.py", line 3101, in _wrapper
    if _METADATA.get() is not None: return fn(*args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^
  File "/Users/evilsocket/tinygrad/tinygrad/tensor.py", line 244, in _data
    cpu = self.cast(self.dtype.scalar()).contiguous().to("CLANG").realize()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/evilsocket/tinygrad/tinygrad/tensor.py", line 3101, in _wrapper
    if _METADATA.get() is not None: return fn(*args, **kwargs)
                                           ^^^^^^^^^^^^^^^^^^^
  File "/Users/evilsocket/tinygrad/tinygrad/tensor.py", line 203, in realize
    run_schedule(*self.schedule_with_vars(*lst), do_update_stats=do_update_stats)
  File "/Users/evilsocket/tinygrad/tinygrad/engine/realize.py", line 202, in run_schedule
    ei.run(var_vals, do_update_stats=do_update_stats)
  File "/Users/evilsocket/tinygrad/tinygrad/engine/realize.py", line 154, in run
    et = self.prg(bufs, var_vals if var_vals is not None else {}, wait=wait or DEBUG >= 2)
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/evilsocket/tinygrad/tinygrad/engine/realize.py", line 116, in __call__
    self.copy(dest, src)
  File "/Users/evilsocket/tinygrad/tinygrad/engine/realize.py", line 111, in copy
    dest.copyin(src.as_buffer(allow_zero_copy=True))  # may allocate a CPU buffer depending on allow_zero_copy
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/evilsocket/tinygrad/tinygrad/device.py", line 110, in as_buffer
    if (force_zero_copy or allow_zero_copy) and hasattr(self.allocator, 'as_buffer'): return self.allocator.as_buffer(self._buf)
                                                                                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/evilsocket/tinygrad/tinygrad/runtime/ops_metal.py", line 91, in as_buffer
    self.device.synchronize()
  File "/Users/evilsocket/tinygrad/tinygrad/runtime/ops_metal.py", line 108, in synchronize
    for cbuf in self.mtl_buffers_in_flight: wait_check(cbuf)
                                            ^^^^^^^^^^^^^^^^
  File "/Users/evilsocket/tinygrad/tinygrad/runtime/ops_metal.py", line 12, in wait_check
    raise RuntimeError(error)
RuntimeError: Error Domain=MTLCommandBufferErrorDomain Code=1 "Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim)" UserInfo={NSLocalizedDescription=Discarded (victim of GPU error/recovery) (00000005:kIOGPUCommandBufferCallbackErrorInnocentVictim), NSUnderlyingError=0x600000c211d0 {Error Domain=IOGPUCommandQueueErrorDomain Code=5 "(null)"}}

LAZYCACHE=0 does not fix it.
JIT=0 does.

JIT=2 should work. This only happens on M1 and M2, not M3, and we are not entirely sure why.
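
To compare the JIT settings mentioned in this thread side by side, a small launcher along these lines may help. It is only a sketch: `env_with_jit` and `run_with_jit` are hypothetical helper names, not part of tinygrad. The one thing it takes from the thread is that tinygrad reads a `JIT` environment variable. It makes no claim about what each value does internally.

```python
import os
import subprocess
import sys


def env_with_jit(jit_value, base_env=None):
    """Return a copy of the environment with JIT set to jit_value.

    Hypothetical helper: base_env defaults to the current process environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["JIT"] = str(jit_value)
    return env


def run_with_jit(cmd, jit_value):
    """Run cmd in a child process with JIT=<jit_value>; return its exit code."""
    return subprocess.run(cmd, env=env_with_jit(jit_value)).returncode
```

For example, `run_with_jit([sys.executable, "examples/beautiful_mnist.py"], 2)` corresponds to running `JIT=2 python3.11 examples/beautiful_mnist.py` from the shell. A non-zero return value means the script exited with an error, such as the uncaught RuntimeError above.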

@chenyuxyz that works, but if my understanding is correct, that will skip apply_graph_to_jit in engine/jit.py, won't it? Is that equivalent to disabling JIT? I'm trying to RTFM, but the JIT env var doesn't seem to be documented much in /doc/, unless I've missed something.

I'll gladly help debug and possibly fix this once I understand how that env var influences the execution flow.

@evilsocket How often do you hit this on your machine?