SymbioticLab / Salus

Fine-grained GPU sharing primitives

MNIST training hangs in ApplyAdam kernel

Aetf opened this issue

commented

This happens regardless of whether the executor is using the GPU.

Steps to reproduce

  1. Run the executor with EXEC_SCHED_USE_GPU=1 or EXEC_SCHED_USE_GPU=0
  2. Run the test: pytest test_mnist_tf.py

Expected

Test passes

Actual

The executor blocks waiting for a kernel to finish. Meanwhile, GPU utilization is zero.
The hang always occurs in the ApplyAdam operation.

Logs:
GPU: exec.output.zip
CPU: exec.output.zip

commented

The bug was introduced in f1b9331, found with git bisect.
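
Because the failure is a hang rather than a crash, a bisect like this is easier to automate when the test command is wrapped in a timeout. The following is a minimal sketch of such a wrapper, not the script actually used here. The names `classify`, `main`, and `HANG_TIMEOUT_S` and the 300-second limit are assumptions made for illustration, and it assumes the executor has already been built and started for the commit under test.

```python
import subprocess
import sys

# Assumption: a healthy test run finishes well within this many seconds.
HANG_TIMEOUT_S = 300


def classify(cmd, timeout_s=HANG_TIMEOUT_S):
    """Return 0 if cmd exits successfully, 1 if it fails or hangs."""
    try:
        # subprocess.run kills the child and raises TimeoutExpired
        # when the timeout elapses, so a hang cannot block the bisect.
        result = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return 1
    return 0 if result.returncode == 0 else 1


def main(argv):
    # Default to the test from the reproduction steps.
    cmd = argv or ["pytest", "test_mnist_tf.py"]
    return classify(cmd)

# To use this file as a script, end it with:
#     sys.exit(main(sys.argv[1:]))
```

Saved as a hypothetical `bisect_check.py` with that final line added, it could be driven by `git bisect start <bad> <good>` followed by `git bisect run python bisect_check.py`, since git bisect treats exit code 0 as good and 1 as bad. Rebuilding and restarting the executor at each step is not covered by this sketch and would need to be handled separately.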

commented

Fixed in 61fc138