SymbioticLab / Salus

Fine-grained GPU sharing primitives

TestMnistConv.test_conv produces wrong number

Aetf opened this issue

commented

While the CPU version produces a consistent accuracy number after 50 iterations, our RPC version generates a different number.

After the fixes for GPU landed, the GPU version shows the same behavior.

Steps to reproduce

  1. launch executor
  2. python test_mnist_tf.py TestMnistConv.test_conv

Expected result

Test passes

Actual

The generated accuracy does not equal the one produced by the CPU version in TF.

Traceback (most recent call last):
  File "test_mnist_tf.py", line 129, in test_conv
    self.assertEquals(actual, expected)
AssertionError: 0.249 != 0.68349999
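
The assertion above compares two floating-point accuracies for exact equality. The sketch below uses a hypothetical helper, accuracies_match, and an assumed tolerance of three decimal places; neither comes from test_mnist_tf.py. It illustrates that the reported gap is far larger than rounding noise, so a tolerant comparison would still fail on these values.

```python
import unittest


def accuracies_match(actual, expected, places=3):
    """Return True when the two accuracies agree to `places` decimal places.

    Hypothetical helper for illustration; not part of test_mnist_tf.py.
    """
    return round(abs(actual - expected), places) == 0


class AccuracyComparisonSketch(unittest.TestCase):
    def test_reported_values_differ(self):
        # Values copied from the AssertionError in the traceback.
        actual, expected = 0.249, 0.68349999
        self.assertFalse(accuracies_match(actual, expected))

    def test_rounding_noise_is_tolerated(self):
        # assertAlmostEqual rounds the difference to `places` decimals.
        self.assertAlmostEqual(0.68349999, 0.6835, places=3)
```

Saved to a file, the sketch can be run with python -m unittest; both checks should pass, which suggests the RPC result is genuinely wrong rather than slightly off.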

Attached log: test_conv.tar.gz

I got a different log. It shows a segmentation fault.
test_conv.tar.gz

Perhaps we compiled the source in different ways.

commented

The log is not fully flushed when the process crashes. After the crash, you can run p logging::logger->flush() in gdb to flush it. I need to know exactly which op kernel was running when the crash happened.

Also, this looks like a different issue. Please open a new issue.
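
Why a crash can truncate a log: buffered writers hold recent messages in memory until a flush. The Python sketch below only illustrates that general behavior and is not Salus's logger (the gdb expression suggests a C++ one). The function name, the message text, and the 4096-byte buffer size are all assumptions made for the example.

```python
import os
import tempfile


def buffered_write_demo(message=b"running op kernel: hypothetical_op\n"):
    """Return file sizes seen on disk before and after an explicit flush.

    Illustrative only; the names and the message are made up.
    """
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # A 4096-byte buffer keeps the short message in memory.
        with open(path, "wb", buffering=4096) as log:
            log.write(message)
            size_before = os.path.getsize(path)  # message still buffered
            log.flush()                          # hand buffered bytes to the OS
            size_after = os.path.getsize(path)
        return size_before, size_after, len(message)
    finally:
        os.remove(path)
```

Under these assumptions, size_before should be 0 and size_after should equal the message length. A process that dies between the write and the flush would leave the last lines out of the file, which is why flushing manually from the debugger recovers them.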

The stack trace is identical to #14. Please use that issue to track the segfault problem.

commented

Fixed in ddc12ff