TestMnistConv.test_conv produces wrong number

Question

Aetf opened this issue 7 years ago

~~While the CPU version produces a consistent accuracy number after 50 iterations, our RPC version generates a different number.~~

After the GPU fixes landed, the GPU version shows the same behavior.

Steps to reproduce

Launch the executor
python test_mnist_tf.py TestMnistConv.test_conv

Expected result

Test passes

Actual

The generated accuracy does not equal the one generated by the CPU version in TF.

Traceback (most recent call last):
  File "test_mnist_tf.py", line 129, in test_conv
    self.assertEquals(actual, expected)
AssertionError: 0.249 != 0.68349999
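
To rule out float rounding as the cause, the two values from the traceback can be compared with a tolerance instead of exact equality. This is only a minimal sketch: the helper name `accuracies_match` and the `1e-3` tolerance are assumptions for illustration and are not part of `test_mnist_tf.py`.

```python
import math


def accuracies_match(actual, expected, abs_tol=1e-3):
    """Return True when two accuracy values agree within abs_tol.

    Hypothetical helper: the original test uses exact equality.
    """
    return math.isclose(actual, expected, rel_tol=0.0, abs_tol=abs_tol)


# Values copied from the traceback: the gap is far larger than rounding noise.
print(accuracies_match(0.249, 0.68349999))   # False
# A value that differs only by rounding would pass.
print(accuracies_match(0.6835, 0.68349999))  # True
```

Even with this tolerance the reported values do not match, so the mismatch is a real difference in results rather than a floating-point comparison artifact.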

Attached log: test_conv.tar.gz

Tan N. Le · Answer 1 · Tue Jul 11 2017 08:00:04 GMT+0800 (China Standard Time)

I get a different log. It shows a segmentation fault.
test_conv.tar.gz

Perhaps we compiled the source differently.

Aetf · Answer 2 · Thu Jul 20 2017 03:28:58 GMT+0800 (China Standard Time)

The log is not fully flushed when the process crashes. After the crash, you can run p logging::logger->flush() in gdb to flush it. I need to know exactly which op kernel was running when the crash happened.

Also, this looks like a different issue. ~~Please open a new issue.~~

The stack trace is identical to #14. Please use that issue to track the segfault problem.

Aetf · Answer 3 · Wed Aug 09 2017 15:48:44 GMT+0800 (China Standard Time)

Fixed in ddc12ff