When ub_overlap_rs_dgrad is set to True, the error "Caught signal 8 (Floating point exception: integer divide by zero)" is raised.

Question

JJGSBGQ opened this issue 2 months ago

Setting ub_overlap_rs_dgrad to True in Megatron-LM raises the error "Caught signal 8 (Floating point exception: integer divide by zero)". The cause was eventually traced to a problem in the tex.gemm computation during the backward pass.

JJGSBGQ commented 2 months ago

@minitu