fmod producing the wrong results on the GPU

Question

fmod producing the wrong results on the GPU

timgrant opened this issue 3 years ago · comments

Problem

We have received reports of fmod giving the wrong answer on the GPU when the denominator is negative. I was able to confirm this behavior in testshade using a simple shader, given below.

Interestingly, fmod does produce the correct result when the argument is a local variable or a non-varying parameter, but not with varying parameters.

Steps to Reproduce

shader test (float a = -0.666667 [[ int lockgeom=0 ]],
             float b = -0.666667 [[ int lockgeom=1 ]],
             output color Cout = 0)
{
    float denom = 0.5;
    printf("fmod(%f, %f) = %f\n", a, denom, fmod(a, denom));
    printf("fmod(%f, %f) = %f\n", b, denom, fmod(b, denom));
}

$ testshade --optix -g 1 1 test
fmod(-0.666667, 0.500000) = 0.333333
fmod(-0.666667, 0.500000) = -0.166667

$ testshade -g 1 1 test
fmod(-0.666667, 0.500000) = -0.166667
fmod(-0.666667, 0.500000) = -0.166667

Versions

OSL branch/version: master (b0ed183)
OS: Ubuntu 20.04
C++ compiler: clang 12.0.0
LLVM version: LLVM 12.0.0
OIIO version: master (a07efab5)

Larry Gritz · Answer 1 · Fri Jun 25 2021 07:56:15 GMT+0800 (China Standard Time)

fmod(-0.666667, 0.500000) = 0.333333
fmod(-0.666667, 0.500000) = -0.166667

Whoa, it's different with the same input values and only differing in lockgeom?

timgrant · Answer 2 · Fri Jun 25 2021 08:30:19 GMT+0800 (China Standard Time)

It looks like the values are getting constant folded in the non-varying case, based on the PTX. Chaging the optimization level for testshade to O1 produces the correct results, but it also changes the remainder calculation significantly. I'm not quite sure what to make of that. Peeking at the LLVM IR might be illuminating.

Larry Gritz · Answer 3 · Fri Jun 25 2021 08:39:14 GMT+0800 (China Standard Time)

It means we're doing the calculation differently for the constant-folded case versus the runtime case.

timgrant · Answer 4 · Fri Jun 25 2021 09:55:11 GMT+0800 (China Standard Time)

The LLVM IR is pretty straightforward, especially for O2 :

  ;; O2
  %23 = load float, float* %20, align 4
  %24 = frem float %23, 5.000000e-01

  ;; O1
  %27 = load float, float* %20, align 4
  %28 = fmul fast float %27, 2.000000e+00; -0.666667 * 2.0    = -1.333334
  %29 = fptosi float %28 to i32;           sint(-1.333334)    = -1
  %30 = sitofp i32 %29 to float;           float(-1)          = -1.0
  %31 = fmul fast float %30, 5.000000e-01; -1.0 * 0.5         = -0.5
  %32 = fsub fast float %27, %31;          -0.666667 - (-0.5) = -0.16667

If I'm reading the PTX correctly, this is the remainder calculation from the O2 case:

  ;; %f8 holds the value of float a (-0.666667)
  div.rn.f32 %f4, %f8, 0f3F000000; -0.666667 / 0.5    = -1.333334
  cvt.rmi.f32.f32 %f5, %f4;        floor( -1.333334 ) = -2.0
  mul.f32 %f6, %f5, 0f3F000000;    2.0 * 0.5          = -1.0;
  sub.f32 %f7, %f8, %f6;           -0.666667 - (-1.0) = 0.333333

The key difference is in the rounding operation. cvt.rmi rounds to the nearest integer, towards negative infinity (from the PTX ISA).

I guess this is the purview of the NVPTX backend. It's possible that using different options in LLVM_Util::ptx_compile_group() can produce the desired behavior, but I'm not sure at this point.

FWIW, I tested fmod in a simple CUDA program and it did the right thing.

timgrant · Answer 5 · Fri Jun 25 2021 23:43:44 GMT+0800 (China Standard Time)

It should come as no surprise, but the PTX instructions follow the pattern defined in the NVPTXInstrInfo.td. Given that, I don't think we can leave it up to NVPTX to handle frem lowering. I suppose we want something more like the O1 case?

Christopher Kulla · Answer 6 · Sat Jun 26 2021 01:13:27 GMT+0800 (China Standard Time)

Is it possible to submit a fix to the NVPTX backend in LLVM? Am I understanding correctly that LLVM has its own IR level simplification of fmod that is competing with the one from the NVPTX backend? In that case - is there a way to force LLVM to always do what it does in the O1 case?

timgrant · Answer 7 · Sat Jun 26 2021 06:53:12 GMT+0800 (China Standard Time)

Chris, I think your understanding is correct. It would be great to patch this in NVPTX, but I couldn't begin to put a time frame on that.

In the O1 case, we get the right behavior on the GPU because it forces use of safe_fmod. In terms of edit distance, this might be the minimal patch to achieve that result. But it's worth finding out if there is an NVPTX option that affects this if it can help us avoid a special case.

diff --git a/src/liboslexec/llvm_gen.cpp b/src/liboslexec/llvm_gen.cpp
index a93faeba..95d414f4 100644
--- a/src/liboslexec/llvm_gen.cpp
+++ b/src/liboslexec/llvm_gen.cpp
@@ -789,7 +789,7 @@ LLVMGEN (llvm_gen_modulus)
         if (!a || !b)
             return false;
         llvm::Value *r;
-        if (B.is_constant() && ! rop.is_zero(B))
+        if (! rop.use_optix() && B.is_constant() && ! rop.is_zero(B))
             r = rop.ll.op_mod (a, b);
         else
             r = rop.ll.call_function (safe_mod, a, b);

Larry Gritz · Answer 8 · Sat Jun 26 2021 07:00:55 GMT+0800 (China Standard Time)

Even if we get a fix in NVPTX in the very next major version (13), it will probably be years before that's the minimum required LLVM version for OSL. So we need a workaround no matter what.

I'm fine with this suggested workaround, unless there is already some NVPTX option that does what we need.