recp / cglm

📽 Highly Optimized 2D / 3D Graphics Math (glm) for C

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

Bug in glm_mat4_inv_sse2

deadwanderer opened this issue · comments

PREFACE: I completely lack both the linear algebra and SIMD skills to fix this; I merely note the issue's existence!

I was calculating the inverse transpose of the model matrix to properly transform normals for my lighting shader, and noted that the inverse transpose was wrong when I calculated on the CPU with cglm. I compared the results with calculating it in the shader (OpenGL) and in GLM, and found that the issue occurred in glm_mat4_inv, when using SSE (thus following the glm_mat4_inv_sse2 path.

When I manually commented out the SSE path and forced the non-SIMD path in glm_mat4_inv, the calculations came out correct, matching OpenGL shader and GLM results.

Here is the code I used to calculate a sample model matrix, and its inverse transpose, along with the code to print out resulting matrices at each step (NOTE: I've flipped/corrected? the mat4 printing (#288), so the matrices print out column-major, to match glm's output which I was comparing. I also increased precision to 6 decimal places, again to match glm's output):

mat4 model;
mat4 inverse;
mat4 transpose;

// Create an identity matrix for the model
glm_mat4_identity(model);
fprintf(stderr, "Identity matrix:\n");
glm_mat4_print(model, stderr);

// Translate the model matrix
glm_translate(model, (vec3){2.0f, 5.0f, -15.0f});
fprintf(stderr, "Translated matrix:\n");
glm_mat4_print(model, stderr);

// Apply a rotation to the model matrix in the x and y dimensions.
glm_rotate(model, glm_rad(45.0f), (vec3){1.0f, 0.5f, 0.0f});
fprintf(stderr, "Model matrix:\n");
glm_mat4_print(model, stderr);

// Invert the matrix, and store in inverse
glm_mat4_inv(model, inverse);
fprintf(stderr, "Inverse matrix:\n");
glm_mat4_print(inverse, stderr);

// Transpose the inverse matrix, and store in transpose
glm_mat4_transpose_to(inverse, transpose);
fprintf(stderr, "Transpose matrix:\n");
glm_mat4_print(transpose, stderr);

And here are the resulting values.

Identity matrix:
  |  1.000000  0.000000  0.000000  0.000000  |
  |  0.000000  1.000000  0.000000  0.000000  |
  |  0.000000  0.000000  1.000000  0.000000  |
  |  0.000000  0.000000  0.000000  1.000000  |
Translated matrix:
  |  1.000000  0.000000  0.000000   0.000000  |
  |  0.000000  1.000000  0.000000   0.000000  |
  |  0.000000  0.000000  1.000000   0.000000  |
  |  2.000000  5.000000 -15.000000   1.000000  |
Model matrix:
  |  0.941421  0.117157 -0.316228   0.000000  |
  |  0.117157  0.765685  0.632456   0.000000  |
  |  0.316228 -0.632456  0.707107   0.000000  |
  |  2.000000  5.000000 -15.000000   1.000000  |
Inverse matrix:
  |  0.967995  0.120464  0.325154  -0.000000  |
  | -0.120464 -0.787298  0.650308  -0.000000  |
  | -0.325154  0.650308  0.727066  -0.000000  |
  |  7.415617 -5.577195 -13.507222  -1.028227  |
Transpose matrix:
  |  0.967995 -0.120464  -0.325154  7.415617  |
  |  0.120464 -0.787298   0.650308 -5.577195  |
  |  0.325154  0.650308   0.727066 -13.507222  |
  | -0.000000 -0.000000  -0.000000 -1.028227  |

When I forced the non-SIMD path, I got the following results:

Identity matrix:
  |  1.000000  0.000000  0.000000  0.000000  |
  |  0.000000  1.000000  0.000000  0.000000  |
  |  0.000000  0.000000  1.000000  0.000000  |
  |  0.000000  0.000000  0.000000  1.000000  |
Translated matrix:
  |  1.000000  0.000000  0.000000   0.000000  |
  |  0.000000  1.000000  0.000000   0.000000  |
  |  0.000000  0.000000  1.000000   0.000000  |
  |  2.000000  5.000000 -15.000000   1.000000  |
Model matrix:
  |  0.941421  0.117157 -0.316228   0.000000  |
  |  0.117157  0.765685  0.632456   0.000000  |
  |  0.316228 -0.632456  0.707107   0.000000  |
  |  2.000000  5.000000 -15.000000   1.000000  |
Inverse matrix:
  |  0.941421  0.117157  0.316228  -0.000000  |
  |  0.117157  0.765685 -0.632456   0.000000  |
  | -0.316228  0.632456  0.707107  -0.000000  |
  | -7.212046  5.424091  13.136425   1.000000  |
Transpose matrix:
  |  0.941421  0.117157  -0.316228 -7.212046  |
  |  0.117157  0.765685   0.632456  5.424091  |
  |  0.316228 -0.632456   0.707107  13.136425  |
  | -0.000000  0.000000  -0.000000  1.000000  |

For comparison, here's the glm code I used:

// Create an identity matrix for the model.
glm::mat4 model = glm::mat4(1.0f);
fprintf(stderr, "Identity matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(model).c_str());

// Translate the matrix
model = glm::translate(model, glm::vec3(2.0f, 5.0f, -15.0f));
fprintf(stderr, "Translated matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(model).c_str());

// Rotate the matrix along the x and y axes
model = glm::rotate(model, glm::radians(45.0f), glm::vec3(1.0f, 0.5f, 0.0f));
fprintf(stderr, "Model matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(model).c_str());

// Invert the model matrix
glm::mat4 inverse = glm::inverse(model);
fprintf(stderr, "Inverse matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(inverse).c_str());

// Transpose the inverted matrix.
glm::mat4 transpose = glm::transpose(inverse);
fprintf(stderr, "Transpose matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(transpose).c_str());

And here are the glm results (which match the non-SIMD path of glm_mat4_inv, slight formatting differences notwithstanding):

Identity matrix:
  |  1.000000, 0.000000, 0.000000, 0.000000 |
  |  0.000000, 1.000000, 0.000000, 0.000000 |
  |  0.000000, 0.000000, 1.000000, 0.000000 |
  |  0.000000, 0.000000, 0.000000, 1.000000 |
Translated matrix:
  |  1.000000, 0.000000, 0.000000, 0.000000 |
  |  0.000000, 1.000000, 0.000000, 0.000000 |
  |  0.000000, 0.000000, 1.000000, 0.000000 |
  |  2.000000, 5.000000, -15.000000, 1.000000 |
Model matrix:
  |  0.941421, 0.117157, -0.316228, 0.000000 |
  |  0.117157, 0.765685, 0.632456, 0.000000 |
  |  0.316228, -0.632456, 0.707107, 0.000000 |
  |  2.000000, 5.000000, -15.000000, 1.000000 |
Inverse matrix:
  |  0.941421, 0.117157, 0.316228, -0.000000 |
  |  0.117157, 0.765685, -0.632456, 0.000000 |
  | -0.316228, 0.632456, 0.707107, -0.000000 |
  | -7.212046, 5.424091, 13.136425, 1.000000 |
Transpose matrix:
  |  0.941421, 0.117157, -0.316228, -7.212046 |
  |  0.117157, 0.765685, 0.632456, 5.424091 |
  |  0.316228, -0.632456, 0.707107, 13.136425 |
  | -0.000000, 0.000000, -0.000000, 1.000000 |

So pretty clearly, something goes wonky and wrong in the SSE path for glm_mat4_inv.

Hi @deadwanderer

Thanks for the feedbacks,

I'll take a look at asap, many thanks, if anyone could find the issue before me a PR would also be helpful

This is ARM64 output (macOS M1 Max and Virtual Windows11_ARM MSVC are similar):

Screenshot 2023-04-08 at 12 32 00 AM

Identity matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000  0.00000  |
  |  0.00000  1.00000  0.00000  0.00000  |
  |  0.00000  0.00000  1.00000  0.00000  |
  |  0.00000  0.00000  0.00000  1.00000  |

Translated matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000   2.00000  |
  |  0.00000  1.00000  0.00000   5.00000  |
  |  0.00000  0.00000  1.00000 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Model matrix:
Matrix (float4x4):
  |  0.94142  0.11716  0.31623   2.00000  |
  |  0.11716  0.76569 -0.63246   5.00000  |
  | -0.31623  0.63246  0.70711 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Inverse matrix:
Matrix (float4x4):
  |  0.94142  0.11716 -0.31623  -7.21205  |
  |  0.11716  0.76569  0.63246   5.42409  |
  |  0.31623 -0.63246  0.70711  13.13643  |
  | -0.00000  0.00000 -0.00000   1.00000  |

Transpose matrix:
Matrix (float4x4):
  |  0.94142  0.11716   0.31623 -0.00000  |
  |  0.11716  0.76569  -0.63246  0.00000  |
  | -0.31623  0.63246   0.70711 -0.00000  |
  | -7.21205  5.42409  13.13643  1.00000  |

This is x86 output :

Screenshot 2023-04-08 at 12 31 51 AM

(Virtual Windows11_ARM MSVC)
Identity matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000  0.00000  |
  |  0.00000  1.00000  0.00000  0.00000  |
  |  0.00000  0.00000  1.00000  0.00000  |
  |  0.00000  0.00000  0.00000  1.00000  |

Translated matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000   2.00000  |
  |  0.00000  1.00000  0.00000   5.00000  |
  |  0.00000  0.00000  1.00000 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Model matrix:
Matrix (float4x4):
  |  0.94142  0.11716  0.31623   2.00000  |
  |  0.11716  0.76569 -0.63246   5.00000  |
  | -0.31623  0.63246  0.70711 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Inverse matrix:
Matrix (float4x4):
  |  0.94142  0.11716 -0.31623  -7.21205  |
  |  0.11716  0.76569  0.63246   5.42409  |
  |  0.31623 -0.63246  0.70711  13.13643  |
  | -0.00000  0.00000 -0.00000   1.00000  |

Transpose matrix:
Matrix (float4x4):
  |  0.94142  0.11716   0.31623 -0.00000  |
  |  0.11716  0.76569  -0.63246  0.00000  |
  | -0.31623  0.63246   0.70711 -0.00000  |
  | -7.21205  5.42409  13.13643  1.00000  |

I got similar results. I'll try to test on my Intel Mac later. What's your environment may I ask? Also do you have latest cglm ?

glm_arch_print_name(stderr || stdout) which is added with glm_arch_print_name could be helpful to print selected simd path in general. glm_arch_print_name can be called at top to see what is the arch...

Thanks for the response. Using my example code again, against a fresh pull of the cglm repository, and calling the glm_arch_print_name function at the top, gave the following results:

cglm arch: x86 SSE*

Identity matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000  0.00000  |
  |  0.00000  1.00000  0.00000  0.00000  |
  |  0.00000  0.00000  1.00000  0.00000  |
  |  0.00000  0.00000  0.00000  1.00000  |

Translated matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000   2.00000  |
  |  0.00000  1.00000  0.00000   5.00000  |
  |  0.00000  0.00000  1.00000 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Model matrix:
Matrix (float4x4):
  |  0.94142  0.11716  0.31623   2.00000  |
  |  0.11716  0.76569 -0.63246   5.00000  |
  | -0.31623  0.63246  0.70711 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Inverse matrix:
Matrix (float4x4):
  |  0.96799 -0.12046 -0.32515   7.41562  |
  |  0.12046 -0.78730  0.65031  -5.57719  |
  |  0.32515  0.65031  0.72707 -13.50722  |
  | -0.00000 -0.00000 -0.00000  -1.02823  |

Transpose matrix:
Matrix (float4x4):
  |  0.96799  0.12046   0.32515 -0.00000  |
  | -0.12046 -0.78730   0.65031 -0.00000  |
  | -0.32515  0.65031   0.72707 -0.00000  |
  |  7.41562 -5.57719 -13.50722 -1.02823  |

Oh, and I'm on Windows 10 desktop.

I suspect the compiler is converting mixed +/-0.0f constants like this, into -0.0f, -0.0f, -0.0f, -0.0f or 0.0f, 0.0f, 0.0f, 0.0f. Are you compiling with /fp:fast by chance?

That was it! I removed /fp:fast, and am now getting identical results on SSE and non-SSE paths.

Thank you, @recp and @gottfriedleibniz for all the troubleshooting and help!

Ideally, functions which mix +/-0.0f should be robust enough against fp:fast. Especially when they still work with ffast-math on gcc/clang.

Given that the variable is used for extracting the sign of a float. I wonder if replacing those constants with its hex equivalent 0x80000000u, 0x0u, ... (then casting _mm_castsi128_ps) would be a more robust alternative. Otherwise, MSVC pragmas are an option.

All uses of -0.0f should probably be looked at, I guess. Maybe.

@deadwanderer thanks, @gottfriedleibniz good catch, many thanks!

I used /fp:fast but got similar results (SSE2 path on virtual Windows11 arm insider, VS Community 2022 (ARM 64-bit) - 17.6.0 Preview 2.0), not sure how to reproduce the issue 🤷‍♂️

Screenshot 2023-04-09 at 12 43 22 AM

Given that the variable is used for extracting the sign of a float. I wonder if replacing those constants with its hex equivalent 0x80000000u, 0x0u, ... (then casting _mm_castsi128_ps) would be a more robust alternative. Otherwise, MSVC pragmas are an option.

+1 for this. It is better to make it work on compilers including msvc :)

Otherwise, MSVC pragmas are an option.

Which pragmas may I ask? How to make it better with msvc pragmas on msvc? A PR to make improvements as you mentioned would also be awesome :)

Finally had time to examine this, fp:fast is folding both expressions into a single constant:

x8 = _mm_set_ps(-0.f, 0.f, -0.f, 0.f);
x9 = glmm_shuff1(x8, 2, 1, 2, 1);

Resulting in with fp:fast:

movaps  xmm15, XMMWORD PTR __xmm@80000000000000008000000000000000
// ...
xorps   xmm4, xmm15
// ...
xorps   xmm13, xmm15
// ...
xorps   xmm14, xmm15
// ...
xorps   xmm12, xmm15

Without fp:fast:

shufps  xmm0, xmm10, 153                    ; 00000099H
movaps  xmm11, xmm4
movaps  XMMWORD PTR x9$1$[rsp], xmm0
// ...
xorps   xmm4, xmm10
// ...
xorps   xmm14, xmm10
// ...
xorps   xmm15, XMMWORD PTR x9$1$[rsp]
// ...
xorps   xmm13, XMMWORD PTR x9$1$[rsp]

Modifying the function to use _mm_castsi128_ps(_mm_set_epi32(0x80000000, 0, 0x80000000, 0)) seems to solve this. Although, there may be some superior alternative that I am unaware of (and other cases, esp. after inlining).

Which pragmas may I ask?

For reference: pragma(float_control(precise, on, push)) + __pragma(float_control(pop)). Although, I probably would not go this route.

@gottfriedleibniz many thanks

Modifying the function to use _mm_castsi128_ps(_mm_set_epi32(0x80000000, 0, 0x80000000, 0)) seems to solve this.

any chance to get a PR? or I'll do it asap

For reference: pragma(float_control(precise, on, push)) + __pragma(float_control(pop)). Although, I probably would not go this route.

thanks

Finally had time to examine this, fp:fast is folding both expressions into a single constant:

x8 = _mm_set_ps(-0.f, 0.f, -0.f, 0.f);
x9 = glmm_shuff1(x8, 2, 1, 2, 1);

also to avoid similar scenarios, any idea why fp:fast option doing this since both x8 and x9 are used in different places?

any chance to get a PR? or I'll do it asap

You can.

any idea why fp:fast option doing this since both x8 and x9 are used in different places?

I suppose some byproduct of subexpression elimination with fp:fast not having to honour signed zeros. I have not looked into this any further.

@gottfriedleibniz sure, I'll asap, thanks for the infos

@gottfriedleibniz I created a PR for this, it would be nice to get a review: #291

@deadwanderer is it possible to test that PR with /fp:fast to validate that is working as expected (previous tests)?

thanks

Here are my results using that PR's changes:

cglm arch: x86 SSE*

Identity matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000  0.00000  |
  |  0.00000  1.00000  0.00000  0.00000  |
  |  0.00000  0.00000  1.00000  0.00000  |
  |  0.00000  0.00000  0.00000  1.00000  |

Translated matrix:
Matrix (float4x4):
  |  1.00000  0.00000  0.00000   2.00000  |
  |  0.00000  1.00000  0.00000   5.00000  |
  |  0.00000  0.00000  1.00000 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Model matrix:
Matrix (float4x4):
  |  0.94142  0.11716  0.31623   2.00000  |
  |  0.11716  0.76569 -0.63246   5.00000  |
  | -0.31623  0.63246  0.70711 -15.00000  |
  |  0.00000  0.00000  0.00000   1.00000  |

Inverse matrix:
Matrix (float4x4):
  |  0.94142  0.11716 -0.31623  -7.21205  |
  |  0.11716  0.76569  0.63246   5.42409  |
  |  0.31623 -0.63246  0.70711  13.13643  |
  | -0.00000  0.00000 -0.00000   1.00000  |

Transpose matrix:
Matrix (float4x4):
  |  0.94142  0.11716   0.31623 -0.00000  |
  |  0.11716  0.76569  -0.63246  0.00000  |
  | -0.31623  0.63246   0.70711 -0.00000  |
  | -7.21205  5.42409  13.13643  1.00000  |

Looks good to me! Thanks so much, @recp and @gottfriedleibniz!

Seems good. Passes my tests.

@deadwanderer @gottfriedleibniz many thanks, that PR is merged.