Bug in glm_mat4_inv_sse2
deadwanderer opened this issue · comments
PREFACE: I completely lack both the linear algebra and SIMD skills to fix this; I merely note the issue's existence!
I was calculating the inverse transpose of the model matrix to properly transform normals for my lighting shader, and noticed that the inverse transpose was wrong when calculated on the CPU with cglm. I compared the results against calculating it in the shader (OpenGL) and in GLM, and found that the issue occurs in glm_mat4_inv when using SSE (thus following the glm_mat4_inv_sse2 path).
When I manually commented out the SSE path and forced the non-SIMD path in glm_mat4_inv, the calculations came out correct, matching the OpenGL shader and GLM results.
Here is the code I used to build a sample model matrix and its inverse transpose, along with the code to print the resulting matrices at each step. (NOTE: I've flipped/corrected? the mat4 printing (#288) so the matrices print column-major, matching the GLM output I was comparing against. I also increased precision to 6 decimal places, again to match GLM's output.)
mat4 model;
mat4 inverse;
mat4 transpose;
// Create an identity matrix for the model
glm_mat4_identity(model);
fprintf(stderr, "Identity matrix:\n");
glm_mat4_print(model, stderr);
// Translate the model matrix
glm_translate(model, (vec3){2.0f, 5.0f, -15.0f});
fprintf(stderr, "Translated matrix:\n");
glm_mat4_print(model, stderr);
// Apply a rotation to the model matrix in the x and y dimensions.
glm_rotate(model, glm_rad(45.0f), (vec3){1.0f, 0.5f, 0.0f});
fprintf(stderr, "Model matrix:\n");
glm_mat4_print(model, stderr);
// Invert the matrix, and store in inverse
glm_mat4_inv(model, inverse);
fprintf(stderr, "Inverse matrix:\n");
glm_mat4_print(inverse, stderr);
// Transpose the inverse matrix, and store in transpose
glm_mat4_transpose_to(inverse, transpose);
fprintf(stderr, "Transpose matrix:\n");
glm_mat4_print(transpose, stderr);
And here are the resulting values.
Identity matrix:
| 1.000000 0.000000 0.000000 0.000000 |
| 0.000000 1.000000 0.000000 0.000000 |
| 0.000000 0.000000 1.000000 0.000000 |
| 0.000000 0.000000 0.000000 1.000000 |
Translated matrix:
| 1.000000 0.000000 0.000000 0.000000 |
| 0.000000 1.000000 0.000000 0.000000 |
| 0.000000 0.000000 1.000000 0.000000 |
| 2.000000 5.000000 -15.000000 1.000000 |
Model matrix:
| 0.941421 0.117157 -0.316228 0.000000 |
| 0.117157 0.765685 0.632456 0.000000 |
| 0.316228 -0.632456 0.707107 0.000000 |
| 2.000000 5.000000 -15.000000 1.000000 |
Inverse matrix:
| 0.967995 0.120464 0.325154 -0.000000 |
| -0.120464 -0.787298 0.650308 -0.000000 |
| -0.325154 0.650308 0.727066 -0.000000 |
| 7.415617 -5.577195 -13.507222 -1.028227 |
Transpose matrix:
| 0.967995 -0.120464 -0.325154 7.415617 |
| 0.120464 -0.787298 0.650308 -5.577195 |
| 0.325154 0.650308 0.727066 -13.507222 |
| -0.000000 -0.000000 -0.000000 -1.028227 |
When I forced the non-SIMD path, I got the following results:
Identity matrix:
| 1.000000 0.000000 0.000000 0.000000 |
| 0.000000 1.000000 0.000000 0.000000 |
| 0.000000 0.000000 1.000000 0.000000 |
| 0.000000 0.000000 0.000000 1.000000 |
Translated matrix:
| 1.000000 0.000000 0.000000 0.000000 |
| 0.000000 1.000000 0.000000 0.000000 |
| 0.000000 0.000000 1.000000 0.000000 |
| 2.000000 5.000000 -15.000000 1.000000 |
Model matrix:
| 0.941421 0.117157 -0.316228 0.000000 |
| 0.117157 0.765685 0.632456 0.000000 |
| 0.316228 -0.632456 0.707107 0.000000 |
| 2.000000 5.000000 -15.000000 1.000000 |
Inverse matrix:
| 0.941421 0.117157 0.316228 -0.000000 |
| 0.117157 0.765685 -0.632456 0.000000 |
| -0.316228 0.632456 0.707107 -0.000000 |
| -7.212046 5.424091 13.136425 1.000000 |
Transpose matrix:
| 0.941421 0.117157 -0.316228 -7.212046 |
| 0.117157 0.765685 0.632456 5.424091 |
| 0.316228 -0.632456 0.707107 13.136425 |
| -0.000000 0.000000 -0.000000 1.000000 |
For comparison, here's the glm code I used:
// Create an identity matrix for the model.
glm::mat4 model = glm::mat4(1.0f);
fprintf(stderr, "Identity matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(model).c_str());
// Translate the matrix
model = glm::translate(model, glm::vec3(2.0f, 5.0f, -15.0f));
fprintf(stderr, "Translated matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(model).c_str());
// Rotate the matrix along the x and y axes
model = glm::rotate(model, glm::radians(45.0f), glm::vec3(1.0f, 0.5f, 0.0f));
fprintf(stderr, "Model matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(model).c_str());
// Invert the model matrix
glm::mat4 inverse = glm::inverse(model);
fprintf(stderr, "Inverse matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(inverse).c_str());
// Transpose the inverted matrix.
glm::mat4 transpose = glm::transpose(inverse);
fprintf(stderr, "Transpose matrix:\n");
fprintf(stderr, "%s\n", glm::to_string(transpose).c_str());
And here are the GLM results (which match the non-SIMD path of glm_mat4_inv, slight formatting differences notwithstanding):
Identity matrix:
| 1.000000, 0.000000, 0.000000, 0.000000 |
| 0.000000, 1.000000, 0.000000, 0.000000 |
| 0.000000, 0.000000, 1.000000, 0.000000 |
| 0.000000, 0.000000, 0.000000, 1.000000 |
Translated matrix:
| 1.000000, 0.000000, 0.000000, 0.000000 |
| 0.000000, 1.000000, 0.000000, 0.000000 |
| 0.000000, 0.000000, 1.000000, 0.000000 |
| 2.000000, 5.000000, -15.000000, 1.000000 |
Model matrix:
| 0.941421, 0.117157, -0.316228, 0.000000 |
| 0.117157, 0.765685, 0.632456, 0.000000 |
| 0.316228, -0.632456, 0.707107, 0.000000 |
| 2.000000, 5.000000, -15.000000, 1.000000 |
Inverse matrix:
| 0.941421, 0.117157, 0.316228, -0.000000 |
| 0.117157, 0.765685, -0.632456, 0.000000 |
| -0.316228, 0.632456, 0.707107, -0.000000 |
| -7.212046, 5.424091, 13.136425, 1.000000 |
Transpose matrix:
| 0.941421, 0.117157, -0.316228, -7.212046 |
| 0.117157, 0.765685, 0.632456, 5.424091 |
| 0.316228, -0.632456, 0.707107, 13.136425 |
| -0.000000, 0.000000, -0.000000, 1.000000 |
So pretty clearly, something goes wonky and wrong in the SSE path for glm_mat4_inv.
Thanks for the feedback, I'll take a look ASAP, many thanks. If anyone could find the issue before me, a PR would also be helpful.
This is ARM64 output (macOS M1 Max; virtual Windows11_ARM MSVC is similar):
Identity matrix:
Matrix (float4x4):
| 1.00000 0.00000 0.00000 0.00000 |
| 0.00000 1.00000 0.00000 0.00000 |
| 0.00000 0.00000 1.00000 0.00000 |
| 0.00000 0.00000 0.00000 1.00000 |
Translated matrix:
Matrix (float4x4):
| 1.00000 0.00000 0.00000 2.00000 |
| 0.00000 1.00000 0.00000 5.00000 |
| 0.00000 0.00000 1.00000 -15.00000 |
| 0.00000 0.00000 0.00000 1.00000 |
Model matrix:
Matrix (float4x4):
| 0.94142 0.11716 0.31623 2.00000 |
| 0.11716 0.76569 -0.63246 5.00000 |
| -0.31623 0.63246 0.70711 -15.00000 |
| 0.00000 0.00000 0.00000 1.00000 |
Inverse matrix:
Matrix (float4x4):
| 0.94142 0.11716 -0.31623 -7.21205 |
| 0.11716 0.76569 0.63246 5.42409 |
| 0.31623 -0.63246 0.70711 13.13643 |
| -0.00000 0.00000 -0.00000 1.00000 |
Transpose matrix:
Matrix (float4x4):
| 0.94142 0.11716 0.31623 -0.00000 |
| 0.11716 0.76569 -0.63246 0.00000 |
| -0.31623 0.63246 0.70711 -0.00000 |
| -7.21205 5.42409 13.13643 1.00000 |
This is x86 output (virtual Windows11_ARM, MSVC):
Identity matrix:
Matrix (float4x4):
| 1.00000 0.00000 0.00000 0.00000 |
| 0.00000 1.00000 0.00000 0.00000 |
| 0.00000 0.00000 1.00000 0.00000 |
| 0.00000 0.00000 0.00000 1.00000 |
Translated matrix:
Matrix (float4x4):
| 1.00000 0.00000 0.00000 2.00000 |
| 0.00000 1.00000 0.00000 5.00000 |
| 0.00000 0.00000 1.00000 -15.00000 |
| 0.00000 0.00000 0.00000 1.00000 |
Model matrix:
Matrix (float4x4):
| 0.94142 0.11716 0.31623 2.00000 |
| 0.11716 0.76569 -0.63246 5.00000 |
| -0.31623 0.63246 0.70711 -15.00000 |
| 0.00000 0.00000 0.00000 1.00000 |
Inverse matrix:
Matrix (float4x4):
| 0.94142 0.11716 -0.31623 -7.21205 |
| 0.11716 0.76569 0.63246 5.42409 |
| 0.31623 -0.63246 0.70711 13.13643 |
| -0.00000 0.00000 -0.00000 1.00000 |
Transpose matrix:
Matrix (float4x4):
| 0.94142 0.11716 0.31623 -0.00000 |
| 0.11716 0.76569 -0.63246 0.00000 |
| -0.31623 0.63246 0.70711 -0.00000 |
| -7.21205 5.42409 13.13643 1.00000 |
I got similar results. I'll try to test on my Intel Mac later. What's your environment, may I ask? Also, do you have the latest cglm?
glm_arch_print_name(stderr || stdout), which was recently added, could be helpful to print the selected SIMD path in general. glm_arch_print_name can be called at the top to see what the arch is...
Thanks for the response. Using my example code again, against a fresh pull of the cglm repository, and calling the glm_arch_print_name function at the top, gave the following results:
cglm arch: x86 SSE*
Identity matrix:
Matrix (float4x4):
| 1.00000 0.00000 0.00000 0.00000 |
| 0.00000 1.00000 0.00000 0.00000 |
| 0.00000 0.00000 1.00000 0.00000 |
| 0.00000 0.00000 0.00000 1.00000 |
Translated matrix:
Matrix (float4x4):
| 1.00000 0.00000 0.00000 2.00000 |
| 0.00000 1.00000 0.00000 5.00000 |
| 0.00000 0.00000 1.00000 -15.00000 |
| 0.00000 0.00000 0.00000 1.00000 |
Model matrix:
Matrix (float4x4):
| 0.94142 0.11716 0.31623 2.00000 |
| 0.11716 0.76569 -0.63246 5.00000 |
| -0.31623 0.63246 0.70711 -15.00000 |
| 0.00000 0.00000 0.00000 1.00000 |
Inverse matrix:
Matrix (float4x4):
| 0.96799 -0.12046 -0.32515 7.41562 |
| 0.12046 -0.78730 0.65031 -5.57719 |
| 0.32515 0.65031 0.72707 -13.50722 |
| -0.00000 -0.00000 -0.00000 -1.02823 |
Transpose matrix:
Matrix (float4x4):
| 0.96799 0.12046 0.32515 -0.00000 |
| -0.12046 -0.78730 0.65031 -0.00000 |
| -0.32515 0.65031 0.72707 -0.00000 |
| 7.41562 -5.57719 -13.50722 -1.02823 |
Oh, and I'm on Windows 10 desktop.
I suspect the compiler is converting mixed +/-0.0f constants like this into all -0.0f or all 0.0f. Are you compiling with /fp:fast by chance?
That was it! I removed /fp:fast, and am now getting identical results on SSE and non-SSE paths.
Thank you, @recp and @gottfriedleibniz for all the troubleshooting and help!
Ideally, functions which mix +/-0.0f should be robust against fp:fast, especially since they still work with -ffast-math on gcc/clang.
Given that the variable is used for extracting the sign of a float, I wonder if replacing those constants with their hex equivalents (0x80000000u, 0x0u, ...) and then casting with _mm_castsi128_ps would be a more robust alternative. Otherwise, MSVC pragmas are an option.
All uses of -0.0f should probably be looked at, I guess. Maybe.
@deadwanderer thanks, @gottfriedleibniz good catch, many thanks!
I used /fp:fast but got similar results (SSE2 path on virtual Windows 11 ARM insider, VS Community 2022 (ARM 64-bit) 17.6.0 Preview 2.0); not sure how to reproduce the issue 🤷♂️
Given that the variable is used for extracting the sign of a float. I wonder if replacing those constants with its hex equivalent 0x80000000u, 0x0u, ... (then casting _mm_castsi128_ps) would be a more robust alternative. Otherwise, MSVC pragmas are an option.
+1 for this. It is better to make it work on compilers including msvc :)
Otherwise, MSVC pragmas are an option.
Which pragmas, may I ask? How would it be made better with MSVC pragmas? A PR with the improvements you mentioned would also be awesome :)
Finally had time to examine this: fp:fast is folding both expressions into a single constant:
x8 = _mm_set_ps(-0.f, 0.f, -0.f, 0.f);
x9 = glmm_shuff1(x8, 2, 1, 2, 1);
With fp:fast, the result is:
movaps xmm15, XMMWORD PTR __xmm@80000000000000008000000000000000
// ...
xorps xmm4, xmm15
// ...
xorps xmm13, xmm15
// ...
xorps xmm14, xmm15
// ...
xorps xmm12, xmm15
Without fp:fast
:
shufps xmm0, xmm10, 153 ; 00000099H
movaps xmm11, xmm4
movaps XMMWORD PTR x9$1$[rsp], xmm0
// ...
xorps xmm4, xmm10
// ...
xorps xmm14, xmm10
// ...
xorps xmm15, XMMWORD PTR x9$1$[rsp]
// ...
xorps xmm13, XMMWORD PTR x9$1$[rsp]
Modifying the function to use _mm_castsi128_ps(_mm_set_epi32(0x80000000, 0, 0x80000000, 0)) seems to solve this. Although, there may be some superior alternative that I am unaware of (and other cases, esp. after inlining).
Which pragmas may I ask?
For reference: __pragma(float_control(precise, on, push)) + __pragma(float_control(pop)). Although, I probably would not go this route.
@gottfriedleibniz many thanks
Modifying the function to use _mm_castsi128_ps(_mm_set_epi32(0x80000000, 0, 0x80000000, 0)) seems to solve this.
Any chance to get a PR? Or I'll do it ASAP.
For reference: __pragma(float_control(precise, on, push)) + __pragma(float_control(pop)). Although, I probably would not go this route.
thanks
Finally had time to examine this, fp:fast is folding both expressions into a single constant:
x8 = _mm_set_ps(-0.f, 0.f, -0.f, 0.f);
x9 = glmm_shuff1(x8, 2, 1, 2, 1);
Also, to avoid similar scenarios: any idea why the fp:fast option is doing this, since x8 and x9 are used in different places?
Any chance to get a PR? Or I'll do it ASAP.
You can.
any idea why the fp:fast option is doing this, since x8 and x9 are used in different places?
I suppose it's some byproduct of subexpression elimination, with fp:fast not having to honour signed zeros. I have not looked into this any further.
@gottfriedleibniz sure, I'll do it ASAP, thanks for the info
@gottfriedleibniz I created a PR for this, it would be nice to get a review: #291
@deadwanderer is it possible to test that PR with /fp:fast to validate that it works as expected (matching your previous tests)?
thanks
Here are my results using that PR's changes:
cglm arch: x86 SSE*
Identity matrix:
Matrix (float4x4):
| 1.00000 0.00000 0.00000 0.00000 |
| 0.00000 1.00000 0.00000 0.00000 |
| 0.00000 0.00000 1.00000 0.00000 |
| 0.00000 0.00000 0.00000 1.00000 |
Translated matrix:
Matrix (float4x4):
| 1.00000 0.00000 0.00000 2.00000 |
| 0.00000 1.00000 0.00000 5.00000 |
| 0.00000 0.00000 1.00000 -15.00000 |
| 0.00000 0.00000 0.00000 1.00000 |
Model matrix:
Matrix (float4x4):
| 0.94142 0.11716 0.31623 2.00000 |
| 0.11716 0.76569 -0.63246 5.00000 |
| -0.31623 0.63246 0.70711 -15.00000 |
| 0.00000 0.00000 0.00000 1.00000 |
Inverse matrix:
Matrix (float4x4):
| 0.94142 0.11716 -0.31623 -7.21205 |
| 0.11716 0.76569 0.63246 5.42409 |
| 0.31623 -0.63246 0.70711 13.13643 |
| -0.00000 0.00000 -0.00000 1.00000 |
Transpose matrix:
Matrix (float4x4):
| 0.94142 0.11716 0.31623 -0.00000 |
| 0.11716 0.76569 -0.63246 0.00000 |
| -0.31623 0.63246 0.70711 -0.00000 |
| -7.21205 5.42409 13.13643 1.00000 |
Looks good to me! Thanks so much, @recp and @gottfriedleibniz!
Seems good. Passes my tests.
@deadwanderer @gottfriedleibniz many thanks, that PR is merged.