[dxvk] Optimize for the d3d9 Strict float emulation path

Question

Blisto91 opened this issue 7 months ago · comments

The d3d9 Strict float emulation path in dxvk (see links below for technical description) is not enabled by default for all drivers, even though it is more correct, because it has a performance penalty compared to the default True.
Radv, and now also nvk, have code to optimize for this, so both use Strict out of the box without any performance penalty, and more games work out of the box without visual issues.

Amdvlk currently doesn't do this, so it will either take a performance penalty in any game where dxvk sets Strict by default, or risk visual issues in any game where such a built-in config doesn't exist yet. A couple of examples illustrating the performance dip can be seen below.
Note that these games are just randomly chosen and are not meant to be worst case scenarios. Also note that my test setup is pretty high end to begin with (RX6800 and 7950x) and so does not represent a typical one.

Risen

d3d9.floatEmulation = True

d3d9.floatEmulation = Strict

Dragons Dogma

d3d9.floatEmulation = True

d3d9.floatEmulation = Strict

See original dxvk PR doitsujin/dxvk#2294
See also radv MR https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/13436

ruiminzhao · Answer 1 · Mon Feb 19 2024 16:16:16 GMT+0800 (China Standard Time)

@Blisto91 Thanks for your comments. Now I'm investigating this issue on Amdvlk.
A question here:
Do you have any SPIRV generated when "d3d9.floatEmulation = True" or "d3d9.floatEmulation = Strict" ? I want to know if this setting will be reflected in the shader. Then I can do the optimization according to the related flag in SPIRV.
Thanks.

Blisto91 · Answer 2 · Fri Feb 23 2024 17:33:15 GMT+0800 (China Standard Time)

Hi there and thank you for the response. I am not personally skilled in this area but I have asked the dxvk devs for assistance when they have a bit of time.

Georg Lehmann · Answer 3 · Sat Mar 23 2024 17:05:11 GMT+0800 (China Standard Time)

I see GPUOpen-Drivers/llpc@e91a935 added an optimization for ((b==0.0 ? 0.0 : a) * (a==0.0 ? 0.0 : b)). But dxvk also emits fma((b==0.0 ? 0.0 : a), (a==0.0 ? 0.0 : b), c) . So unless llpc lowers fma to mul+add, you should also add a pattern that optimizes the fma version to v_fma_legacy_f32/v_mad_legacy_f32. And depending on if you run constant folding before the optimizations, you also want to handle the case where the comparison+select was optimized away for one mul operand, (a * (a==0.0 ? 0.0 : b))/fma(a, (a==0.0 ? 0.0 : b), c), if b is not constant zero.
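
As a rough illustration of why the select+multiply form can be replaced by a single legacy multiply, here is a minimal Python sketch. `mul_legacy` is a hypothetical model that assumes a legacy multiply returns zero whenever either operand is zero; it is not the hardware definition, and the sign of zero is not modelled. The fma form wraps the same product with an addend `c`.

```python
import math

def mul_legacy(a: float, b: float) -> float:
    """Assumed model of a legacy multiply: a zero operand forces a zero result."""
    if a == 0.0 or b == 0.0:
        return 0.0
    return a * b

def select_mul(a: float, b: float) -> float:
    """The emitted form: (b==0.0 ? 0.0 : a) * (a==0.0 ? 0.0 : b)."""
    lhs = 0.0 if b == 0.0 else a
    rhs = 0.0 if a == 0.0 else b
    return lhs * rhs

def same(x: float, y: float) -> bool:
    """Equality that treats NaN as equal to NaN (and -0.0 as equal to 0.0)."""
    return (math.isnan(x) and math.isnan(y)) or x == y

SAMPLES = [0.0, -0.0, 1.5, -2.0, math.inf, -math.inf, math.nan]

# Collect every operand pair where the two forms disagree.
mismatches = [(a, b) for a in SAMPLES for b in SAMPLES
              if not same(select_mul(a, b), mul_legacy(a, b))]
print(mismatches)  # prints []
```

Under that assumed model the two forms agree on every sampled pair, including `inf * 0` and `nan * 0`, which a plain multiply would turn into NaN.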

ruiminzhao · Answer 4 · Mon Mar 25 2024 01:03:15 GMT+0800 (China Standard Time)

@DadSchoorse Thanks for your comment. I have added more patterns as you suggested; the supported patterns are now listed below:

((b==0.0 ? 0.0 : a) * (a==0.0 ? 0.0 : b)) ==> fmul_legacy(a,b)
a * (a==0.0 ? 0.0 : b) or (b==0.0 ? 0.0 : a) * b ==> fmul_legacy(a,b)
fma((b==0.0 ? 0.0 : a), (a==0.0 ? 0.0 : b), c) ==> fma_legacy(a,b,c)
fma(a, (a==0.0 ? 0.0 : b), c) or fma((b==0.0 ? 0.0 : a), b, c) ==> fma_legacy(a,b,c)

For 2.3, one more condition is the single operand(a or b) should not be constant zero here.

Please let me know if anything is missing. My fix is now in CI, and I look forward to merging and delivering it ASAP.
Thanks.

Georg Lehmann · Answer 5 · Mon Mar 25 2024 03:53:05 GMT+0800 (China Standard Time)

> For 2.3, one more condition is the single operand(a or b) should not be constant zero here.

What I've said before may have been a bit ambiguous, so just to make sure: For a * (a==0.0?0.0:b) it's important that b is not zero. So if (b.isConstant() && b.constantValue() != 0.0) { apply_opt(); }, not if (!b.isConstant() || b.constantValue() != 0.0).

Otherwise, your list matches what radv optimizes.
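
The condition above can be checked with a small Python sketch. `mul_legacy` is again a hypothetical model (zero operand gives zero result), and `may_rewrite` is an illustrative helper name that mirrors the `isConstant()` / `constantValue()` check in the comment; neither is an LLPC or radv API.

```python
import math

def mul_legacy(a: float, b: float) -> float:
    """Assumed model of a legacy multiply: a zero operand forces a zero result."""
    if a == 0.0 or b == 0.0:
        return 0.0
    return a * b

def one_select_mul(a: float, b: float) -> float:
    """a * (a==0.0 ? 0.0 : b): the form left after one select was folded away."""
    return a * (0.0 if a == 0.0 else b)

def same(x: float, y: float) -> bool:
    return (math.isnan(x) and math.isnan(y)) or x == y

SAMPLES = [0.0, -0.0, 1.5, -2.0, math.inf, -math.inf, math.nan]

def safe_for(b: float) -> bool:
    """True if the rewrite to mul_legacy agrees for every sampled a."""
    return all(same(one_select_mul(a, b), mul_legacy(a, b)) for a in SAMPLES)

def may_rewrite(b_is_constant: bool, b_value: float) -> bool:
    """Hypothetical guard: b must be a constant AND that constant must be non-zero."""
    return b_is_constant and b_value != 0.0

print(safe_for(2.0))  # prints True
print(safe_for(0.0))  # prints False
```

With `b == 0.0` and `a == inf`, the one-select form evaluates `inf * 0.0` and yields NaN, while the modelled legacy multiply yields 0. That is why an unknown `b` is not enough: it could be zero at run time.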

Georg Lehmann · Answer 6 · Mon Mar 25 2024 05:01:40 GMT+0800 (China Standard Time)

Oh, another thing I just thought of, I don't see a bit size check in GPUOpen-Drivers/llpc@e91a935 . v_mul_legacy_f32/v_fma_legacy_f32 are 32bit only.

Blisto91 · Answer 7 · Sun May 19 2024 06:17:09 GMT+0800 (China Standard Time)

Was this work supposed to be enabled in the 2024.Q2.1 release?
I tried a quick test with my iGPU in Risen 1 and still get a big performance drop when setting d3d9.floatEmulation = Strict

AMDVLK 2024.Q2.1

d3d9.floatEmulation = True

d3d9.floatEmulation = Strict

RADV for comparison

d3d9.floatEmulation = True

d3d9.floatEmulation = Strict

ruiminzhao · Answer 8 · Mon May 20 2024 14:42:31 GMT+0800 (China Standard Time)

@Blisto91 Thanks for your feedback. I have two suspected causes for the issue:

My fix doesn't match the IR pattern generated by the game.
Another fix, related to the fast-math flags, has broken the pattern that my optimization matches.

To confirm which one causes this issue, could you please dump the pipeline (or ask the dxvk devs for assistance)? Then I can check whether my optimization has taken effect.

Thanks.

Robin Kertels · Answer 9 · Mon May 20 2024 16:42:23 GMT+0800 (China Standard Time)

You can dump the shaders by setting the environment variable DXVK_SHADER_DUMP_PATH=/your/path and then running the game with DXVK.

That will export the generated SPIR-V among other things.

Any D3D9 game will work, you just need to also set the environment variable DXVK_CONFIG=d3d9.floatEmulation = Strict; to enable the accurate float behavior.
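
For anyone scripting this, a minimal Python sketch of the two variables described above. The helper name `dxvk_dump_env` is made up for illustration, and how the game is actually launched (Wine, Proton, a launcher script) is left out because it depends on the setup.

```python
import os

def dxvk_dump_env(dump_path, base_env=None):
    """Return a copy of the environment with the two DXVK variables set."""
    env = dict(os.environ if base_env is None else base_env)
    # Directory where DXVK writes the generated SPIR-V and related files.
    env["DXVK_SHADER_DUMP_PATH"] = dump_path
    # Enables the accurate float behaviour for D3D9 games.
    env["DXVK_CONFIG"] = "d3d9.floatEmulation = Strict;"
    return env

# base_env={} keeps the example output small; omit it to inherit os.environ.
env = dxvk_dump_env("/your/path", base_env={})
print(env)
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` together with whatever command starts the game.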

Blisto91 · Answer 10 · Thu May 23 2024 23:23:41 GMT+0800 (China Standard Time)

Linked is a dxvk shader dump from Risen 1 running on my 7950x iGPU with amdvlk 2024.Q2.1 and d3d9.floatEmulation = strict

https://drive.proton.me/urls/SF8RPVZ6CG#Rk7KIIG4d480

ruiminzhao · Answer 11 · Tue May 28 2024 11:54:06 GMT+0800 (China Standard Time)

@Blisto91 Thanks. But unfortunately I can't access this link. Maybe you could attach the related files to this page?

Blisto91 · Answer 12 · Tue May 28 2024 14:47:45 GMT+0800 (China Standard Time)

@ruiminzhao Hi there. I hope a 7zip wrapped in a zip is fine, as most formats GitHub allows don't compress enough on their own.
Risen-amdvlk-strict-float.zip

ruiminzhao · Answer 13 · Tue May 28 2024 15:48:41 GMT+0800 (China Standard Time)

> @ruiminzhao Hi there. I hope a 7zip wrapped in a zip is fine, as most formats GitHub allows don't compress enough on their own. Risen-amdvlk-strict-float.zip

Thanks. I can get the log now and will have a look later.

ruiminzhao · Answer 14 · Mon Jun 03 2024 10:26:30 GMT+0800 (China Standard Time)

The root cause has been found. The widely used pattern looks like this:
"
%2272 = select reassoc nnan nsz arcp contract afn i1 %2270, float 0.000000e+00, float %2271
%2273 = insertelement <3 x float> poison, float %2272, i64 0
…
%2276 = select reassoc nnan nsz arcp contract afn i1 %2274, float 0.000000e+00, float %2275
%2277 = insertelement <3 x float> %2273, float %2276, i64 1
…
%2280 = select reassoc nnan nsz arcp contract afn i1 %2278, float 0.000000e+00, float %2279
%2281 = insertelement <3 x float> %2277, float %2280, i64 2
..

%2293 = select reassoc nnan nsz arcp contract afn i1 %2291, float 0.000000e+00, float %2292
%2294 = insertelement <3 x float> poison, float %2293, i64 0
…
%2297 = select reassoc nnan nsz arcp contract afn i1 %2295, float 0.000000e+00, float %2296
%2298 = insertelement <3 x float> %2294, float %2297, i64 1
…
%2301 = select reassoc nnan nsz arcp contract afn i1 %2299, float 0.000000e+00, float %2300
%2302 = insertelement <3 x float> %2298, float %2301, i64 2
…
%2303 = fmul reassoc nnan nsz arcp contract afn <3 x float> %2281, %2302
"
This pattern hasn't been caught, so the pass order needs to change to this:
"
Many other transforms
Scalarizer pass
fmul_legacy / fma_legacy matching
"