Intrinsics for ARM?

Question

DonFlymoor opened this issue 4 years ago · comments

Currently, FS runs at 7 FPS with the Titanic on a Raspberry Pi 4, and I believe that could be significantly improved with ARM intrinsics. If you are interested, here are some links on ARM intrinsic functions and ways to implement them.

https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/neon-programmers-guide-for-armv8-a/optimizing-c-code-with-neon-intrinsics/single-page
https://developer.arm.com/docs/ddi0596/h/simd-and-floating-point-instructions-alphabetic-order
https://developer.arm.com/architectures/instruction-sets/simd-isas/neon/intrinsics
https://community.arm.com/developer/tools-software/oss-platforms/b/android-blog/posts/arm-neon-programming-quick-reference

I will attempt to implement them myself, but I do not have much experience with intrinsics, so I'm not sure how far I'll get.
I will be working on the multithreaded_rendering branch, as that is the only one currently working on ARM.

Gabriele Giuseppini · Answer 1 · Fri Jul 24 2020 18:57:59 GMT+0800 (China Standard Time)

Hi,
Intrinsics, at this moment, are only used in two places in FS:

In the light diffusion algorithm
In discretization of world X coordinates into indices in the ocean surface and ocean land arrays

For the former, the implementation of light diffusion with x86 intrinsics provides a very nice speedup compared with a pure C++ implementation; however, this speedup is only visible when a ship contains light sources. Ships like the "plain" Titanic do not contain light sources, hence there wouldn't be any speedup there.

For the latter, using a single intrinsic instruction in place of the std::floor+cast equivalent provides a nice speedup across the board, but that's on the order of a few frames per second.
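As a rough illustration of the second use case, the sketch below maps a world X coordinate to a sample index plus a fractional offset, and uses both to interpolate in a sampled array. The constant values, the DiscretizedX struct and the SampleAt helper are assumptions made up for this example, not FS code; the truncation is written as a plain cast, which matches std::floor followed by a cast only for non-negative inputs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical constants, chosen only for illustration.
constexpr float HalfMaxWorldWidth = 5000.0f;
constexpr float Dx = 2.0f; // world units between consecutive samples

struct DiscretizedX
{
    std::int64_t index; // integral sample index
    float fraction;     // position between index and index + 1, in [0, 1)
};

// Maps a world X coordinate in [-HalfMaxWorldWidth, +HalfMaxWorldWidth]
// to a sample index and a fractional remainder.
inline DiscretizedX DiscretizeX(float x)
{
    float const sampleIndexF = (x + HalfMaxWorldWidth) / Dx;
    assert(sampleIndexF >= 0.0f); // truncation equals floor only for non-negative values

    auto const sampleIndexI = static_cast<std::int64_t>(sampleIndexF);
    return { sampleIndexI, sampleIndexF - static_cast<float>(sampleIndexI) };
}

// Linearly interpolates between the two samples surrounding x.
inline float SampleAt(std::vector<float> const & samples, float x)
{
    DiscretizedX const d = DiscretizeX(x);
    auto const i = static_cast<std::size_t>(d.index);
    assert(i + 1 < samples.size());
    return samples[i] + (samples[i + 1] - samples[i]) * d.fraction;
}
```

With these assumed constants, x = -4999 lands halfway between samples 0 and 1. The cast is the only step that an intrinsic would replace; the rest is ordinary float arithmetic.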

In summary:

I currently plan to implement the light diffusion algorithm in C++ for non-X86 architectures (i.e. for ARM), unless you notice that ships with lights are unbearably slow compared to when they have no lights (which you may simulate by setting light diffusion to zero, which effectively turns off the whole algorithm altogether)
I plan to use ARM intrinsics for the second use-case (X discretization), but then tell me first: what are you using in GameMath::FastTruncateInt64/32? If you're using my original code, then that guy has an SSE intrinsic in it at the moment (I'll switch that to an #ifdef soon), which makes me wonder - how can the code even run on your box? Is it because of SIMDE? If that's the case, could you find a disassembly of that part of the code and see how SIMDE ended up compiling it? It's likely using ARM intrinsics already, I suspect...
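The #ifdef switch mentioned above could look roughly like the sketch below. The intrinsic pair (_mm_cvttss_si64 and _mm_load_ss) is the one quoted in this thread; the function name FastTruncateToInt64, the preprocessor condition and the fallback are assumptions for illustration, not the code that ended up in FS.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// SSE is part of the x86-64 baseline, so the architecture macro is enough here.
#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_HAS_SSE_X64 1
#else
#define SKETCH_HAS_SSE_X64 0
#endif

// Truncates a finite float towards zero and returns it as a 64-bit integer.
inline std::int64_t FastTruncateToInt64(float value)
{
    assert(std::isfinite(value));
#if SKETCH_HAS_SSE_X64
    // x86-64 path: load the scalar into an SSE register, convert with truncation.
    return _mm_cvttss_si64(_mm_load_ss(&value));
#else
    // Portable path (ARM and anything else): a plain cast, which the compiler
    // lowers to whatever the target offers.
    return static_cast<std::int64_t>(value);
#endif
}
```

Both paths truncate towards zero, so callers see the same result on either architecture for values that fit in 64 bits.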

Don Flymoor · Answer 2 · Fri Jul 24 2020 21:04:51 GMT+0800 (China Standard Time)

SIMDE is a translation layer that replaces x86 intrinsic calls with the equivalent ARM intrinsic calls, so x86 intrinsics should "just work" on ARM. In reality, there are a couple of functions which must be replaced, although those are few and far between.
Are there any plans to implement the springs' math with intrinsics instead of plain C++? If that's possible, it could speed things up a bit.
I'm not sure if it's the lights on the Titanic or just the sheer number of springs which is slowing the simulation to a crawl. I'm using a 4-core 1.5 GHz Broadcom ARM CPU; how does that compare with your specs?

Sidenote: Perhaps you should include some of your system specifications in README.md, such as CPU speed, GPU speed, and RAM amount, so that people can figure out how fast FS will run on their system.

Gabriele Giuseppini · Answer 3 · Sat Jul 25 2020 01:20:30 GMT+0800 (China Standard Time)

Good questions. Here's the deal with performance:

The bottleneck at the moment is, indeed, the spring relaxation algorithm. That requires about 80% of the time spent for simulation of a single frame. I have an alternative version of the same algorithm written with intrinsics in the Benchmarks project, which shows a 27% perf improvement. Sooner or later I'll integrate that in the game, but it's not gonna be a...game changer (pun intended). I might bump up this work item quite soon though, your box might benefit from it a lot.
I plan to revisit the spring relaxation algorithm altogether after the next two major versions (see roadmap at https://gamejolt.com/games/floating-sandbox/353572/devlog/the-future-of-floating-sandbox-cdk2c9yi). There is a different family of algorithms based on minimization of potential energy, which supposedly requires fewer iterations and on top of that is easily parallelizable - the current iterative algorithm is not (easily) parallelizable.
This said, in the current implementation, what matters the most is CPU speed - the whole simulation is basically single-threaded (some small steps are parallel, but they're puny compared with the spring relaxation). My laptop is a single-core, 2.2GHz Intel box, and the plain Titanic runs at ~22 FPS.
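To make the parallelization problem concrete, here is a generic iterative relaxation pass. It is not the FS algorithm: it is a simple position-based scheme assumed for illustration, and the Vec2, Spring and RelaxSprings names are hypothetical. What it shares with any sequential spring relaxation is that each spring writes to endpoints that later springs in the same pass read.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 { float x; float y; };

struct Spring
{
    std::size_t a;    // index of first endpoint
    std::size_t b;    // index of second endpoint
    float restLength; // length at which the spring needs no correction
};

// Runs `iterations` relaxation passes over all springs, in order.
// Each spring moves both endpoints so that together they restore its rest length.
inline void RelaxSprings(
    std::vector<Vec2> & points,
    std::vector<Spring> const & springs,
    int iterations)
{
    for (int iter = 0; iter < iterations; ++iter)
    {
        for (Spring const & s : springs)
        {
            assert(s.a < points.size() && s.b < points.size());
            Vec2 & pa = points[s.a];
            Vec2 & pb = points[s.b];

            float const dx = pb.x - pa.x;
            float const dy = pb.y - pa.y;
            float const length = std::sqrt(dx * dx + dy * dy);
            if (length == 0.0f)
                continue; // direction undefined; skip this spring

            // Fraction of the displacement vector each endpoint has to travel.
            float const k = 0.5f * (length - s.restLength) / length;

            // These writes are read by later springs that share pa or pb.
            pa.x += k * dx; pa.y += k * dy;
            pb.x -= k * dx; pb.y -= k * dy;
        }
    }
}
```

Splitting the inner loop across threads would let two springs that share a point update it at the same time, which is why such a loop cannot be parallelized without extra work such as partitioning the springs.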

Good idea on the system specs, I'm updating the readme right now.

This said, can you check the FPS rate you get on the two Titanics - "R.M.S. Titanic" and "R.M.S. Titanic (With Power)"? This will tell us whether the intrinsics are the bottleneck or not.

Don Flymoor · Answer 4 · Sat Jul 25 2020 02:10:58 GMT+0800 (China Standard Time)

That sounds great! I'll try the normal Titanic and report the difference as soon as I can.

Don Flymoor · Answer 5 · Sat Jul 25 2020 02:27:47 GMT+0800 (China Standard Time)

Would overlaying a light texture give the same results as light diffusion? When I observed the lights on the Titanic, I assumed that it was an overlaid glare texture... what is the difference between a glare texture and light diffusion?

Gabriele Giuseppini · Answer 6 · Sat Jul 25 2020 19:16:57 GMT+0800 (China Standard Time)

The lights in FS are calculated on the CPU. At each frame, for each lamp, I visit all the points in the ship and add to each a "light quantity" resulting from that lamp, taking into account distance, lighting parameters, etc.
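A plain C++ version of that loop could be sketched as follows. The structure (for each lamp, visit every point and add a light quantity based on distance) follows the description above; the linear falloff, the spread and strength parameters, and the clamp to 1 are assumptions for illustration, not the actual FS lighting formula.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vec2 { float x; float y; };

struct Lamp
{
    Vec2 position;
    float spread;   // hypothetical parameter: distance at which the light fades to zero
    float strength; // hypothetical parameter: light quantity at the lamp itself
};

// Returns one light value per point, accumulated over all lamps and clamped to [0, 1].
inline std::vector<float> DiffuseLight(
    std::vector<Vec2> const & points,
    std::vector<Lamp> const & lamps)
{
    std::vector<float> light(points.size(), 0.0f);
    for (Lamp const & lamp : lamps)
    {
        assert(lamp.spread > 0.0f);
        for (std::size_t p = 0; p < points.size(); ++p)
        {
            float const dx = points[p].x - lamp.position.x;
            float const dy = points[p].y - lamp.position.y;
            float const distance = std::sqrt(dx * dx + dy * dy);

            // Assumed linear falloff: full strength at the lamp, zero at `spread`.
            float const falloff = std::max(0.0f, 1.0f - distance / lamp.spread);
            light[p] = std::min(1.0f, light[p] + lamp.strength * falloff);
        }
    }
    return light;
}
```

The cost is proportional to lamps times points, which is why it only matters for ships that actually carry lamps. The inner loop touches each point independently, which makes it a natural candidate for SIMD.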

You are making a very good point. I could definitely achieve the same effect as light diffusion by using a pre-calculated light texture instead, except for two things:

Each light in FS technically has "personal characteristics" that make it unique - there are different lamp types in FS, and a single ship could have lights of different types. Of course this could be worked around by having N pre-calculated textures, one for each distinct type of light on the ship at the moment.
Ships could have any number of lamps. Rendering a texture quad for each lamp would probably be worse than the light diffusion algorithm itself (I'm quite sure it would on old cards), but this might not be true on newer cards and I might want to revisit it sooner or later.

Note that point 2 is also the reason why I'm not doing this in a shader: a "light" shader would have to take as input a (possibly large, in the 100's) array of lamp positions, and calculate lighting for each. Now, old cards (like mine) only support uniform arrays up to 100 elements, and thus I'd have to chop up calculations into batches of 100, complicating the whole shebang quite a lot.
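The batching itself is simple bookkeeping, as the sketch below suggests. The limit of 100 comes from the comment above; LampBatch and SplitIntoBatches are hypothetical names, and the per-batch uniform upload and draw call that would make up the real complication are deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct LampBatch
{
    std::size_t first; // index of the first lamp in the batch
    std::size_t count; // number of lamps in the batch, at most maxBatchSize
};

// Splits lampCount lamps into consecutive batches of at most maxBatchSize lamps.
inline std::vector<LampBatch> SplitIntoBatches(std::size_t lampCount, std::size_t maxBatchSize)
{
    assert(maxBatchSize > 0);
    std::vector<LampBatch> batches;
    for (std::size_t first = 0; first < lampCount; first += maxBatchSize)
    {
        // The last batch may be shorter than the others.
        batches.push_back({ first, std::min(maxBatchSize, lampCount - first) });
    }
    return batches;
}
```

A ship with 250 lamps would need three passes (100, 100 and 50 lamps), each accumulating into the result of the previous one.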

In any case, I have in my roadmap (which is at FloatingSandbox TODO.txt in the source repo, btw, you may check it out!) an item to check whether lighting may be done better on the GPU. Maybe, if I detect a newer card, I could shift the calculations to a shader, making heavy use of instanced drawing. I'll definitely revisit this once I have the new box (which comes with an NVidia Quadro, so I may play with "modern" OpenGL!).

Finally, light diffusion accounts only for a tiny percentage of the simulation time spent in a frame, depending on how many lights a given ship has. The elephant in the room is, as we've discussed already, the spring relaxation algorithm. Any energy I want to spend on optimizing will first go there, for sure :-) As a matter of fact, these discussions we're having prompted me to bump up the priority of writing the (current) algorithm with intrinsics - might come soon.

In the meantime, I would like to make it so that you don't need SIMDE to build on ARM. I'll be working on this in the next few days. To this end, I would need to know how GCC is compiling one of the functions on your box. Would you be able to help me with that? If so I'll ask you to set a breakpoint at a specific location (I'll tell you how) and to disassemble the code once you hit that. I'll definitely guide you step-by-step. Speaking of which, do you have a Discord account by any chance? It would be easier to do "interactive sessions" such as these there rather than via GitHub tickets :-)

Don Flymoor · Answer 7 · Sat Jul 25 2020 23:11:32 GMT+0800 (China Standard Time)

Sure, I'd be glad to help! I don't use GCC, I use Clang, because GCC will not compile FS for some reason. I'm afraid that I don't have access to Discord, but this link may work: https://alpha.sandstorm.io/shared/11sbnmNCKTdij2YP6i3ySMYMbgXxtD_4Pq1-2c9OT2s

Gabriele Giuseppini · Answer 8 · Sun Jul 26 2020 01:44:47 GMT+0800 (China Standard Time)

Perfect!!! Sure, Clang is fine - I really wanna see how SIMDE gets compiled by whatever compiler you use. Let's get started then:

1. Start Floating Sandbox under gdb (I'd normally run gdb --args FloatingSandbox)
2. gdb should start in "break" mode; if so, type break OceanFloor.h:66, press enter, and then type r followed again by enter
3. Floating Sandbox should start running, and at the very first frame it should break in gdb. If so, type u followed by enter, which should print a disassembly. Copy'n'paste the first - say, 2 - pages, and send these to me. Let me know if any of the steps above don't work as I expect them to work.

Don Flymoor · Answer 9 · Sun Jul 26 2020 03:04:58 GMT+0800 (China Standard Time)

This is the SIMDE part of the debug log: r_.f32[0] = *mem_addr;

Here it is in context:

Thread 1 "FloatingSandbox" hit Breakpoint 1, Physics::OceanFloor::GetHeightAt (
    this=0xd73d50, x=-73)
    at Floating-Sandbox/trunk/Game/OceanFloor.h:66
66	        float const sampleIndexF = (x + GameParameters::HalfMaxWorldWidth) / Dx;
(gdb) u

Thread 1 "FloatingSandbox" hit Breakpoint 2, simde_mm_load_ss (
    mem_addr=<optimized out>)
    at Floating-Sandbox/trunk/GameCore/../GameCore/simde/simde/x86/sse.h:2242
2242	    r_.f32[0] = *mem_addr;
(gdb) u
FastTruncateInt64 (value=994.099182)
    at Floating-Sandbox/trunk/GameCore/../GameCore/GameMath.h:36
36	    return _mm_cvttss_si64(_mm_load_ss(&value));
(gdb) u
Physics::OceanFloor::GetHeightAt (this=0xd73d50, x=<optimized out>)
    at Floating-Sandbox/trunk/Game/OceanFloor.h:72
72	        float const sampleIndexDx = sampleIndexF - sampleIndexI;
(gdb) u
77	        return mSamples[sampleIndexI].SampleValue

Gabriele Giuseppini · Answer 10 · Sun Jul 26 2020 03:41:17 GMT+0800 (China Standard Time)

Thanks - but that's just the _mm_load_ss part of GameMath.h:36, I need to see what's immediately after, i.e. how simde ended up compiling _mm_cvttss_si64. It should go through some SIMDE_CONVERT_FTOI and simde_math_roundf calls.

To this end, I just realized my instructions might have been wrong; can you replace u in point 3 above with disassemble?

Don Flymoor · Answer 11 · Sun Jul 26 2020 07:53:34 GMT+0800 (China Standard Time)

Ok, here is the output:

Thread 1 "FloatingSandbox" hit Breakpoint 1, Physics::OceanFloor::GetHeightAt (
    this=0xd74e10, x=-75)
    at Floating-Sandbox/trunk/Game/OceanFloor.h:66
66	        float const sampleIndexF = (x + GameParameters::HalfMaxWorldWidth) / Dx;
(gdb) disassemble
Dump of assembler code for function Physics::Ship::HandleCollisionsWithSeaFloor(GameParameters const&):
   0x001b444c <+0>:	push	{r4, r5, r6, r7, r8, r9, r10, r11, lr}
   0x001b4450 <+4>:	add	r11, sp, #28
   0x001b4454 <+8>:	sub	sp, sp, #4
   0x001b4458 <+12>:	vpush	{d8-d15}
   0x001b445c <+16>:	sub	sp, sp, #8
   0x001b4460 <+20>:	ldr	r10, [r0, #52]	; 0x34
   0x001b4464 <+24>:	str	r0, [sp, #4]
   0x001b4468 <+28>:	cmp	r10, #0
   0x001b446c <+32>:	beq	0x1b46f4 <Physics::Ship::HandleCollisionsWithSeaFloor(GameParameters const&)+680>
   0x001b4470 <+36>:	vldr	s0, [pc, #652]	; 0x1b4704 <Physics::Ship::HandleCollisionsWithSeaFloor(GameParameters const&)+696>
   0x001b4474 <+40>:	vldr	s2, [r1]
   0x001b4478 <+44>:	vldr	s6, [r1, #232]	; 0xe8
   0x001b447c <+48>:	vldr	s4, [r1, #228]	; 0xe4
   0x001b4480 <+52>:	vldr	s22, [pc, #644]	; 0x1b470c <Physics::Ship::HandleCollisionsWithSeaFloor(GameParameters const&)+704>
   0x001b4484 <+56>:	vdiv.f32	s16, s0, s2
   0x001b4488 <+60>:	vldr	s0, [pc, #632]	; 0x1b4708 <Physics::Ship::HandleCollisionsWithSeaFloor(GameParameters const&)+700>
   0x001b448c <+64>:	vneg.f32	s20, s4
   0x001b4490 <+68>:	vldr	s24, [pc, #632]	; 0x1b4710 <Physics::Ship::HandleCollisionsWithSeaFloor(GameParameters const&)+708>
   0x001b4494 <+72>:	vldr	s26, [pc, #632]	; 0x1b4714 <Physics::Ship::HandleCollisionsWithSeaFloor(GameParameters const&)+712>
   0x001b4498 <+76>:	vsub.f32	s18, s0, s6
   0x001b449c <+80>:	vldr	s28, [pc, #628]	; 0x1b4718 <Physics::Ship::HandleCollisionsWithSeaFloor(GameParameters const&)+716>
   0x001b44a0 <+84>:	vldr	s30, [pc, #628]	; 0x1b471c <Physics::Ship::HandleCollisionsWithSeaFloor(GameParameters const&)+720>
   0x001b44a4 <+88>:	vldr	s17, [pc, #628]	; 0x1b4720 <Physics::Ship::HandleCollisionsWithSeaFloor(GameParameters const&)+724>
   0x001b44a8 <+92>:	vldr	s19, [pc, #628]	; 0x1b4724 <Physics::Ship::HandleCollisionsWithSeaFloor(GameParameters const&)+728>
   0x001b44ac <+96>:	vldr	s21, [pc, #628]	; 0x1b4728 <Physics::Ship::HandleCollisionsWithSeaFloor(GameParameters const&)+732>
   0x001b44b0 <+100>:	mov	r7, #0
   0x001b44b4 <+104>:	ldr	r0, [sp, #4]
   0x001b44b8 <+108>:	ldr	r4, [r0, #108]	; 0x6c
   0x001b44bc <+112>:	ldr	r6, [r0, #20]
   0x001b44c0 <+116>:	ldrb	r0, [r4, r7]!
   0x001b44c4 <+120>:	ldrb	r3, [r4, #3]
   0x001b44c8 <+124>:	ldrb	r2, [r4, #2]
   0x001b44cc <+128>:	ldrb	r1, [r4, #1]
   0x001b44d0 <+132>:	orr	r2, r2, r3, lsl #8
   0x001b44d4 <+136>:	orr	r0, r0, r1, lsl #8
   0x001b44d8 <+140>:	orr	r0, r0, r2, lsl #16
   0x001b44dc <+144>:	ldrb	r9, [r4, #5]
   0x001b44e0 <+148>:	vmov	s23, r0
   0x001b44e4 <+152>:	vcmpe.f32	s23, s22
   0x001b44e8 <+156>:	vmrs	APSR_nzcv, fpscr
   0x001b44ec <+160>:	vmov.f32	s0, s23
   0x001b44f0 <+164>:	vmovlt.f32	s0, s22
   0x001b44f4 <+168>:	vcmpe.f32	s0, s24
   0x001b44f8 <+172>:	vmrs	APSR_nzcv, fpscr
   0x001b44fc <+176>:	vmovgt.f32	s0, s24
=> 0x001b4500 <+180>:	vmul.f32	s27, s0, s26
   0x001b4504 <+184>:	vadd.f32	s25, s27, s28
   0x001b4508 <+188>:	vmov	r0, s25
   0x001b450c <+192>:	bl	0x2420c <__aeabi_f2lz@plt>
   0x001b4510 <+196>:	mov	r5, r0
   0x001b4514 <+200>:	bl	0x23720 <__aeabi_l2f@plt>
   0x001b4518 <+204>:	vmov	s0, r0
   0x001b451c <+208>:	mov	r0, #656	; 0x290
   0x001b4520 <+212>:	orr	r0, r0, #65536	; 0x10000
   0x001b4524 <+216>:	mov	r1, r4
   0x001b4528 <+220>:	ldr	r8, [r6, r0]
   0x001b452c <+224>:	ldrb	r2, [r1, #4]!
   0x001b4530 <+228>:	add	r0, r8, r5, lsl #3
   0x001b4534 <+232>:	vsub.f32	s0, s25, s0
   0x001b4538 <+236>:	vldr	s4, [r0, #4]
   0x001b453c <+240>:	vldr	s2, [r0]
   0x001b4540 <+244>:	ldrb	r0, [r1, #2]
   0x001b4544 <+248>:	ldrb	r1, [r1, #3]
   0x001b4548 <+252>:	orr	r0, r0, r1, lsl #8
   0x001b454c <+256>:	vmul.f32	s0, s0, s4
   0x001b4550 <+260>:	orr	r1, r2, r9, lsl #8
   0x001b4554 <+264>:	orr	r0, r1, r0, lsl #16
   0x001b4558 <+268>:	vmov	s25, r0
   0x001b455c <+272>:	vadd.f32	s29, s0, s2
   0x001b4560 <+276>:	vcmpe.f32	s25, s29
   0x001b4564 <+280>:	vmrs	APSR_nzcv, fpscr
   0x001b4568 <+284>:	bgt	0x1b46e8 <Physics::Ship::HandleCollisionsWithSeaFloor(GameParameters const&)+668>
   0x001b456c <+288>:	vadd.f32	s27, s27, s30
   0x001b4570 <+292>:	vmov	r0, s27
   0x001b4574 <+296>:	bl	0x2420c <__aeabi_f2lz@plt>
   0x001b4578 <+300>:	mov	r5, r0
   0x001b457c <+304>:	bl	0x23720 <__aeabi_l2f@plt>
   0x001b4580 <+308>:	vmov	s0, r0
   0x001b4584 <+312>:	add	r1, r8, r5, lsl #3
   0x001b4588 <+316>:	vldr	s2, [r1]
   0x001b458c <+320>:	vldr	s4, [r1, #4]
   0x001b4590 <+324>:	ldr	r0, [sp, #4]
   0x001b4594 <+328>:	vmov.f32	s8, s19
   0x001b4598 <+332>:	vsub.f32	s2, s2, s29
   0x001b459c <+336>:	vsub.f32	s0, s27, s0
   0x001b45a0 <+340>:	ldr	r2, [r0, #124]	; 0x7c
   0x001b45a4 <+344>:	ldrb	r12, [r2, r7]!
   0x001b45a8 <+348>:	mov	r0, r2
   0x001b45ac <+352>:	ldrb	r6, [r2, #3]
   0x001b45b0 <+356>:	vmul.f32	s0, s4, s0
   0x001b45b4 <+360>:	ldrb	r3, [r2, #2]
   0x001b45b8 <+364>:	ldrb	lr, [r2, #1]
   0x001b45bc <+368>:	orr	r8, r3, r6, lsl #8
   0x001b45c0 <+372>:	ldrb	r6, [r0, #4]!
   0x001b45c4 <+376>:	ldrb	r5, [r2, #5]
   0x001b45c8 <+380>:	ldrb	r1, [r0, #2]
   0x001b45cc <+384>:	vadd.f32	s4, s2, s0
   0x001b45d0 <+388>:	ldrb	r0, [r0, #3]
   0x001b45d4 <+392>:	orr	r3, r12, lr, lsl #8
   0x001b45d8 <+396>:	orr	r0, r1, r0, lsl #8
   0x001b45dc <+400>:	orr	r1, r6, r5, lsl #8
   0x001b45e0 <+404>:	vmul.f32	s6, s4, s4
   0x001b45e4 <+408>:	orr	r3, r3, r8, lsl #16
   0x001b45e8 <+412>:	orr	r0, r1, r0, lsl #16
   0x001b45ec <+416>:	vmov	s0, r3
   0x001b45f0 <+420>:	vmov	s2, r0
   0x001b45f4 <+424>:	vadd.f32	s10, s6, s17
   0x001b45f8 <+428>:	vmov.f32	s6, s19
   0x001b45fc <+432>:	vcmpe.f32	s10, #0.0
   0x001b4600 <+436>:	vmrs	APSR_nzcv, fpscr
   0x001b4604 <+440>:	ble	0x1b4614 <Physics::Ship::HandleCollisionsWithSeaFloor(GameParameters const&)+456>
   0x001b4608 <+444>:	vsqrt.f32	s8, s10
   0x001b460c <+448>:	vdiv.f32	s6, s21, s8
   0x001b4610 <+452>:	vdiv.f32	s8, s4, s8
   0x001b4614 <+456>:	vmul.f32	s4, s6, s2
   0x001b4618 <+460>:	vmul.f32	s10, s8, s0
   0x001b461c <+464>:	vadd.f32	s4, s10, s4
   0x001b4620 <+468>:	vcmpe.f32	s4, #0.0
   0x001b4624 <+472>:	vmrs	APSR_nzcv, fpscr
   0x001b4628 <+476>:	ble	0x1b46e8 <Physics::Ship::HandleCollisionsWithSeaFloor(GameParameters const&)+668>
   0x001b462c <+480>:	vmul.f32	s6, s4, s6
   0x001b4630 <+484>:	vmul.f32	s4, s4, s8
   0x001b4634 <+488>:	vmul.f32	s10, s0, s16
   0x001b4638 <+492>:	vmul.f32	s8, s2, s16
   0x001b463c <+496>:	vsub.f32	s2, s2, s6
   0x001b4640 <+500>:	vsub.f32	s0, s0, s4
   0x001b4644 <+504>:	vsub.f32	s10, s23, s10
   0x001b4648 <+508>:	vmul.f32	s4, s4, s20
   0x001b464c <+512>:	vmul.f32	s6, s6, s20
   0x001b4650 <+516>:	vsub.f32	s8, s25, s8
   0x001b4654 <+520>:	vmul.f32	s2, s2, s18
   0x001b4658 <+524>:	vmul.f32	s0, s0, s18
   0x001b465c <+528>:	vmov	r0, s10
   0x001b4660 <+532>:	vmov	r1, s8
   0x001b4664 <+536>:	vadd.f32	s2, s2, s6
   0x001b4668 <+540>:	vadd.f32	s0, s0, s4
   0x001b466c <+544>:	lsr	r6, r0, #16
   0x001b4670 <+548>:	strb	r6, [r4, #2]
   0x001b4674 <+552>:	lsr	r3, r0, #24
   0x001b4678 <+556>:	vmov	r5, s2
   0x001b467c <+560>:	vmov	r6, s0
   0x001b4680 <+564>:	lsr	lr, r0, #8
   0x001b4684 <+568>:	strb	r0, [r4]
   0x001b4688 <+572>:	strb	r3, [r4, #3]
   0x001b468c <+576>:	strb	lr, [r4, #1]
   0x001b4690 <+580>:	lsr	r12, r1, #24
   0x001b4694 <+584>:	strb	r1, [r4, #4]!
   0x001b4698 <+588>:	lsr	r0, r1, #16
   0x001b469c <+592>:	lsr	r9, r1, #8
   0x001b46a0 <+596>:	strb	r0, [r4, #2]
   0x001b46a4 <+600>:	strb	r12, [r4, #3]
   0x001b46a8 <+604>:	strb	r9, [r4, #1]
   0x001b46ac <+608>:	lsr	r1, r6, #24
   0x001b46b0 <+612>:	lsr	r0, r6, #16
   0x001b46b4 <+616>:	lsr	r3, r6, #8
   0x001b46b8 <+620>:	strb	r6, [r2]
   0x001b46bc <+624>:	mov	r6, r2
   0x001b46c0 <+628>:	lsr	lr, r5, #8
   0x001b46c4 <+632>:	lsr	r8, r5, #16
   0x001b46c8 <+636>:	strb	r5, [r6, #4]!
   0x001b46cc <+640>:	lsr	r5, r5, #24
   0x001b46d0 <+644>:	strb	r5, [r6, #3]
   0x001b46d4 <+648>:	strb	r8, [r6, #2]
   0x001b46d8 <+652>:	strb	lr, [r2, #5]
   0x001b46dc <+656>:	strb	r1, [r2, #3]
   0x001b46e0 <+660>:	strb	r0, [r2, #2]
   0x001b46e4 <+664>:	strb	r3, [r2, #1]
   0x001b46e8 <+668>:	subs	r10, r10, #1
   0x001b46ec <+672>:	add	r7, r7, #8
   0x001b46f0 <+676>:	bne	0x1b44b4 <Physics::Ship::HandleCollisionsWithSeaFloor(GameParameters const&)+104>
   0x001b46f4 <+680>:	sub	sp, r11, #96	; 0x60
   0x001b46f8 <+684>:	vpop	{d8-d15}
   0x001b46fc <+688>:	add	sp, sp, #4
   0x001b4700 <+692>:	pop	{r4, r5, r6, r7, r8, r9, r10, r11, pc}
   0x001b4704 <+696>:	bcc	0xd65404
   0x001b4708 <+700>:	svccc	0x00800000
   0x001b470c <+704>:	ldrgt	r4, [r12, #-0]
   0x001b4710 <+708>:	ldrmi	r4, [r12, #-0]
   0x001b4714 <+712>:	mrccc	7, 6, r11, cr1, cr7, {0}
   0x001b4718 <+716>:	strmi	r0, [r0], #0
   0x001b471c <+720>:	strmi	r0, [r0], #34	; 0x22
   0x001b4720 <+724>:	ldmcc	r1, {r0, r1, r2, r4, r8, r9, r10, r12, sp, pc}^
   0x001b4724 <+728>:	andeq	r0, r0, r0
   0x001b4728 <+732>:	stclt	7, cr13, [r3], #-40	; 0xffffffd8
End of assembler dump.

Gabriele Giuseppini · Answer 12 · Mon Jul 27 2020 07:08:59 GMT+0800 (China Standard Time)

Thank you! That made it clear that the compilation of FastTruncateXXX() was done by invoking an ARM runtime function (__aeabi_f2lz), which is equivalent to casting a float to an intxxx - it looks like there's no assembly instruction in ARM to convert a float to an integer directly. I'll implement them like this for ARM then.
I'll then write a C++ implementation of the lighting diffusion algorithm to be used in the ARM case, after which you won't need simde anymore. Stay tuned.

Gabriele Giuseppini · Answer 13 · Mon Jul 27 2020 20:54:15 GMT+0800 (China Standard Time)

Hello, I've just pushed my last change on this matter to master. You may now remove simde completely, pull clean from master, and build. Let me know whether you have any compile-time issues or not; if everything's fine I'll close the ticket.

Don Flymoor · Answer 14 · Mon Jul 27 2020 21:41:21 GMT+0800 (China Standard Time)

It looks like it didn't quite work...

/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:43:5: error: unknown type name '__m128'
    __m128 const Zero = _mm_setzero_ps();
    ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:43:12: error: expected unqualified-id
    __m128 const Zero = _mm_setzero_ps();
           ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:45:5: error: unknown type name '__m128'
    __m128 _v = _mm_castpd_ps(_mm_load_sd(reinterpret_cast<double const * restrict>(&v)));
    ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:45:31: error: use of undeclared identifier '_mm_load_sd'
    __m128 _v = _mm_castpd_ps(_mm_load_sd(reinterpret_cast<double const * restrict>(&v)));
                              ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:46:5: error: unknown type name '__m128'
    __m128 _l = _mm_set1_ps(length);
    ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:46:17: error: use of undeclared identifier '_mm_set1_ps'
    __m128 _l = _mm_set1_ps(length);
                ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:47:5: error: unknown type name '__m128'
    __m128 _r = _mm_div_ps(_v, _l);
    ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:48:5: error: unknown type name '__m128'
    __m128 validMask = _mm_cmpneq_ps(_l, Zero);
    ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:48:42: error: use of undeclared identifier 'Zero'
    __m128 validMask = _mm_cmpneq_ps(_l, Zero);
                                         ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:89:38: error: no member named 'NormalizeVector2' in namespace 'Algorithms'
            vec2f norm = Algorithms::NormalizeVector2(vectors[i], lengths[i]);
                         ~~~~~~~~~~~~^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:157:38: error: no member named 'NormalizeVector2' in namespace 'Algorithms'
            vec2f norm = Algorithms::NormalizeVector2(vectors[i], lengths[i]);
                         ~~~~~~~~~~~~^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:224:38: error: no member named 'NormalizeVector2' in namespace 'Algorithms'
            vec2f norm = Algorithms::NormalizeVector2(vectors[i]);
                         ~~~~~~~~~~~~^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:266:38: error: no member named 'NormalizeVector2' in namespace 'Algorithms'
            vec2f norm = Algorithms::NormalizeVector2(vectors[i]);
                         ~~~~~~~~~~~~^
                            ~~~~~~~~~~~~^                                                                                                                                                                                                       /path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:157:38: error: no member named 'NormalizeVector2' in namespace 'Algorithms'                    vec2f norm = Algorithms::NormalizeVector2(vectors[i], lengths[i]);                                                                                                                                                                                        ~~~~~~~~~~~~^                                                                                                                                                                                                       /path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:224:38: error: no member named 'NormalizeVector2' in namespace 'Algorithms'                    vec2f norm = Algorithms::NormalizeVector2(vectors[i]);                                                                                                                                                                                                    ~~~~~~~~~~~~^                                                                                                                                                                                                       /path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:266:38: error: no member named 'NormalizeVector2' in namespace 'Algorithms'                    vec2f norm = Algorithms::NormalizeVector2(vectors[i]);                                                                                                                                                                                                    ~~~~~~~~~~~~^                             
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:43:5: error: unknown type name '__m128'
    __m128 const Zero = _mm_setzero_ps();
    ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:43:12: error: expected unqualified-id
    __m128 const Zero = _mm_setzero_ps();
           ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:45:5: error: unknown type name '__m128'
    __m128 _v = _mm_castpd_ps(_mm_load_sd(reinterpret_cast<double const * restrict>(&v)));
    ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:45:31: error: use of undeclared identifier '_mm_load_sd'
    __m128 _v = _mm_castpd_ps(_mm_load_sd(reinterpret_cast<double const * restrict>(&v)));
                              ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:46:5: error: unknown type name '__m128'
    __m128 _l = _mm_set1_ps(length);
    ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:46:17: error: use of undeclared identifier '_mm_set1_ps'
    __m128 _l = _mm_set1_ps(length);
                ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:47:5: error: unknown type name '__m128'
    __m128 _r = _mm_div_ps(_v, _l);
    ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:48:5: error: unknown type name '__m128'
    __m128 validMask = _mm_cmpneq_ps(_l, Zero);
    ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:48:42: error: use of undeclared identifier 'Zero'
    __m128 validMask = _mm_cmpneq_ps(_l, Zero);
                                         ^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:89:38: error: no member named 'NormalizeVector2' in namespace 'Algorithms'
            vec2f norm = Algorithms::NormalizeVector2(vectors[i], lengths[i]);
                         ~~~~~~~~~~~~^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:157:38: error: no member named 'NormalizeVector2' in namespace 'Algorithms'
            vec2f norm = Algorithms::NormalizeVector2(vectors[i], lengths[i]);
                         ~~~~~~~~~~~~^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:224:38: error: no member named 'NormalizeVector2' in namespace 'Algorithms'
            vec2f norm = Algorithms::NormalizeVector2(vectors[i]);
                         ~~~~~~~~~~~~^
/path/to/Floating-Sandbox/Benchmarks/SingleVectorNormalization.cpp:266:38: error: no member named 'NormalizeVector2' in namespace 'Algorithms'
            vec2f norm = Algorithms::NormalizeVector2(vectors[i]);

I think having a "static" version of SIMDE may be the best choice (that is, the files from SIMDE would be copied into the project rather than pulled in as a submodule), as that would be the easiest way to provide intrinsics on x86 as well as on ARM. (All SIMDE does is link x86 functions with the corresponding ARM functions, and vice versa.)

I would recommend reading the README of SIMDE to get an idea of what it does, how to use it, and whether it's even needed.

Gabriele Giuseppini · Answer 15 · Tue Jul 28 2020 02:00:03 GMT+0800 (China Standard Time)

Oh sure, the "benchmarks" project, I forgot about that. I'll take care of it sooner or later; for the time being you may exclude it from the build by editing the root CMakeLists.txt file.

Regarding SIMDE, yes, I know how it works. In my opinion though it's pointless to write code with intrinsics for a platform when these have to be translated to another platform, it defies the whole purpose. The intrinsics I'm using on Intel use SSE-specific instructions to speed up computations, and the code that uses them is structured specifically to tailor the intrinsics (e.g. ability to make vector results based on logical conditions); ARM (NEON) would have its own specific instructions which, if used, would require a different code structure.
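As a minimal sketch of what "vector results based on logical conditions" can look like, the following is loosely modelled on the SSE lines quoted in the compiler errors above (a division by the length, then a mask built by comparing the length against zero). The name normalize_or_zero, the final AND with the mask, and the scalar fallback are assumptions made for illustration; this is not FS's actual code.

```cpp
#include <array>

#if defined(__SSE2__) || defined(_M_X64)
#  include <emmintrin.h>  // SSE2 header; also provides the SSE float intrinsics
#  define SKETCH_NORM_SSE 1
#endif

// Hypothetical helper: returns v / length, or (0, 0) when length is zero.
std::array<float, 2> normalize_or_zero(const std::array<float, 2>& v, float length)
{
#if defined(SKETCH_NORM_SSE)
    __m128 const zero = _mm_setzero_ps();
    __m128 vv = _mm_setr_ps(v[0], v[1], 0.0f, 0.0f);
    __m128 ll = _mm_set1_ps(length);
    __m128 rr = _mm_div_ps(vv, ll);              // lanes become inf/NaN when length == 0
    __m128 validMask = _mm_cmpneq_ps(ll, zero);  // all bits set where length != 0
    rr = _mm_and_ps(rr, validMask);              // branchless: invalid lanes become 0
    alignas(16) float out[4];
    _mm_store_ps(out, rr);
    return {out[0], out[1]};
#else
    // Portable fallback: the same result, expressed with a branch.
    if (length == 0.0f) return {0.0f, 0.0f};
    return {v[0] / length, v[1] / length};
#endif
}
```

The SSE path has no branch: the comparison produces a per-lane bit mask that is then applied to the result. This is the kind of structure that a straight translation to another instruction set may or may not map well onto.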

Can you let me know how it goes after commenting out line 222 in CMakeLists.txt?

Gabriele Giuseppini · Answer 16 · Tue Jul 28 2020 02:47:58 GMT+0800 (China Standard Time)

FYI, I've pushed a fix to the Benchmark project itself, you may avoid having to modify CMakeLists.txt now. Let me know how it goes this time.

Don Flymoor · Answer 17 · Tue Jul 28 2020 05:34:38 GMT+0800 (China Standard Time)

Regarding SIMDE, yes, I know how it works. In my opinion though it's pointless to write code with intrinsics for a platform when these have to be translated to another platform, it defies the whole purpose. The intrinsics I'm using on Intel use SSE-specific instructions to speed up computations, and the code that uses them is structured specifically to tailor the intrinsics (e.g. ability to make vector results based on logical conditions); ARM (NEON) would have its own specific instructions which, if used, would require a different code structure.
...

Since SIMDE translates INTEL intrinsics to ARM intrinsics, ARM users would benefit from SIMDE, as opposed to plain C++. For instance, let's take a function that calculates square roots, called _mm_sqrt_pd:

Plain C++:

std::sqrt(x);

Execution time: 1 sec

INTEL intrinsics:

extern __m128d _mm_sqrt_pd(__m128d x);

Execution time: 0.5 sec

ARM intrinsics (AArch64):

float64x2_t vsqrtq_f64(float64x2_t x)

Execution time: 0.5 sec

(I'm making up the times, as I haven't compiled any of this.)

SIMDE would map _mm_sqrt_pd(__m128d x) onto vsqrtq_f64(float64x2_t x) without the developer having to call vsqrtq_f64 explicitly, thus making life easier for the person porting INTEL intrinsics to ARM. (SIMDE also does ARM to INTEL, if you want to test ARM intrinsics on an INTEL platform.)

Since intrinsics are just a way to avoid the overhead of function calls, a translator like SIMDE could help with porting intrinsics to ARM.

Of course, as you pointed out, some functions written with INTEL intrinsics can't just be translated to ARM intrinsics, so it's not useful in all cases.
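To make the square root comparison concrete, here is a small sketch that picks an implementation at compile time. It does by hand the kind of mapping a translation layer would otherwise provide. The helper name sqrt2 and the macro names are hypothetical, and the choice of vsqrtq_f64 as the AArch64 counterpart of _mm_sqrt_pd is an assumption of this sketch.

```cpp
#include <array>
#include <cmath>

#if defined(__SSE2__) || defined(_M_X64)
#  include <emmintrin.h>   // SSE2: _mm_sqrt_pd
#  define SKETCH_SQRT_SSE2 1
#elif defined(__aarch64__)
#  include <arm_neon.h>    // AArch64 NEON: vsqrtq_f64
#  define SKETCH_SQRT_NEON64 1
#endif

// Hypothetical helper: square roots of two doubles at once.
std::array<double, 2> sqrt2(const std::array<double, 2>& x)
{
    std::array<double, 2> out{};
#if defined(SKETCH_SQRT_SSE2)
    _mm_storeu_pd(out.data(), _mm_sqrt_pd(_mm_loadu_pd(x.data())));
#elif defined(SKETCH_SQRT_NEON64)
    vst1q_f64(out.data(), vsqrtq_f64(vld1q_f64(x.data())));
#else
    // Plain C++ fallback for any other target.
    out = {std::sqrt(x[0]), std::sqrt(x[1])};
#endif
    return out;
}
```

For a one-to-one operation like this, the three branches are interchangeable, which is the case where a translator is most likely to pay off.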

Don Flymoor · Answer 18 · Tue Jul 28 2020 05:58:43 GMT+0800 (China Standard Time)

I pulled and recompiled, success!

Gabriele Giuseppini · Answer 19 · Tue Jul 28 2020 06:08:23 GMT+0800 (China Standard Time)

You definitely have a point there. The sqrt example makes sense as almost every processor supports native instructions for basic mathematical functions. The point is that there are other instructions - mostly the ones that deal with multiple lanes in a floating point register - that dictate the structure of the code, and if these instructions are not present in another architecture, this structure of the code might even make performance worse.
I'll make an example: Intel has an instruction that rotates the 4 lanes of a 128-bit floating point register. Given that this instruction exists and it's quite fast, I've written the light diffusion algorithm - which essentially needs to calculate a function N x M times (with N the number of particles and M the number of lamps) - in a manner that calculates the function on 4 particles against 4 lamps in a single go - particle 1 against lamp 1, particle 2 against lamp 2, and so on via plain SIMD vector math - then it rotates the lamps and recalculates the function, and does this 4 times in total, achieving a 4x4 calculation in just 4 iterations. On an architecture where rotation is not supported, any attempt at simulating the rotation will introduce a penalty that will far exceed the cost of a simpler implementation of this algorithm, one that simply iterates through particles and lamps. My current light diffusion algorithm is faster on Intel only because of the rotation operator, not because of the square root. This is just an example, I have no idea whether NEON supports rotation - probably it does, but that's not the point.
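The rotation scheme described above can be sketched in portable C++, with a 4-element array standing in for a 128-bit register. The names Lanes, pair_term and accumulate_4x4 are hypothetical, and pair_term is only a placeholder, since the actual light diffusion function is not given in this thread.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>

using Lanes = std::array<float, 4>;  // stand-in for one 4-lane SIMD register

// Placeholder for the per-pair function (assumption: the real formula differs).
inline float pair_term(float particle, float lamp) { return particle * lamp; }

// For each of 4 particles, accumulates pair_term over all 4 lamps using
// 4 lane-wise passes, rotating the lamps by one lane between passes.
Lanes accumulate_4x4(const Lanes& particles, Lanes lamps)
{
    Lanes acc{};
    for (int pass = 0; pass < 4; ++pass)
    {
        // Lane-wise step: lane i pairs particle i with the lamp currently in lane i.
        for (std::size_t i = 0; i < 4; ++i)
            acc[i] += pair_term(particles[i], lamps[i]);

        // Rotate the lamps left by one lane; with SIMD this would be one shuffle.
        std::rotate(lamps.begin(), lamps.begin() + 1, lamps.end());
    }
    return acc;
}
```

After four passes every particle has met every lamp exactly once, so 16 pair evaluations are covered by 4 vector-wide steps plus the rotations.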

Conversely, another architecture might offer operators that make it more convenient to structure the calculation in a different way. If that architecture for example supported an operator that calculates the length of a vector (like some GPUs for example, rather than calculating squares, adding them, doing a sqrt, and a division), then I'd write the diffusion algorithm for it differently, trying to take advantage of the operator as much as possible.

I'm not saying that I won't use NEON intrinsics; I will definitely look into them sooner or later, once a bottleneck worth investigating is detected - the spring relaxation algorithm, for example. At this moment, light diffusion is definitely not the bottleneck, and the possible perf gain from SIMDE translating FS's intrinsics is not worth, in my opinion, the hassle of taking on yet another dependency. FS already requires too many dependencies, so many that I'm thinking of getting rid of at least one in the near future.

Gabriele Giuseppini · Answer 20 · Tue Jul 28 2020 06:33:25 GMT+0800 (China Standard Time)

BTW, I was going through your NEON links, they're quite useful, thanks! It's quite hard to code with intrinsics without having the processor at hand :-) Do you know of an emulator I may use in a Linux or Windows environment?

Don Flymoor · Answer 21 · Tue Jul 28 2020 21:25:09 GMT+0800 (China Standard Time)

Thanks! I really don't have much experience with NEON, but I would be happy to try to "intrinsic-ise" the relaxation algorithm for ARM.

I know of several good emulators for Windows and Linux (But only one will do ARM):

Windows:

QEMU, an extremely powerful emulator, but much more difficult to use (no default GUI). Here's an old guide for running Raspberry Pi OS (as Raspbian is called now) on Windows.

Linux:

QEMU, an extremely powerful emulator, but much more difficult to use (no default GUI). Linux setup guide for ARM on Intel. For a Raspberry Pi emulation on Linux (keep in mind, you will need to compile wxWidgets from source for FS on Raspberry Pi OS). You could even use the Windows 10 ARM version with QEMU.

Other options:

SIMDE everywhere! Just include the ARM headers, and ARM functions will be translated to INTEL functions. Not technically an emulator, but according to SIMDE you can do that for developing ARM intrinsics on INTEL. This would be the easiest option.
Get a Raspberry Pi! They don't cost much; however, they're not free, so the cost could still be prohibitive (around $100, for the power, case, fan, SD card, micro HDMI to HDMI...). If you do get one, the Pi 4 4GB is the way to go (also, don't skimp on the SD card. That is the hard drive, and a slow one will slow the system).

Gabriele Giuseppini · Answer 22 · Fri Jul 31 2020 17:15:25 GMT+0800 (China Standard Time)

You prickled me, I might just buy a Raspberry :-) First I want FS to run spotless on X-Windows though, so I'll wait for my new box to arrive so that I can run a Linux VM on it.

Don Flymoor · Answer 23 · Fri Jul 31 2020 21:36:21 GMT+0800 (China Standard Time)

You may already have the processing power to run Raspbian (or Raspbian Lite); Raspbian is much less resource intensive than other distributions, like Ubuntu or OpenSUSE.

If you want to give ARM intrinsics a go on your Windows box, put simde in your working directory, and these lines of code in your program:

#define SIMDE_ENABLE_NATIVE_ALIASES
#include "simde/arm/neon.h"

After that, ARM intrinsic functions will work on INTEL, as they would on ARM.
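A hedged sketch of how those two lines might be used in practice: the function below is written against NEON names only and relies on SIMDE to supply them when the header is found. The helper add4, the macro SKETCH_HAVE_NEON_API, and the __has_include guard with its scalar fallback are illustrative assumptions, added so that the file still builds when simde is not in the include path.

```cpp
#include <array>
#include <cstddef>

// Assumption: SIMDE's headers are reachable as "simde/arm/neon.h", as in the lines above.
#if defined(__has_include)
#  if __has_include("simde/arm/neon.h")
#    define SIMDE_ENABLE_NATIVE_ALIASES
#    include "simde/arm/neon.h"
#    define SKETCH_HAVE_NEON_API 1
#  endif
#endif

// Hypothetical helper: lane-wise sum of two groups of four floats.
std::array<float, 4> add4(const std::array<float, 4>& a, const std::array<float, 4>& b)
{
    std::array<float, 4> out{};
#if defined(SKETCH_HAVE_NEON_API)
    // NEON spelling; with the native aliases enabled this is expected to build on x86 too.
    float32x4_t va = vld1q_f32(a.data());
    float32x4_t vb = vld1q_f32(b.data());
    vst1q_f32(out.data(), vaddq_f32(va, vb));
#else
    // Scalar fallback when the header is not available.
    for (std::size_t i = 0; i < 4; ++i)
        out[i] = a[i] + b[i];
#endif
    return out;
}
```

Calling add4({1, 2, 3, 4}, {10, 20, 30, 40}) should give {11, 22, 33, 44} on either path, which makes it a quick check that the header was picked up and the aliases resolve.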

Evan Nemerson · Answer 24 · Wed Aug 05 2020 10:46:13 GMT+0800 (China Standard Time)

It looks like it didn't quite work...

That looks like you forgot to define SIMDE_ENABLE_NATIVE_ALIASES.

I'll make an example: Intel has an instruction that rotates the 4 lanes of a 128-bit floating point register. Given that this instruction exists and it's quite fast

That's actually pretty fast on most CPUs; it's just a shuffle. I guess you're talking about _mm_alignr_epi8, which isn't really a rotate unless you pass the same value for both arguments, but on AArch64 it's just a vextq_f32 (or u32, or s32, depending on the type). On AltiVec/VSX it's a vec_perm, on WASM wasm_v32x4_shuffle, etc. If all you care about is GCC and clang (and clang-derived) compilers, __builtin_shuffle (GCC) or __builtin_shufflevector (clang) will do it on any architecture the compiler supports.

I agree with your point, just thought you may be interested since it sounds like it's pretty important to you.
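As a small illustration of the point about rotation being a shuffle, here is a sketch of a one-lane rotate of four floats with an x86 path, a NEON path, and a portable fallback. The helper name rotate_left_1 and the macro names are hypothetical. Note that the x86 branch uses _mm_shuffle_ps rather than _mm_alignr_epi8, a substitution made here so that only SSE is required.

```cpp
#include <algorithm>
#include <array>

#if defined(__SSE__) || defined(_M_X64)
#  include <xmmintrin.h>   // SSE: _mm_shuffle_ps
#  define SKETCH_ROT_SSE 1
#elif defined(__ARM_NEON)
#  include <arm_neon.h>    // NEON: vextq_f32
#  define SKETCH_ROT_NEON 1
#endif

// Hypothetical helper: {a0, a1, a2, a3} -> {a1, a2, a3, a0}.
std::array<float, 4> rotate_left_1(const std::array<float, 4>& in)
{
    std::array<float, 4> out = in;
#if defined(SKETCH_ROT_SSE)
    __m128 v = _mm_loadu_ps(in.data());
    // Select lanes 1, 2, 3, 0 of v (both operands are the same register).
    v = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 3, 2, 1));
    _mm_storeu_ps(out.data(), v);
#elif defined(SKETCH_ROT_NEON)
    float32x4_t v = vld1q_f32(in.data());
    // Extract 4 lanes starting at lane 1 of the concatenation v:v.
    v = vextq_f32(v, v, 1);
    vst1q_f32(out.data(), v);
#else
    std::rotate(out.begin(), out.begin() + 1, out.end());
#endif
    return out;
}
```

Both SIMD branches are a single instruction on a register, which is why the rotate itself is unlikely to be the expensive part on either architecture.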

x86 tends to have lots of really powerful, very specific functions; my favorite example is _mm_maddubs_epi16, but lots of huge gaps in their API where many operations are only supported for a few types (including almost no instructions which operate on unsigned types). NEON (and to a lesser extent AltiVec) tend to mostly support simpler operations, but what they do support works for all types.

The good news is that x86 functions can usually be implemented pretty efficiently by composing a couple NEON functions. Going in the other direction can be more complicated.

This is just an example, I have no idea whether NEON supports rotation - probably it does, but that's not the point.

Just wanted to be clear that I saw this :)

Do you know of an emulator I may use in a Linux or Windows environment?

If you're on Debian (or a Debian-derivative like Ubuntu), they actually have awesome support for cross-compiling and emulation, I'd highly recommend that. There are some links in SIMDe's wiki. Once you get that set up you basically just use a different compiler (GCC) or add some flags (clang) and you can compile and run Arm code pretty much the same as x86 code much faster than an RPi (depending on your CPU, of course).

SIMDe also has a Docker-based development environment based on Debian with loads of emulators and compilers set up which you could probably adapt to this project.

Hope that helps. This seems like a kinda noisy issue so I'm not going to subscribe, but feel free to @ me if you have any questions.

Gabriele Giuseppini · Answer 25 · Wed Aug 05 2020 17:11:50 GMT+0800 (China Standard Time)

Hey, thanks so much for chipping in, and for the useful pointers!!!

My strategy will be to provide NEON-specific (and any other architecture-specific) implementations as I go, after installing VMs/emulators support for them on my dev box and verifying with profiling where the bottlenecks really are. I really like to dive into the details of the architectures, I don't mind implementing the same function N times if each time I may take advantage of architecture-specific SIMD tricks :-)