giordano / julia-on-fugaku

Process affinities in MPI point-to-point benchmark

carstenbauer opened this issue

Maybe I missed it, but you don't seem to explicitly specify the affinity of the Julia processes (i.e. each MPI rank) in, say, the MPI ping-pong benchmark (neither here nor in MPIBenchmarks.jl). Do you rely on the job scheduler to take care of that? More importantly, between which CPU cores do you actually perform the ping-pong measurement?
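For concreteness, by ping-pong measurement I mean something along the lines of this minimal MPI.jl sketch (not the MPIBenchmarks.jl implementation; message size, iteration count, and the missing warm-up are arbitrary simplifications):

```julia
# Toy ping pong with MPI.jl (not the MPIBenchmarks.jl implementation):
# rank 0 sends a buffer to rank 1 and waits for it to come back,
# timing the round trips.  Message size and iteration count are arbitrary.
using MPI

MPI.Init()
comm = MPI.COMM_WORLD
rank = MPI.Comm_rank(comm)

buf   = zeros(UInt8, 1024)   # 1 KiB message (arbitrary)
iters = 1_000                # number of round trips (arbitrary, no warm-up)

MPI.Barrier(comm)
t0 = MPI.Wtime()
for _ in 1:iters
    if rank == 0
        MPI.Send(buf, comm; dest=1, tag=0)
        MPI.Recv!(buf, comm; source=1, tag=0)
    elseif rank == 1
        MPI.Recv!(buf, comm; source=0, tag=0)
        MPI.Send(buf, comm; dest=0, tag=0)
    end
end
t1 = MPI.Wtime()

rank == 0 && println("average one-way latency: ", (t1 - t0) / (2iters) * 1e6, " μs")
MPI.Finalize()
```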

In ThreadPinning.jl I have implemented a somewhat similar latency measurement for multithreading and it nicely reveals the socket structure of a Noctua 1 node. On Fugaku, it would be interesting to run the same (both for threads and MPI) and see how the NUMA structure manifests itself.
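The thread version is conceptually the same ping pong between two pinned threads, repeated over all core pairs. A stripped-down sketch (not the actual benchmark code; it assumes Julia was started with at least two threads, and with ThreadPinning.jl one would pin the two threads to the core pair of interest, e.g. something like pinthreads([i, j]), before each measurement):

```julia
# Toy thread-to-thread ping pong (not the actual benchmark code).
# Assumes Julia was started with at least two threads (`julia -t 2`);
# with only one thread the spin loops below would deadlock.
using Base.Threads

function threaded_pingpong(; iters = 10_000)
    flag = Atomic{Int}(0)
    t0 = time_ns()
    @threads :static for tid in 1:2
        if tid == 1                     # "ping" side
            for _ in 1:iters
                flag[] = 1              # hand the flag to the other thread
                while flag[] != 0 end   # spin until it comes back
            end
        else                            # "pong" side
            for _ in 1:iters
                while flag[] != 1 end   # wait for the ping
                flag[] = 0              # send it back
            end
        end
    end
    return (time_ns() - t0) / (2 * iters)   # average one-way latency in ns
end
```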

Maybe I missed it, but you don't seem to explicitly specify the affinity of the Julia processes (i.e. each MPI rank) in, say, the MPI ping-pong benchmark (neither here nor in MPIBenchmarks.jl).

I'm using the flag --mpi "max-proc-per-node=1"

#PJM --mpi "max-proc-per-node=1" # Upper limit of number of MPI process created at 1 node
to force a single rank per node, so that the two processes talk across different nodes (I think I did try running two ranks on the same node and, as expected, throughput was much higher).
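For context, that directive sits in a job script roughly like the following; the node count is the relevant part here, while the resource group, elapse time, and julia invocation are placeholders rather than the exact script I used:

```sh
#!/bin/bash
#PJM -L "node=2"                  # two nodes in total
#PJM -L "rscgrp=small"            # resource group (placeholder)
#PJM -L "elapse=00:10:00"         # walltime (placeholder)
#PJM --mpi "max-proc-per-node=1"  # at most one MPI process per node

mpiexec -n 2 julia --project pingpong.jl   # placeholder invocation
```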

Do you rely on the job scheduler to take care of that? More importantly, between which CPU cores do you actually perform the ping-pong measurement?

To answer the first question: in short, yes. My understanding from a few experiments in the past is that by default the Fujitsu scheduler spreads the ranks evenly across multiple nodes even if you don't specify max-proc-per-node, but I set that option explicitly for good measure. I tried to follow the job scripts used in the Riken benchmarks faithfully.

Do you know of a Julia way to pin the processes to the different nodes?

In ThreadPinning.jl I have implemented a somewhat similar latency measurement for multithreading and it nicely reveals the socket structure of a Noctua 1 node. On Fugaku, it would be interesting to run the same (both for threads and MPI) and see how the NUMA structure manifests itself.

Here it is (this is again on Ookami, but it shouldn't be much different on Fugaku): https://github.com/giordano/julia-on-fugaku/blob/cd5795aa746ec83286dc5d82aefdde50c56f74a3/benchmarks/bandwidthbenchmarkjl/latencies.pdf. But it's a bit different from what you got; is this what you'd expect?

so that the two processes talk across different nodes

Ah, that makes sense. Somehow I thought it was an intra-node benchmark. My bad.

Do you know of a Julia way to pin the processes to the different nodes?

No, and I don't think there is (or can be) one. (Again, I was thinking intra-node; in that case there are a bunch of ways, for example using ThreadPinning.jl on each MPI rank to pin itself to the desired cores based on the rank id, roughly as sketched below.)
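Something like this, as a sketch only (it assumes pinthreads accepts an explicit list of CPU IDs; the core numbering and the 12 cores per rank are illustrative and would have to match the actual A64FX topology):

```julia
# Hypothetical intra-node setup: pin each (multi-threaded) MPI rank to its
# own block of cores based on the rank id.  The core numbering and the
# 12 cores per rank (one A64FX CMG / NUMA domain) are illustrative only.
using MPI, ThreadPinning

MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)

cores_per_rank = 12
first_core = rank * cores_per_rank
pinthreads(first_core .+ (0:Threads.nthreads()-1))  # pin this rank's threads
```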

But it's a bit different from what you got; is this what you'd expect?

Yes, I think so. It nicely shows the four-fold structure due to the 4 NUMA domains (and that communication within a NUMA domain is fastest). Perhaps I would have expected a hierarchy of speed levels for inter-NUMA communication, but 1) I would have to look at the physical topology of the chip's units, and 2) maybe this requires fine-tuning of the benchmark and is overshadowed by the much faster intra-NUMA latencies (which set the color scale).

Yes, I think so. It nicely shows the four-fold structure due to the 4 NUMA domains (and that communication within a NUMA domain is fastest).

To be clear, to count the NUMA regions you're just looking at the low-latency diagonal boxes, right? For some reason I was initially trying to interpret the off-diagonal boxes, which, however, have a more complex pattern, and in hindsight (and after sleeping on it) I'm not really sure that makes much sense 😅

To be clear, to count the NUMA regions you're just looking at the low-latency diagonal boxes, right?

Well, you see the four-fold structure not just on the diagonal but also in the x and y directions (although it is less pronounced between the off-diagonal blocks). But yes. Of course, it doesn't tell you that these are NUMA domains; in principle they could also be different sockets (as in the Noctua 1 example in the ThreadPinning docs linked above) or something else.

For some reason I was initially trying to interpret the off-diagonal boxes, which, however, have a more complex pattern, and in hindsight (and after sleeping on it) I'm not really sure that makes much sense 😅

While diagonal (intra-NUMA) vs off-diagonal (inter-NUMA) is the dominant feature, the off-diagonal boxes are also somewhat interesting. As I said above, one might (or might not) be able to find a minor hierarchy of latencies (essentially asking whether all inter-NUMA connections are equal or whether there are subtle differences). But one would probably have to work much harder and improve the benchmark quality. After all, the output matrix should be symmetric, which it isn't (although it is "almost").
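If one wanted to dig into that, a rough sketch of the kind of post-processing I have in mind (the latency matrix lat and the 12 cores per NUMA domain are placeholders):

```julia
# Rough post-processing sketch.  `lat` is a hypothetical core-to-core
# latency matrix; 12 cores per NUMA domain is a placeholder value.
using Statistics

function numa_block_means(lat::AbstractMatrix; cores_per_numa = 12)
    sym = (lat .+ lat') ./ 2                  # enforce the expected symmetry
    n = size(lat, 1) ÷ cores_per_numa         # number of NUMA domains
    block(i) = (i-1)*cores_per_numa+1 : i*cores_per_numa
    # Mean latency for every NUMA-domain pair: roughly equal off-diagonal
    # entries would mean all inter-NUMA connections look alike, while
    # differences would hint at a hierarchy.
    [mean(sym[block(i), block(j)]) for i in 1:n, j in 1:n]
end
```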