Intel MPI Benchmarks (IMB-MPI1) performance issue? with EFA and Rocky Linux 8 custom image
panda1100 opened this issue · comments
Required Info:
- AWS ParallelCluster version [e.g. 3.1.1]: 3.8.0
- Full cluster configuration without any credentials or personal data:
  - cluster configuration: https://rpa.st/AAJQ
  - whole procedure: https://ciq.com/blog/how-to-use-aws-parallelcluster-3-8-0-with-rocky-linux-8/
- Cluster name: rocky8-cluster
- Output of the pcluster describe-cluster command:
{
  "creationTime": "2023-12-20T16:19:01.897Z",
  "headNode": {
    "launchTime": "2023-12-20T16:23:42.000Z",
    "instanceId": "i-******",
    "publicIpAddress": "***.***.***.***",
    "instanceType": "t2.xlarge",
    "state": "running",
    "privateIpAddress": "10.0.0.230"
  },
  "version": "3.8.0",
  "clusterConfiguration": {
    "url": "******"
  },
  "tags": [
    {
      "value": "3.8.0",
      "key": "parallelcluster:version"
    },
    {
      "value": "rocky8-cluster",
      "key": "parallelcluster:cluster-name"
    }
  ],
  "cloudFormationStackStatus": "CREATE_COMPLETE",
  "clusterName": "rocky8-cluster",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:ap-northeast-1:******:stack/rocky8-cluster/******",
  "lastUpdatedTime": "2023-12-20T16:19:01.897Z",
  "region": "ap-northeast-1",
  "clusterStatus": "CREATE_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}
- [Optional] Arn of the cluster CloudFormation main stack:
Bug description and how to reproduce:
- Issue
  - Intel MPI Benchmark IMB-MPI1 PingPong performance with EFA is lower than expected. According to the log attached to this ticket, EFA was being used.
  - I would like to confirm whether this is the expected range of performance.
  - [result 1] with: FI_PROVIDER=EFA FI_LOG_LEVEL=Debug srun --mpi=pmix ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 PingPong
  - [result 2] with: FI_PROVIDER=EFA FI_LOG_LEVEL=Debug mpirun --mca pml cm --mca mtl ofi ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 PingPong
- Intel MPI Benchmark
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
268435456 1 99276.95 2703.91
- The steps to reproduce the behavior:
  - Build Intel MPI Benchmark:
    cd ~/
    git clone https://github.com/intel/mpi-benchmarks.git
    cd mpi-benchmarks/src_c
    make all
  - Submit job. Example job script (job.sh): https://rpa.st/WBZQ
  - Output (slurm-*.out): https://rpa.st/22UA
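For reference, a minimal job script of the kind linked above might look like this (a sketch only; the partition name and binary path are assumptions taken from commands elsewhere in this thread, not the actual contents of job.sh):

```shell
#!/bin/bash
# Sketch of a Slurm job script for running IMB-MPI1 PingPong over EFA.
# Partition name and paths are assumptions based on this thread.
#SBATCH --partition=c5n18
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

export FI_PROVIDER=EFA
export FI_LOG_LEVEL=Debug
srun --mpi=pmix ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 PingPong
```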
Hi, I'm not able to see any of the attachments you provided to the issue.
In addition to this, have you tried running the benchmarks using the installation of OpenMPI provided on the AMI?
@jdeamicis
My apologies for the inconvenience. I uploaded them again. (It was a misconfiguration of the expiry date...)
Yes, OpenMPI is at /opt/amazon/openmpi.
Thank you
Apologies, I had misread the title of the ticket and automatically assumed you wanted to benchmark Intel MPI on a ParallelCluster cluster :)
I can now see the attachments, thanks.
Could you please repeat the experiment, increasing the number of iterations at large message sizes? You should be able to control this via the -time or -iter_policy options of the IMB benchmarks.
Also, what happens if you use a parallel transfer benchmark such as IMB-MPI1 Uniband or the OSU osu_mbw_mr?
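The iteration controls mentioned above can be combined like this (a sketch; the specific flag values are illustrative, not a recommendation):

```shell
# Disable the adaptive iteration policy and force a fixed iteration count,
# so large message sizes are not measured with only 1 repetition.
srun --mpi=pmix ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 -iter_policy off -iter 100 PingPong

# Alternatively, allow more wall time (in seconds) per message size.
srun --mpi=pmix ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 -time 10 PingPong
```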
@jdeamicis
Thank you, this is the PingPong result. I tried -iter 10 and -iter 30, and the results are almost identical.
IMB-MPI1 -msglog 3:28 -iter_policy off -iter 10 PingPong
https://rpa.st/RZVA
I'm having an issue with IMB-MPI1 Uniband; I'll get back here once it's solved.
https://rpa.st/UJZQ
If I increase --ntasks-per-node to more than 2, the PingPong (single transfer benchmark) results get better...
#!/bin/bash
#SBATCH --partition=c5n18
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
Depending on the task distribution settings used in your job, you may be using 2 processes on the same node here, so this may be a shared memory transfer rather than over EFA. Only parallel transfer benchmarks like Uniband and the OSU bw_mbw can really exploit multiple pairs of ranks (or you could use some collectives), but please make sure you are communicating across nodes and not within nodes!
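To rule out shared-memory transfers when benchmarking, one option is to pin exactly one task per node so the two PingPong ranks must communicate over the network. A minimal sketch, assuming the same partition and binary path used earlier in the thread:

```shell
#!/bin/bash
# Sketch: force the two PingPong ranks onto different nodes so the
# transfer crosses EFA rather than using shared memory within a node.
#SBATCH --partition=c5n18
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

srun --mpi=pmix ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 PingPong
```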
Another thing: could you please try using the OSU benchmarks to exclude any (unlikely) issue with the IMB benchmarks? Thanks!
I ran the Intel MPI Benchmarks and got similar results:
https://rpa.st/AU5A
When I ran the osu_bw benchmark I got better performance:
https://rpa.st/W7LQ
So the difference in performance is either something about how the applications were compiled, or something within the actual applications.
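One quick way to check the "how they were compiled" hypothesis is to compare what each benchmark binary links against at runtime (a sketch; the binary paths are assumptions based on commands earlier in this thread):

```shell
# Compare the MPI and libfabric libraries each benchmark resolves at runtime.
# A mismatch (e.g. one binary not linked against the same Open MPI) would
# point to a build difference rather than an application difference.
ldd ~/mpi-benchmarks/src_c/IMB-MPI1 | grep -Ei 'mpi|fabric'
ldd ~/osu-micro-benchmarks-5.6.2/mpi/pt2pt/osu_bw | grep -Ei 'mpi|fabric'
```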
@jdeamicis Thank you, good point. Will do OSU, my apologies for the delay.
@hgreebe Thank you. Could you please paste again with a longer lifetime (rpaste --life 1week), or use https://rpa.st/ with Expiry set to forever?
@jdeamicis OSU mbw_mr results with: srun --mpi=pmix ~/osu-micro-benchmarks-5.6.2/mpi/pt2pt/osu_mbw_mr
- 2 nodes, 2 pairs: https://rpa.st/YLSA
- 2 nodes, 8 pairs: https://rpa.st/W3TQ
Intel MPI Benchmarks:
https://rpa.st/U5WQ
OSU Benchmarks:
https://rpa.st/X3ZQ
@panda1100 OK, it seems to me that the difference we are seeing is related to the type of MPI communication used in the two benchmarks: IMB PingPong (and osu_latency) uses blocking communication, while osu_bw (and IMB Uniband) uses non-blocking communication. Some difference is expected, but I personally wasn't expecting that much. This should be investigated further.
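One way to observe this pattern difference directly (a sketch, assuming the OSU 5.6.2 install path used earlier in the thread) is to run the blocking and non-blocking point-to-point benchmarks back to back on the same node pair:

```shell
# osu_latency does a blocking ping-pong (one message in flight at a time),
# while osu_bw keeps a window of non-blocking sends in flight, which lets
# the network pipeline transfers and reach much higher throughput.
srun --mpi=pmix --nodes=2 --ntasks-per-node=1 ~/osu-micro-benchmarks-5.6.2/mpi/pt2pt/osu_latency
srun --mpi=pmix --nodes=2 --ntasks-per-node=1 ~/osu-micro-benchmarks-5.6.2/mpi/pt2pt/osu_bw
```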
Thank you @jdeamicis . Please let me know if I can help on this.
This is the OSU benchmarks osu_bw result. (For the previous one I used osu_mbw_mr.)
OSU Benchmarks (osu_bw):
https://rpa.st/H4NQ
Intel MPI Benchmarks (PingPong):
https://rpa.st/RZVA