aws / aws-parallelcluster

AWS ParallelCluster is an AWS supported Open Source cluster management tool to deploy and manage HPC clusters in the AWS cloud.

Home Page: https://github.com/aws/aws-parallelcluster

Intel MPI Benchmarks (IMB-MPI1) performance issue? with EFA and Rocky Linux 8 custom image

panda1100 opened this issue · comments

Required Info:

{
  "creationTime": "2023-12-20T16:19:01.897Z",
  "headNode": {
    "launchTime": "2023-12-20T16:23:42.000Z",
    "instanceId": "i-******",
    "publicIpAddress": "***.***.***.***",
    "instanceType": "t2.xlarge",
    "state": "running",
    "privateIpAddress": "10.0.0.230"
  },
  "version": "3.8.0",
  "clusterConfiguration": {
    "url": "******"
  },
  "tags": [
    {
      "value": "3.8.0",
      "key": "parallelcluster:version"
    },
    {
      "value": "rocky8-cluster",
      "key": "parallelcluster:cluster-name"
    }
  ],
  "cloudFormationStackStatus": "CREATE_COMPLETE",
  "clusterName": "rocky8-cluster",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:ap-northeast-1:******:stack/rocky8-cluster/******",
  "lastUpdatedTime": "2023-12-20T16:19:01.897Z",
  "region": "ap-northeast-1",
  "clusterStatus": "CREATE_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}
  • [Optional] Arn of the cluster CloudFormation main stack:

Bug description and how to reproduce:

  • Issue
    • Intel MPI Benchmarks IMB-MPI1 PingPong performance with EFA does not look as expected
      • According to the log attached to this ticket, EFA was used
      • I would like to confirm if this is the expected range of performance.
    • [result 1] with FI_PROVIDER=EFA FI_LOG_LEVEL=Debug srun --mpi=pmix ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 PingPong
    • [result 2] with FI_PROVIDER=EFA FI_LOG_LEVEL=Debug mpirun --mca pml cm --mca mtl ofi ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 PingPong
#---------------------------------------------------
# Benchmarking PingPong 
# #processes = 2 
#---------------------------------------------------
       #bytes #repetitions      t[usec]   Mbytes/sec
    268435456            1     99276.95      2703.91
  • The steps to reproduce the behavior.
    • Build Intel MPI Benchmark
    • cd ~/
      git clone https://github.com/intel/mpi-benchmarks.git
      cd mpi-benchmarks/src_c
      make all
      
    • Submit the job. Example job script job.sh: https://rpa.st/WBZQ (see the sketch after this list)
    • Output slurm-*.out https://rpa.st/22UA
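
For readers who cannot open the paste links, a minimal sketch of a job script along these lines (the partition name and task layout are assumptions taken from later comments in this thread, not the actual job.sh):

#!/bin/bash
#SBATCH --partition=c5n18        # assumed partition name, taken from a later comment
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1      # one rank per node so PingPong crosses the network

# Select the EFA libfabric provider and enable debug logging, as in the commands above.
export FI_PROVIDER=EFA
export FI_LOG_LEVEL=Debug

srun --mpi=pmix ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 PingPong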

Hi, I'm not able to see any of the attachments you provided to the issue.

In addition to this, have you tried running the benchmarks using the installation of OpenMPI provided on the AMI?

@jdeamicis
My apologies for the inconvenience. I uploaded them again. (It was a misconfiguration of the expiry date ...)
Yes, the OpenMPI is the one at /opt/amazon/openmpi.

Thank you
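
For context, a minimal sketch of building the benchmarks against that OpenMPI (paths assumed from the comment above; the IMB src_c Makefile builds with mpicc, so it only needs to resolve to the AMI-provided wrapper):

# Put the AMI-provided OpenMPI wrappers and launcher first on the PATH.
export PATH=/opt/amazon/openmpi/bin:$PATH
which mpicc mpirun               # both should resolve under /opt/amazon/openmpi

cd ~/mpi-benchmarks/src_c
make all CC=mpicc                # rebuild IMB-MPI1 against that OpenMPI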

Apologies, I had misread the title of the ticket and automatically assumed you wanted to benchmark Intel MPI on a PC cluster :)

I can now see the attachments, thanks.

Could you please repeat the experiment increasing the number of iterations at large message sizes? You should be able to control it via the -time or the -iter_policy options of the IMB benchmarks.
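
For example, a sketch using the IMB iteration options (-iter, -iter_policy); the exact values are only a suggestion, not taken from the thread:

# Force a fixed repetition count at every message size instead of letting IMB
# shrink it for large messages.
srun --mpi=pmix ~/mpi-benchmarks/src_c/IMB-MPI1 \
    -msglog 3:28 -iter 100 -iter_policy off PingPong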

Also, what happens if you use a parallel transfer benchmark such as IMB-MPI1 Uniband or the OSU osu_mbw_mr?
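
A sketch of how those parallel transfer benchmarks could be launched across two nodes (binary paths taken from elsewhere in the thread; the rank counts are only a suggestion):

# IMB-MPI1 Uniband with 8 ranks per node (8 communicating pairs).
srun --mpi=pmix -N 2 --ntasks-per-node=8 \
    ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 Uniband

# OSU multi-pair bandwidth / message-rate benchmark with the same layout.
# With Slurm's default block distribution, ranks 0-7 land on the first node and
# ranks 8-15 on the second, so each osu_mbw_mr pair spans the two nodes.
srun --mpi=pmix -N 2 --ntasks-per-node=8 \
    ~/osu-micro-benchmarks-5.6.2/mpi/pt2pt/osu_mbw_mr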

@jdeamicis
Thank you. This is the PingPong result. I tried -iter 10 and -iter 30 and the results are almost identical.

IMB-MPI1 -msglog 3:28 -iter_policy off -iter 10 PingPong
https://rpa.st/RZVA

I have an issue with IMB-MPI1 Uniband; I will get back here once it is solved.
https://rpa.st/UJZQ

If I increase --ntasks-per-node to more than 2, the PingPong (single transfer benchmark) results get better...

#!/bin/bash
#SBATCH --partition=c5n18
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8

https://rpa.st/ZC2A

Depending on the task distribution settings used in your job, you may be using 2 processes on the same node here, so this may be a shared memory transfer rather than one over EFA. Only parallel transfer benchmarks like Uniband and osu_mbw_mr can really exploit multiple pairs of ranks (or you could use some collectives), but please make sure you are communicating across nodes and not within nodes!
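
One way to guarantee that the two PingPong ranks land on different nodes (a sketch; the flag choice is a suggestion, not taken from the thread):

# Exactly two ranks, one per node, so the transfer has to cross the network.
srun --mpi=pmix -N 2 --ntasks-per-node=1 \
    ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 PingPong

# Quick sanity check of rank placement: print the host each task runs on.
srun -N 2 --ntasks-per-node=1 hostname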

Another thing: could you please try using the OSU benchmarks to exclude any (unlikely) issue with the IMB benchmarks? Thanks!

I ran the Intel MPI Benchmarks and got similar results:
https://rpa.st/AU5A

When I ran the osu_bw benchmark I got better performance:
https://rpa.st/W7LQ

So the difference in performance is either something about how the applications were compiled, or something within the actual applications.

@jdeamicis Thank you, good point. I will run the OSU benchmarks; my apologies for the delay.
@hgreebe Thank you. Could you please paste again with a longer lifetime (rpaste --life 1week), or use https://rpa.st/ with Expiry set to forever?

@jdeamicis These are the OSU osu_mbw_mr results, run with srun --mpi=pmix ~/osu-micro-benchmarks-5.6.2/mpi/pt2pt/osu_mbw_mr:
2 nodes, 2 pairs: https://rpa.st/YLSA
2 nodes, 8 pairs: https://rpa.st/W3TQ

Intel MPI Benchmarks:
https://rpa.st/U5WQ

OSU Benchmarks:
https://rpa.st/X3ZQ

@panda1100 OK, it seems to me that the difference we are seeing is related to the type of MPI communication used in the two benchmarks: IMB PingPong (and osu_latency) use blocking communication, while osu_bw (and IMB Uniband) use non-blocking communication. Some difference is expected, but I personally wasn't expecting that much. This should be further investigated.
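
To see the two communication patterns side by side on the same node pair, the blocking and non-blocking OSU point-to-point tests could be run back to back (a sketch; paths assumed from earlier comments):

# Blocking ping-pong: one message in flight at a time (same pattern as IMB PingPong).
srun --mpi=pmix -N 2 --ntasks-per-node=1 \
    ~/osu-micro-benchmarks-5.6.2/mpi/pt2pt/osu_latency

# Non-blocking streaming: a window of outstanding sends (same pattern as IMB Uniband).
srun --mpi=pmix -N 2 --ntasks-per-node=1 \
    ~/osu-micro-benchmarks-5.6.2/mpi/pt2pt/osu_bw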

Thank you @jdeamicis . Please let me know if I can help on this.

This is the OSU benchmarks osu_bw result. (In the previous one I used osu_mbw_mr.)

OSU Benchmarks (osu_bw)
https://rpa.st/H4NQ

Intel MPI Benchmarks (PingPong)
https://rpa.st/RZVA