Intel MPI Benchmarks (IMB-MPI1) performance issue? with EFA and Rocky Linux 8 custom image
panda1100 opened this issue · comments
Required Info:
- AWS ParallelCluster version [e.g. 3.1.1]: 3.8.0
- Full cluster configuration without any credentials or personal data:
  - cluster configuration: https://rpa.st/AAJQ
  - whole procedure: https://ciq.com/blog/how-to-use-aws-parallelcluster-3-8-0-with-rocky-linux-8/
- Cluster name: rocky8-cluster
- Output of the pcluster describe-cluster command:
{
  "creationTime": "2023-12-20T16:19:01.897Z",
  "headNode": {
    "launchTime": "2023-12-20T16:23:42.000Z",
    "instanceId": "i-******",
    "publicIpAddress": "***.***.***.***",
    "instanceType": "t2.xlarge",
    "state": "running",
    "privateIpAddress": "10.0.0.230"
  },
  "version": "3.8.0",
  "clusterConfiguration": {
    "url": "******"
  },
  "tags": [
    {
      "value": "3.8.0",
      "key": "parallelcluster:version"
    },
    {
      "value": "rocky8-cluster",
      "key": "parallelcluster:cluster-name"
    }
  ],
  "cloudFormationStackStatus": "CREATE_COMPLETE",
  "clusterName": "rocky8-cluster",
  "computeFleetStatus": "RUNNING",
  "cloudformationStackArn": "arn:aws:cloudformation:ap-northeast-1:******:stack/rocky8-cluster/******",
  "lastUpdatedTime": "2023-12-20T16:19:01.897Z",
  "region": "ap-northeast-1",
  "clusterStatus": "CREATE_COMPLETE",
  "scheduler": {
    "type": "slurm"
  }
}
- [Optional] Arn of the cluster CloudFormation main stack:
Bug description and how to reproduce:
- Issue
  - Intel MPI Benchmark IMB-MPI1 PingPong performance with EFA is lower than expected. According to the log attached to this ticket, EFA was being used.
  - I would like to confirm whether this is the expected range of performance.
  - [result 1] with: FI_PROVIDER=EFA FI_LOG_LEVEL=Debug srun --mpi=pmix ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 PingPong
  - [result 2] with: FI_PROVIDER=EFA FI_LOG_LEVEL=Debug mpirun --mca pml cm --mca mtl ofi ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 PingPong
- Intel MPI Benchmark
#---------------------------------------------------
# Benchmarking PingPong
# #processes = 2
#---------------------------------------------------
#bytes #repetitions t[usec] Mbytes/sec
268435456 1 99276.95 2703.91
- The steps to reproduce the behavior:
  - Build Intel MPI Benchmark:
    cd ~/
    git clone https://github.com/intel/mpi-benchmarks.git
    cd mpi-benchmarks/src_c
    make all
  - Submit job. Example job script (job.sh): https://rpa.st/WBZQ
  - Output (slurm-*.out): https://rpa.st/22UA
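For reference, a minimal job script of the kind linked above might look like this (a sketch only; the partition name and binary path are assumptions taken from commands elsewhere in this thread, not the actual contents of job.sh):

```shell
#!/bin/bash
# Sketch of a Slurm job script for running IMB-MPI1 PingPong over EFA.
# Partition name and paths are assumptions based on this thread.
#SBATCH --partition=c5n18
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

export FI_PROVIDER=EFA
export FI_LOG_LEVEL=Debug
srun --mpi=pmix ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 PingPong
```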
Hi, I'm not able to see any of the attachments you provided to the issue.
In addition to this, have you tried running the benchmarks using the installation of OpenMPI provided on the AMI?
@jdeamicis
My apologies for the inconvenience. I uploaded them again. (It was a misconfiguration of the expiry date...)
Yes, OpenMPI is at /opt/amazon/openmpi.
Thank you
Apologies, I had misread the title of the ticket and automatically assumed you wanted to benchmark Intel MPI on a ParallelCluster cluster :)
I can now see the attachments, thanks.
Could you please repeat the experiment, increasing the number of iterations at large message sizes? You should be able to control this via the -time or -iter_policy options of the IMB benchmarks.
Also, what happens if you use a parallel transfer benchmark such as IMB-MPI1 Uniband or the OSU osu_mbw_mr?
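The iteration controls mentioned above can be combined like this (a sketch; the specific flag values are illustrative, not a recommendation):

```shell
# Disable the adaptive iteration policy and force a fixed iteration count,
# so large message sizes are not measured with only 1 repetition.
srun --mpi=pmix ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 -iter_policy off -iter 100 PingPong

# Alternatively, allow more wall time (in seconds) per message size.
srun --mpi=pmix ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 -time 10 PingPong
```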
@jdeamicis
Thank you, this is the PingPong result. I tried -iter 10 and -iter 30, and the results are almost identical.
IMB-MPI1 -msglog 3:28 -iter_policy off -iter 10 PingPong
https://rpa.st/RZVA
I'm having an issue with IMB-MPI1 Uniband; I'll get back here once it's solved.
https://rpa.st/UJZQ
If I increase --ntasks-per-node to more than 2, the PingPong (single transfer benchmark) results get better...
#!/bin/bash
#SBATCH --partition=c5n18
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=8
Depending on the task distribution settings used in your job, you may be using 2 processes on the same node here, so this may be a shared memory transfer rather than over EFA. Only parallel transfer benchmarks like Uniband and the OSU bw_mbw can really exploit multiple pairs of ranks (or you could use some collectives), but please make sure you are communicating across nodes and not within nodes!
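To rule out shared-memory transfers when benchmarking, one option is to pin exactly one task per node so the two PingPong ranks must communicate over the network. A minimal sketch, assuming the same partition and binary path used earlier in the thread:

```shell
#!/bin/bash
# Sketch: force the two PingPong ranks onto different nodes so the
# transfer crosses EFA rather than using shared memory within a node.
#SBATCH --partition=c5n18
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1

srun --mpi=pmix ~/mpi-benchmarks/src_c/IMB-MPI1 -msglog 3:28 PingPong
```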
Another thing: could you please try using the OSU benchmarks to exclude any (unlikely) issue with the IMB benchmarks? Thanks!
I ran the Intel MPI Benchmarks and got similar results:
https://rpa.st/AU5A
When I ran the osu_bw benchmark I got better performance:
https://rpa.st/W7LQ
So the difference in performance is either something about how the applications were compiled, or something within the actual applications.
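One quick way to check the "how they were compiled" hypothesis is to compare what each benchmark binary links against at runtime (a sketch; the binary paths are assumptions based on commands earlier in this thread):

```shell
# Compare the MPI and libfabric libraries each benchmark resolves at runtime.
# A mismatch (e.g. one binary not linked against the same Open MPI) would
# point to a build difference rather than an application difference.
ldd ~/mpi-benchmarks/src_c/IMB-MPI1 | grep -Ei 'mpi|fabric'
ldd ~/osu-micro-benchmarks-5.6.2/mpi/pt2pt/osu_bw | grep -Ei 'mpi|fabric'
```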
@jdeamicis Thank you, good point. Will do OSU, my apologies for the delay.
@hgreebe Thank you. Could you please paste again with a longer lifetime (rpaste --life 1week), or use https://rpa.st/ with Expiry set to forever?
@jdeamicis OSU mbw_mr results with: srun --mpi=pmix ~/osu-micro-benchmarks-5.6.2/mpi/pt2pt/osu_mbw_mr
- 2 nodes, 2 pairs: https://rpa.st/YLSA
- 2 nodes, 8 pairs: https://rpa.st/W3TQ
Intel MPI Benchmarks:
https://rpa.st/U5WQ
OSU Benchmarks:
https://rpa.st/X3ZQ
@panda1100 OK, it seems to me that the difference we are seeing is related to the type of MPI communication used in the two benchmarks: IMB PingPong (and osu_latency) uses blocking communication, while osu_bw (and IMB Uniband) uses non-blocking communication. Some difference is expected, but I personally wasn't expecting that much. This should be investigated further.
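One way to observe this pattern difference directly (a sketch, assuming the OSU 5.6.2 install path used earlier in the thread) is to run the blocking and non-blocking point-to-point benchmarks back to back on the same node pair:

```shell
# osu_latency does a blocking ping-pong (one message in flight at a time),
# while osu_bw keeps a window of non-blocking sends in flight, which lets
# the network pipeline transfers and reach much higher throughput.
srun --mpi=pmix --nodes=2 --ntasks-per-node=1 ~/osu-micro-benchmarks-5.6.2/mpi/pt2pt/osu_latency
srun --mpi=pmix --nodes=2 --ntasks-per-node=1 ~/osu-micro-benchmarks-5.6.2/mpi/pt2pt/osu_bw
```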
Thank you @jdeamicis . Please let me know if I can help on this.
This is the OSU benchmarks osu_bw result. (For the previous one I used osu_mbw_mr.)
OSU Benchmarks (osu_bw):
https://rpa.st/H4NQ
Intel MPI Benchmarks (PingPong):
https://rpa.st/RZVA