BinaryBH: Strong scaling test on CPUs
Strong scaling test
I have run a strong scaling test of the BinaryBH example on the CSD3 Ice Lakes using GRTeclyn, along with an analogous test using GRChombo for comparison.
Configuration details
Build options
I used the modules loaded by the rhel8/default-icl module. In particular, the relevant ones are:
7) intel-oneapi-compilers/2022.1.0/gcc/b6zld2mz
8) intel-oneapi-mpi/2021.6.0/intel/guxuvcpm
GRTeclyn
- AMReX at b752027c1
- GRTeclyn at df07145, built with

COMP = intel-llvm
CXXFLAGS += -ipo -xICELAKE-SERVER -fp-model=fast -cxx=icpx
USE_MPI = TRUE
USE_OMP = TRUE
GRChombo
- Chombo at 38a95f8
- GRChombo at b37dd59
- Additional modules:

12) intel-oneapi-mkl/2022.1.0/intel/mngj3ad6
15) hdf5/1.10.8/intel/intel-oneapi-mpi/h75adcal

Show Make.defs.local
DIM ?= 3
DEBUG = FALSE
OPT = HIGH
PRECISION = DOUBLE
PROFILE = FALSE
CXX = icpx
FC = ifort
MPI = TRUE
OPENMPCC = TRUE
MPICXX = mpiicpc -cxx=icpx
XTRACONFIG = .ICELAKE.ICPX2022.1.0
USE_64 = TRUE
USE_HDF = TRUE
HDF5_PATH = `pkg-config --variable=prefix hdf5`
HDFINCFLAGS = -I${HDF5_PATH}/include
HDFLIBFLAGS = -L${HDF5_PATH}/lib -lhdf5 -lz
HDFMPIINCFLAGS = -I${HDF5_PATH}/include
HDFMPILIBFLAGS = -L${HDF5_PATH}/lib -lhdf5 -lz
NAMESPACE = TRUE
USE_MT = FALSE
cxxdbgflags = -g
cxxoptflags = -g -O3 -ipo -fp-model=fast -xICELAKE-SERVER
fdbgflags = -g
foptflags = -g -O3 -fp-model fast=2 -xICELAKE-SERVER -qopt-zmm-usage=high
CH_AR = llvm-ar cr
RANLIB = llvm-ranlib
XTRALDFLAGS = -fuse-ld=lld
syslibflags = -qmkl=sequential #-lpetsc
Run time configuration
I ran on the CSD3 Ice Lakes, where each node comprises a total of 76 cores (38 per socket). The different job sizes can be found in the table below.
| Number of nodes | Total number of MPI tasks | Number of OpenMP threads per task | Total CPUs |
|---|---|---|---|
| 1 | 19 | 4 | 76 |
| 2 | 38 | 4 | 152 |
| 4 | 76 | 4 | 304 |
| 8 | 152 | 4 | 608 |
| 16 | 304 | 4 | 1216 |
| 32 | 608 | 4 | 2432 |
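Every job size uses the same decomposition of 19 MPI tasks per node with 4 OpenMP threads each, filling the 76 cores of an Ice Lake node. The short Python sketch below simply reproduces the arithmetic behind the table rows.

```python
# Reproduce the job-size arithmetic from the table above:
# 19 MPI tasks per node x 4 OpenMP threads per task fills the 76 cores of a node.
tasks_per_node = 19
threads_per_task = 4

for nodes in (1, 2, 4, 8, 16, 32):
    total_tasks = nodes * tasks_per_node
    total_cpus = total_tasks * threads_per_task
    print(f"{nodes:>2} nodes: {total_tasks:>3} MPI tasks, {total_cpus:>4} CPUs")
```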
I used SLURM job arrays to run each simulation 4 times and then took the mean of the results (a short sketch of this averaging step is given after the jobscript). A template jobscript can be found below.
Show template jobscript
#!/bin/bash
#!#############################################################
#!#### Modify the options in this section as appropriate ######
#!#############################################################
#! sbatch directives begin here ###############################
#! Name of the job:
#SBATCH -J GRC-scaling-test
#! Which project should be charged:
#SBATCH -A DIRAC-DP002-CPU
#SBATCH -p icelake
#! How many whole nodes should be allocated?
#SBATCH --nodes=<number of nodes>
#! How many tasks per node
#SBATCH --ntasks-per-node=19
#! How many cores per task
#SBATCH -c 4
#! How much wallclock time will be required?
#SBATCH --time=0:30:00
#! What types of email messages do you wish to receive?
#SBATCH --mail-type=all
#SBATCH --array=1-4
#! sbatch directives end here (put any additional directives above this line)
#! Notes:
#! Charging is determined by cpu number*walltime.
#! Number of nodes and tasks per node allocated by SLURM (do not change):
numnodes=$SLURM_JOB_NUM_NODES
numtasks=$SLURM_NTASKS
mpi_tasks_per_node=$SLURM_NTASKS_PER_NODE
#! ############################################################
#! Modify the settings below to specify the application's environment, location
#! and launch method:
#! Optionally modify the environment seen by the application
#! (note that SLURM reproduces the environment at submission irrespective of ~/.bashrc):
. /etc/profile.d/modules.sh # Leave this line (enables the module command)
module purge # Removes all modules still loaded
module restore grchombo-intel-2021.6-icl # module collection with modules above
#! Full path to application executable:
application="/path/to/executable"
#! Run options for the application:
options="$SLURM_SUBMIT_DIR/params.txt"
#! Work directory (i.e. where the job will run):
workdir="$SLURM_SUBMIT_DIR/run${SLURM_ARRAY_TASK_ID}"
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=true
CMD="srun -K -c $SLURM_CPUS_PER_TASK --distribution=block:block $application $options"
###############################################################
### You should not have to change anything below this line ####
###############################################################
mkdir -p $workdir
cd $workdir
echo -e "Changed directory to `pwd`.\n"
JOBID=$SLURM_JOB_ID
echo -e "JobID: $JOBID\n======"
echo "Time: `date`"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"
if [ "$SLURM_JOB_NODELIST" ]; then
#! Create a machine file:
export NODEFILE=`generate_pbs_nodefile`
cat $NODEFILE | uniq > machine.file.$JOBID
echo -e "\nNodes allocated:\n================"
echo `cat machine.file.$JOBID | sed -e 's/\..*$//g'`
fi
module list
echo -e "\nnumtasks=$numtasks, numnodes=$numnodes, mpi_tasks_per_node=$mpi_tasks_per_node (OMP_NUM_THREADS=$OMP_NUM_THREADS)"
echo -e "\nExecuting command:\n==================\n$CMD\n"
eval $CMD
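The averaging over the four array runs was done offline; a minimal Python sketch of that step is below. The run<N>/walltime.txt file name is an assumption for illustration only (neither code writes such a file automatically; in practice the timings were extracted from the code output).

```python
# Hedged sketch of averaging the walltime over the 4 job-array runs.
# Assumes each run<N> directory contains a plain-text file "walltime.txt"
# holding the measured walltime in seconds (hypothetical file name).
from pathlib import Path
from statistics import mean


def mean_walltime(job_dir: str, num_runs: int = 4) -> float:
    """Return the mean walltime (in seconds) over the job-array run directories."""
    times = [
        float((Path(job_dir) / f"run{i}" / "walltime.txt").read_text())
        for i in range(1, num_runs + 1)
    ]
    return mean(times)


# Example usage with a placeholder directory name:
# print(mean_walltime("icelake_01_nodes"))
```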
Parameters
The parameter files for each code can be found by following the links below
Note that I fixed the box size to
Results
The average walltime to complete 2 timesteps on the coarsest level is shown in the plot below. Perfect scaling is shown with the dashed line.
The strong scaling efficiency (i.e. a value of 1 corresponds to perfect scaling) is shown in the plot below.
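As a point of reference, one common definition of the strong scaling efficiency (relative to the single-node run) is sketched below in Python. The walltimes in example_walltimes are placeholders for illustration only, not the measured results.

```python
# Strong scaling efficiency relative to the single-node run:
# efficiency(N) = T(1 node) / (N * T(N nodes)), so perfect scaling gives 1.


def strong_scaling_efficiency(walltimes: dict[int, float]) -> dict[int, float]:
    """Map node count -> efficiency, given mean walltimes keyed by node count."""
    t_single_node = walltimes[1]
    return {n: t_single_node / (n * t) for n, t in walltimes.items()}


# Placeholder walltimes in seconds (NOT the measured results):
example_walltimes = {1: 1000.0, 2: 520.0, 4: 270.0, 8: 150.0}
print(strong_scaling_efficiency(example_walltimes))
```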
Discussion
Although the same tagging parameters were chosen for both GRChombo and GRTeclyn, inspecting the output shows that GRChombo consistently refines a larger region of the grid. This explains at least some of the significantly shorter walltimes for GRTeclyn compared to GRChombo (note that the log scale on the plot diminishes the apparent difference). Because the GRChombo configuration has larger refined regions, and thus more cells to evolve, it is not surprising that it achieves better strong scaling at the larger node counts: GRTeclyn, with fewer cells per rank, reaches its bottlenecks sooner.
The only slightly surprising result is that the GRChombo simulation with the largest job size (32 nodes) completes a little faster than the GRTeclyn simulation of the same job size.
It might be worth re-running this scaling analysis after determining the largest possible simulation that can fit on a single Ice Lake node.
For completeness, attached is a zip file containing the Jupyter notebook I used to generate the plots:
20240426_plot_strong_scaling.zip