GRTLCollaboration / GRTeclyn

Port of GRChombo to AMReX - under development!


BinaryBH: Strong scaling test on CPUs

mirenradia opened this issue

Strong scaling test

I have run a strong scaling test of the BinaryBH example on the CSD3 Ice Lake nodes using GRTeclyn, and an analogous test with GRChombo for comparison.

Configuration details

Build options

I used the modules loaded by the rhel8/default-icl module. In particular, the relevant ones are:

7) intel-oneapi-compilers/2022.1.0/gcc/b6zld2mz
8) intel-oneapi-mpi/2021.6.0/intel/guxuvcpm

GRTeclyn

  • AMReX at b752027c1
  • GRTeclyn at df07145
  • COMP = intel-llvm with CXXFLAGS += -ipo -xICELAKE-SERVER -fp-model=fast -cxx=icpx
  • USE_MPI = TRUE
  • USE_OMP = TRUE
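For anyone reproducing the GRTeclyn build, here is a minimal sketch of the corresponding command line, assuming the standard AMReX GNU Make build system; the example path and -j value are illustrative, and note that CXXFLAGS given on the command line overrides, rather than appends to, any CXXFLAGS += lines in the makefile:

    cd GRTeclyn/Examples/BinaryBH   # assumed location of the BinaryBH example
    make -j 16 COMP=intel-llvm USE_MPI=TRUE USE_OMP=TRUE \
         CXXFLAGS="-ipo -xICELAKE-SERVER -fp-model=fast -cxx=icpx"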

GRChombo

  • Chombo at 38a95f8

  • GRChombo at b37dd59

  • Additional modules:

    12) intel-oneapi-mkl/2022.1.0/intel/mngj3ad6
    15) hdf5/1.10.8/intel/intel-oneapi-mpi/h75adcal
    
  • Make.defs.local:
    DIM            ?= 3
    DEBUG          = FALSE
    OPT            = HIGH
    PRECISION      = DOUBLE
    PROFILE        = FALSE
    CXX            = icpx
    FC             = ifort
    MPI            = TRUE
    OPENMPCC       = TRUE
    MPICXX         = mpiicpc -cxx=icpx
    XTRACONFIG     = .ICELAKE.ICPX2022.1.0
    USE_64         = TRUE
    USE_HDF        = TRUE
    HDF5_PATH      = `pkg-config --variable=prefix hdf5`
    HDFINCFLAGS    = -I${HDF5_PATH}/include
    HDFLIBFLAGS    = -L${HDF5_PATH}/lib -lhdf5 -lz
    HDFMPIINCFLAGS = -I${HDF5_PATH}/include
    HDFMPILIBFLAGS = -L${HDF5_PATH}/lib -lhdf5 -lz
    NAMESPACE      = TRUE
    USE_MT         = FALSE
    cxxdbgflags    = -g
    cxxoptflags    = -g -O3 -ipo -fp-model=fast -xICELAKE-SERVER
    fdbgflags      = -g
    foptflags      = -g -O3 -fp-model fast=2 -xICELAKE-SERVER -qopt-zmm-usage=high
    CH_AR          = llvm-ar cr
    RANLIB         = llvm-ranlib
    XTRALDFLAGS    = -fuse-ld=lld
    syslibflags    = -qmkl=sequential #-lpetsc
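With the Make.defs.local above placed in Chombo/lib/mk/, the example is then built in the usual Chombo/GRChombo way; a sketch, with the path and -j value illustrative:

    cd GRChombo/Examples/BinaryBH   # assumed location of the BinaryBH example
    make all -j 16                  # builds the required Chombo libraries, then the example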

Run time configuration

I ran on the CSD3 Ice Lake nodes, each comprising a total of 76 cores (38 per socket). The different job sizes can be found in the table below:

| Number of nodes | Total number of MPI tasks | Number of OpenMP threads per task | Total CPUs |
|----------------:|--------------------------:|----------------------------------:|-----------:|
| 1               | 19                        | 4                                  | 76         |
| 2               | 38                        | 4                                  | 152        |
| 4               | 76                        | 4                                  | 304        |
| 8               | 152                       | 4                                  | 608        |
| 16              | 304                       | 4                                  | 1216       |
| 32              | 608                       | 4                                  | 2432       |
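As a sanity check of the decomposition (19 MPI tasks per node with 4 OpenMP threads per task fills the 76 cores of a node), the totals in the table can be reproduced with a short shell loop:

    for nodes in 1 2 4 8 16 32; do
        tasks=$((nodes * 19))   # 19 MPI tasks per node
        cpus=$((tasks * 4))     # 4 OpenMP threads per task
        echo "nodes=$nodes  tasks=$tasks  cpus=$cpus"
    done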

I used SLURM job arrays to run each simulation 4 times and then took the mean of the results. A template jobscript can be found below.

Template jobscript:
#!/bin/bash

#!#############################################################
#!#### Modify the options in this section as appropriate ######
#!#############################################################

#! sbatch directives begin here ###############################
#! Name of the job:
#SBATCH -J GRC-scaling-test
#! Which project should be charged:
#SBATCH -A DIRAC-DP002-CPU
#SBATCH -p icelake
#! How many whole nodes should be allocated?
#SBATCH --nodes=<number of nodes>
#! How many tasks per node
#SBATCH --ntasks-per-node=19
#! How many cores per task
#SBATCH -c 4
#! How much wallclock time will be required?
#SBATCH --time=0:30:00
#! What types of email messages do you wish to receive?
#SBATCH --mail-type=all
#SBATCH --array=1-4

#! sbatch directives end here (put any additional directives above this line)

#! Notes:
#! Charging is determined by cpu number*walltime.

#! Number of nodes and tasks per node allocated by SLURM (do not change):
numnodes=$SLURM_JOB_NUM_NODES
numtasks=$SLURM_NTASKS
mpi_tasks_per_node=$SLURM_NTASKS_PER_NODE
#! ############################################################
#! Modify the settings below to specify the application's environment, location
#! and launch method:

#! Optionally modify the environment seen by the application
#! (note that SLURM reproduces the environment at submission irrespective of ~/.bashrc):
. /etc/profile.d/modules.sh                # Leave this line (enables the module command)
module purge                               # Removes all modules still loaded
module restore grchombo-intel-2021.6-icl   # module collection with modules above

#! Full path to application executable:
application="/path/to/executable"

#! Run options for the application:
options="$SLURM_SUBMIT_DIR/params.txt"

#! Work directory (i.e. where the job will run):
workdir="$SLURM_SUBMIT_DIR/run${SLURM_ARRAY_TASK_ID}"

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=true

CMD="srun -K -c $SLURM_CPUS_PER_TASK --distribution=block:block $application $options"

###############################################################
### You should not have to change anything below this line ####
###############################################################

mkdir -p $workdir
cd $workdir
echo -e "Changed directory to `pwd`.\n"

JOBID=$SLURM_JOB_ID

echo -e "JobID: $JOBID\n======"
echo "Time: `date`"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"

if [ "$SLURM_JOB_NODELIST" ]; then
        #! Create a machine file:
        export NODEFILE=`generate_pbs_nodefile`
        cat $NODEFILE | uniq > machine.file.$JOBID
        echo -e "\nNodes allocated:\n================"
        echo `cat machine.file.$JOBID | sed -e 's/\..*$//g'`
fi

module list

echo -e "\nnumtasks=$numtasks, numnodes=$numnodes, mpi_tasks_per_node=$mpi_tasks_per_node (OMP_NUM_THREADS=$OMP_NUM_THREADS)"

echo -e "\nExecuting command:\n==================\n$CMD\n"

eval $CMD 
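To drive the whole scan from this template, one option is to delete the <number of nodes> placeholder directive and pass the node count on the command line instead, since command-line options to sbatch take precedence over #SBATCH directives; the jobscript filename below is hypothetical:

    for n in 1 2 4 8 16 32; do
        sbatch --nodes="$n" grc-scaling-test.sh   # filename hypothetical
    done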

Parameters

The parameter files for each code can be found by following the links below.

Note that I fixed the box size to $16^3$ in order to maximize scaling efficiency.
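For illustration, fixing the box size corresponds to a single parameter in each code; the names below follow the usual conventions of the two codes as I recall them, and are not quotes from the actual parameter files:

    # GRChombo-style params.txt
    max_box_size = 16
    # GRTeclyn / AMReX-style inputs
    amr.max_grid_size = 16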

Results

The average walltime to complete 2 timesteps on the coarsest level is shown in the plot below. Perfect scaling is shown with the dashed line.

[Figure: strong-scaling-walltime]

The strong scaling efficiency (i.e. a value of $1.0$ means that increasing the number of nodes/CPUs by a factor of $F$ decreases the simulation walltime by a factor of $F$) is shown in the plot below.
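Concretely, taking the single-node run as the baseline (an assumption about the notebook's convention), the efficiency at $N$ nodes with mean walltime $T(N)$ is

$$E(N) = \frac{T(1)}{N \, T(N)},$$

so perfect scaling corresponds to $E(N) = 1$ at every $N$.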

[Figure: strong-scaling-efficiency]

Discussion

Although the tagging parameters were chosen to be the same for both GRChombo and GRTeclyn, inspecting the output shows that GRChombo consistently refines a larger region of the grid. This explains at least some of the significantly shorter walltimes for GRTeclyn compared to GRChombo (note that the log scale on the plot diminishes the apparent difference). Since the GRChombo configuration has larger refined regions, and thus more cells to evolve, it is not surprising that it achieves better strong scaling at larger node counts: with less work per step, GRTeclyn reaches its bottlenecks sooner.

The only slightly surprising result is that, at the largest job size (32 nodes), the GRChombo simulation completes a little faster than the GRTeclyn one.

It might be worth re-running this scaling analysis after determining the largest possible simulation that can fit on a single Ice Lake node.

For completeness, attached is a zip file containing the Jupyter notebook I used to generate the plots:
20240426_plot_strong_scaling.zip