GRTLCollaboration / GRTeclyn

Port of GRChombo to AMReX - under development!


BinaryBH: Strong scaling test on CPUs

mirenradia opened this issue

Strong scaling test

I have run a strong scaling test of the BinaryBH example on the CSD3 Ice Lake nodes using GRTeclyn, and an analogous test with GRChombo for comparison.

Configuration details

Build options

I used the modules loaded by the rhel8/default-icl module. In particular, the relevant ones are:

7) intel-oneapi-compilers/2022.1.0/gcc/b6zld2mz
8) intel-oneapi-mpi/2021.6.0/intel/guxuvcpm

GRTeclyn

  • AMReX at b752027c1
  • GRTeclyn at df07145
  • COMP = intel-llvm with CXXFLAGS += -ipo -xICELAKE-SERVER -fp-model=fast -cxx=icpx
  • USE_MPI = TRUE
  • USE_OMP = TRUE
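For anyone reproducing the GRTeclyn build, here is a minimal sketch of the corresponding command line, assuming the standard AMReX GNU Make build system; the example path and -j value are illustrative, and note that CXXFLAGS given on the command line overrides, rather than appends to, any CXXFLAGS += lines in the makefile:

    cd GRTeclyn/Examples/BinaryBH   # assumed location of the BinaryBH example
    make -j 16 COMP=intel-llvm USE_MPI=TRUE USE_OMP=TRUE \
         CXXFLAGS="-ipo -xICELAKE-SERVER -fp-model=fast -cxx=icpx"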

GRChombo

  • Chombo at 38a95f8

  • GRChombo at b37dd59

  • Additional modules:

    12) intel-oneapi-mkl/2022.1.0/intel/mngj3ad6
    15) hdf5/1.10.8/intel/intel-oneapi-mpi/h75adcal
    
  • Make.defs.local:
    DIM            ?= 3
    DEBUG          = FALSE
    OPT            = HIGH
    PRECISION      = DOUBLE
    PROFILE        = FALSE
    CXX            = icpx
    FC             = ifort
    MPI            = TRUE
    OPENMPCC       = TRUE
    MPICXX         = mpiicpc -cxx=icpx
    XTRACONFIG     = .ICELAKE.ICPX2022.1.0
    USE_64         = TRUE
    USE_HDF        = TRUE
    HDF5_PATH      = `pkg-config --variable=prefix hdf5`
    HDFINCFLAGS    = -I${HDF5_PATH}/include
    HDFLIBFLAGS    = -L${HDF5_PATH}/lib -lhdf5 -lz
    HDFMPIINCFLAGS = -I${HDF5_PATH}/include
    HDFMPILIBFLAGS = -L${HDF5_PATH}/lib -lhdf5 -lz
    NAMESPACE      = TRUE
    USE_MT         = FALSE
    cxxdbgflags    = -g
    cxxoptflags    = -g -O3 -ipo -fp-model=fast -xICELAKE-SERVER
    fdbgflags      = -g
    foptflags      = -g -O3 -fp-model fast=2 -xICELAKE-SERVER -qopt-zmm-usage=high
    CH_AR          = llvm-ar cr
    RANLIB         = llvm-ranlib
    XTRALDFLAGS    = -fuse-ld=lld
    syslibflags    = -qmkl=sequential #-lpetsc
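With the Make.defs.local above placed in Chombo/lib/mk/, the example is then built in the usual Chombo/GRChombo way; a sketch, with the path and -j value illustrative:

    cd GRChombo/Examples/BinaryBH   # assumed location of the BinaryBH example
    make all -j 16                  # builds the required Chombo libraries, then the example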

Run time configuration

I ran on the CSD3 Ice Lake nodes, each comprising a total of 76 cores (38 per socket). The different job sizes can be found in the table below:

| Number of nodes | Total number of MPI tasks | Number of OpenMP threads per task | Total CPUs |
|----------------:|--------------------------:|----------------------------------:|-----------:|
| 1               | 19                        | 4                                  | 76         |
| 2               | 38                        | 4                                  | 152        |
| 4               | 76                        | 4                                  | 304        |
| 8               | 152                       | 4                                  | 608        |
| 16              | 304                       | 4                                  | 1216       |
| 32              | 608                       | 4                                  | 2432       |
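As a sanity check of the decomposition (19 MPI tasks per node with 4 OpenMP threads per task fills the 76 cores of a node), the totals in the table can be reproduced with a short shell loop:

    for nodes in 1 2 4 8 16 32; do
        tasks=$((nodes * 19))   # 19 MPI tasks per node
        cpus=$((tasks * 4))     # 4 OpenMP threads per task
        echo "nodes=$nodes  tasks=$tasks  cpus=$cpus"
    done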

I used SLURM job arrays to run each simulation 4 times and then took the mean of the results. A template jobscript can be found below.

Template jobscript:
#!/bin/bash

#!#############################################################
#!#### Modify the options in this section as appropriate ######
#!#############################################################

#! sbatch directives begin here ###############################
#! Name of the job:
#SBATCH -J GRC-scaling-test
#! Which project should be charged:
#SBATCH -A DIRAC-DP002-CPU
#SBATCH -p icelake
#! How many whole nodes should be allocated?
#SBATCH --nodes=<number of nodes>
#! How many tasks per node
#SBATCH --ntasks-per-node=19
#! How many cores per task
#SBATCH -c 4
#! How much wallclock time will be required?
#SBATCH --time=0:30:00
#! What types of email messages do you wish to receive?
#SBATCH --mail-type=all
#SBATCH --array=1-4

#! sbatch directives end here (put any additional directives above this line)

#! Notes:
#! Charging is determined by cpu number*walltime.

#! Number of nodes and tasks per node allocated by SLURM (do not change):
numnodes=$SLURM_JOB_NUM_NODES
numtasks=$SLURM_NTASKS
mpi_tasks_per_node=$SLURM_NTASKS_PER_NODE
#! ############################################################
#! Modify the settings below to specify the application's environment, location
#! and launch method:

#! Optionally modify the environment seen by the application
#! (note that SLURM reproduces the environment at submission irrespective of ~/.bashrc):
. /etc/profile.d/modules.sh                # Leave this line (enables the module command)
module purge                               # Removes all modules still loaded
module restore grchombo-intel-2021.6-icl   # module collection with modules above

#! Full path to application executable:
application="/path/to/executable"

#! Run options for the application:
options="$SLURM_SUBMIT_DIR/params.txt"

#! Work directory (i.e. where the job will run):
workdir="$SLURM_SUBMIT_DIR/run${SLURM_ARRAY_TASK_ID}"

export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export OMP_PROC_BIND=true

CMD="srun -K -c $SLURM_CPUS_PER_TASK --distribution=block:block $application $options"

###############################################################
### You should not have to change anything below this line ####
###############################################################

mkdir -p $workdir
cd $workdir
echo -e "Changed directory to `pwd`.\n"

JOBID=$SLURM_JOB_ID

echo -e "JobID: $JOBID\n======"
echo "Time: `date`"
echo "Running on master node: `hostname`"
echo "Current directory: `pwd`"

if [ "$SLURM_JOB_NODELIST" ]; then
        #! Create a machine file:
        export NODEFILE=`generate_pbs_nodefile`
        cat $NODEFILE | uniq > machine.file.$JOBID
        echo -e "\nNodes allocated:\n================"
        echo `cat machine.file.$JOBID | sed -e 's/\..*$//g'`
fi

module list

echo -e "\nnumtasks=$numtasks, numnodes=$numnodes, mpi_tasks_per_node=$mpi_tasks_per_node (OMP_NUM_THREADS=$OMP_NUM_THREADS)"

echo -e "\nExecuting command:\n==================\n$CMD\n"

eval $CMD 
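To drive the whole scan from this template, one option is to delete the <number of nodes> placeholder directive and pass the node count on the command line instead, since command-line options to sbatch take precedence over #SBATCH directives; the jobscript filename below is hypothetical:

    for n in 1 2 4 8 16 32; do
        sbatch --nodes="$n" grc-scaling-test.sh   # filename hypothetical
    done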

Parameters

The parameter files for each code can be found by following the links below.

Note that I fixed the box size to $16^3$ in order to maximize scaling efficiency.
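For illustration, fixing the box size corresponds to a single parameter in each code; the names below follow the usual conventions of the two codes as I recall them, and are not quotes from the actual parameter files:

    # GRChombo-style params.txt
    max_box_size = 16
    # GRTeclyn / AMReX-style inputs
    amr.max_grid_size = 16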

Results

The average walltime to complete 2 timesteps on the coarsest level is shown in the plot below. Perfect scaling is shown with the dashed line.

[Figure: strong-scaling-walltime]

The strong scaling efficiency (i.e. a value of $1.0$ means that increasing the number of nodes/CPUs by a factor of $F$ decreases the simulation walltime by a factor of $F$) is shown in the plot below.
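Concretely, taking the single-node run as the baseline (an assumption about the notebook's convention), the efficiency at $N$ nodes with mean walltime $T(N)$ is

$$E(N) = \frac{T(1)}{N \, T(N)},$$

so perfect scaling corresponds to $E(N) = 1$ at every $N$.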

[Figure: strong-scaling-efficiency]

Discussion

Although the tagging parameters were chosen to be the same for both GRChombo and GRTeclyn, inspecting the output shows that GRChombo consistently refines a larger region of the grid. This explains at least some of the significantly shorter walltimes for GRTeclyn compared to GRChombo (note that the log scale on the plot diminishes the apparent difference). Since the GRChombo configuration has larger refined regions, and thus more cells to evolve, it is not surprising that it achieves better strong scaling at larger node counts: with less work per step, GRTeclyn reaches its bottlenecks sooner.

The only slightly surprising result is that, at the largest job size (32 nodes), the GRChombo simulation completes a little faster than the GRTeclyn one.

It might be worth re-running this scaling analysis after determining the largest possible simulation that can fit on a single Ice Lake node.

For completeness, attached is a zip file containing the Jupyter notebook I used to generate the plots:
20240426_plot_strong_scaling.zip