Artifact of "Distributed Continuation Stealing is More Scalable than You Might Think" in Cluster '22
This is the artifact of the following paper:
Shumpei Shiina and Kenjiro Taura. Distributed Continuation Stealing is More Scalable than You Might Think. In the 24th IEEE International Conference on Cluster Computing (Cluster '22), 2022.
massivethreads-dm
: MassiveThreads/DM
  - Original repository: https://github.com/shigeki-akiyama/massivethreads-dm
dnpbenchs
: Benchmarks for MassiveThreads/DM
isola
: Job management tool
  - Original repository: https://github.com/s417-lama/isola
raw_results
: Raw experimental data obtained in our environments
Competitors of UTS:
saws
: SAWS
  - Original repository: https://github.com/brianlarkins/saws
charm
: Charm++
  - Original repository: https://github.com/UIUC-PPL/charm
x10
: X10
  - Original repository: https://github.com/x10-lang/x10
  - The UTS benchmark was taken from: https://github.com/x10-lang/x10-benchmarks
Several versions of MassiveThreads/DM are separately managed in subdirs.
Continuation stealing with different join implementations (Section III):
waitq
: stalling join (baseline)
waitq_gc
: stalling join + local collection
greedy
: greedy join
greedy_gc
: greedy join + local collection
Child stealing (Section IV.B):
child_stealing
: RtC threads
child_stealing_ult
: Full threads
Extended future implementations for evaluation of LCS (Section V.D):
future_multi
: Cont. Steal (greedy) in TABLE III (based on greedy_gc)
waitq_gc_multi
: Cont. Steal (stalling) in TABLE III (based on waitq_gc)
child_stealing_ult_multi
: Child Steal (Full) in TABLE III (based on child_stealing_ult)
In this artifact, we use a tool called isola to manage experimental environments for MassiveThreads/DM.
To install isola, run:
cd isola/
./install.bash # or `ISOLA_ROOT=/path/to/install ./install.bash` to explicitly set the installation path
and set the PATH environment variable (e.g., in .bashrc):
export PATH=~/.isola/bin:$PATH
To check if isola is correctly installed:
$ isola -h
Usage: isola [OPTIONS] COMMAND
COMMANDs:
create create an isola
run run commands in a temporal isola
ls list isolas
rm remove isolas
where show path to an isola dir
cleanup try to fix inconsistent state
OPTIONS:
-h show help
Please see isola/README.md for more information.
As MassiveThreads/DM uses MPI-3 RMA as a backend communication library, MPI needs to be installed.
Currently, we have tested only the following MPI implementations and platforms:
- Open MPI v5.0.x for systems with Intel x86_64 CPUs and InfiniBand
- Fujitsu MPI for Fugaku-like systems (A64FX + Tofu Interconnect-D)
Important notes on the use of MPI-3 RMA:
- Open MPI v4 (or earlier) does not support the MCA parameter osc_ucx_acc_single_intrinsic, which is crucial for work-stealing performance in MassiveThreads/DM. Thus, unless Open MPI v5.0.x (or newer) is used, performance will be poor. (A quick way to check your installation is shown after these notes.)
- Although many MPI implementations support one-sided communication (MPI-3 RMA), some of them are not truly "one-sided": progress of RMA operations depends on the remote side (i.e., unless the remote process calls some MPI functions, no progress is made on RMA operations).
- See also the work by Schuchart et al. at ExaMPI '19: Using MPI-3 RMA for Active Messages
- Open MPI and Fujitsu MPI (which is also based on Open MPI) are truly one-sided, so we used them for evaluation
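To check whether an existing Open MPI installation exposes this MCA parameter, a minimal probe (assuming ompi_info is on your PATH) might be:
# No output below suggests the parameter (and hence a suitable Open MPI) is missing.
ompi_info --param osc ucx --level 9 | grep acc_single_intrinsic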
In the following, we show how to install Open MPI v5.0.0rc7 with UCX v1.11.0, but you are not limited to this particular setup.
Example script for UCX v1.11.0 installation:
UCX_VERSION=1.11.0
wget https://github.com/openucx/ucx/releases/download/v${UCX_VERSION}/ucx-${UCX_VERSION}.tar.gz
tar xvf ucx-${UCX_VERSION}.tar.gz
cd ucx-${UCX_VERSION}/
./contrib/configure-release --prefix=$UCX_INSTALL_PREFIX
make -j
make install
Example script for Open MPI v5.0.0rc7 installation:
wget https://download.open-mpi.org/release/open-mpi/v5.0/openmpi-5.0.0rc7.tar.gz
tar xvf openmpi-5.0.0rc7.tar.gz
cd openmpi-5.0.0rc7/
./configure --prefix=$OMPI_INSTALL_PREFIX --with-ucx=$UCX_INSTALL_PREFIX --disable-man-pages --with-pmix=internal --with-hwloc=internal --with-libevent=internal
make -j
make install
envs.sh is loaded internally by many scripts in this artifact; you must configure this script correctly for your system.
To configure envs.sh, add some branches to the if-else statements so that $(hostname) correctly matches your machine's hostname.
Variables to be set in envs.sh:
MACHINE_NAME
: Unique name for your machine
  - Used in measurement scripts in dnpbenchs
CPU_SIMD_FREQ
: CPU frequency while SIMD instructions are executed on all the cores (in GHz)
  - Used in synthetic benchmarks to set expected execution times for leaf tasks
  - SIMD instructions: AVX2 instructions for x86_64 and SVE instructions for A64FX
  - Note: the AVX frequency of Intel CPUs is often different from the base CPU frequency
Additionally, you can set some PATH variables (e.g., paths to compilers and MPI) in envs.sh (not mandatory).
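As a concrete starting point, a hypothetical envs.sh branch might look like the following sketch (the hostname pattern and all values are placeholders, not taken from the artifact):
if [[ $(hostname) =~ ^mycluster ]]; then
  export MACHINE_NAME=mycluster      # unique machine name
  export CPU_SIMD_FREQ=2.1           # all-core SIMD (AVX2/SVE) frequency in GHz
  export PATH=$OMPI_INSTALL_PREFIX/bin:$PATH  # optional: compilers, MPI
else
  echo "envs.sh: no branch matches hostname $(hostname)" >&2
fi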
massivethreads-dm/make_common.bash
is a common script to build each version of MassiveThreads/DM.
You can modify this script to configure MassiveThreads/DM for each machine.
The following configure command will work for most cases:
CCFLAGS=$CFLAGS CXXFLAGS=$CFLAGS ./configure --with-comm-layer=mpi3 ${logger_options[@]+"${logger_options[@]}"} --prefix=${INSTALL_DIR}
- --with-comm-layer=mpi3 is mandatory; other communication layers do not work.
- The --enable-polling option is for MPI-3 RMA implementations that are not truly one-sided; it inserts polling MPI calls (MPI_Iprobe()) into the work-stealing scheduler's code, which might add significant overhead to fine-grained tasks.
  - If the execution hangs, try this option.
To build all versions of MassiveThreads/DM, run:
cd massivethreads-dm/
./make_all.bash
All versions of MassiveThreads/DM are installed by Isola. You can check your installations:
$ isola ls massivethreads-dm
massivethreads-dm:child_stealing
massivethreads-dm:child_stealing_prof
massivethreads-dm:child_stealing_ult
massivethreads-dm:child_stealing_ult_multi
massivethreads-dm:child_stealing_ult_prof
massivethreads-dm:child_stealing_ult_trace
massivethreads-dm:future_multi
massivethreads-dm:greedy
massivethreads-dm:greedy_gc
massivethreads-dm:greedy_gc_prof
massivethreads-dm:greedy_gc_trace
massivethreads-dm:waitq
massivethreads-dm:waitq_gc
massivethreads-dm:waitq_gc_multi
massivethreads-dm:waitq_gc_prof
massivethreads-dm:waitq_gc_trace
massivethreads-dm:waitq_prof
Isola names ending in *_prof are profiling-enabled versions, and those ending in *_trace are tracing-enabled versions for timeline visualization.
Before running experiments, you need to configure the dnpbenchs/measure_jobs/common.bash file.
In the dnp_run_common() function in that file, you need to set the following three parameters:
DNP_CORES
variable: the maximum number of cores within a node in your system
DNP_NODES
variable: the number of nodes allocated by the job manager (e.g., Slurm)
  - e.g., SLURM_JOB_NUM_NODES in Slurm
  - If you do not use any job manager, manually set this variable for each measurement
  - a hypothetical setting for both variables is sketched after the dnp_mpirun() example below
dnp_mpirun()
function: specifies how to invoke an MPI program (see below)
An example of dnp_mpirun() for Open MPI:
dnp_mpirun() {
local n_processes=$1
local n_processes_per_node=$2
local out_file=$3
mpirun -n $n_processes -N $n_processes_per_node \
--mca osc_ucx_acc_single_intrinsic true \
../opt/massivethreads-dm/bin/madm_disable_aslr "${@:4}" | tee $out_file
}
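For the two variables above, a hypothetical setting on a Slurm system might be (values are placeholders; SLURM_JOB_NUM_NODES is set by Slurm):
DNP_CORES=36                          # maximum cores per node on your system
DNP_NODES=${SLURM_JOB_NUM_NODES:-1}   # from the job manager, or set manually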
In addition, you may want to specify the location of the hostfile and other options.
Note: the --mca osc_ucx_acc_single_intrinsic true option is critical for performance, as mentioned earlier.
Files:
dnpbenchs/sched_bench.{cpp,hpp}
: C++ code for the PFor and RecPFor benchmarks (Section IV.C)
dnpbenchs/measure_jobs/sched_pfor.bash
: measurement config for PFor
dnpbenchs/measure_jobs/sched_rrm.bash
: measurement config for RecPFor
dnpbenchs/scripts/batch_sched.bash
: batch script to run measurement jobs sequentially
(optional) dnpbenchs/scripts/batch_sched_submit.bash
: batch script to run measurement jobs in parallel by using a job management system
  - You need to configure dnpbenchs/scripts/submit.bash by specifying how to submit a job to the job management system in your environment
To run:
cd dnpbenchs
./scripts/batch_sched.bash
After running experiments, the results will be saved in each Isola dir.
List of Isola dirs of dnpbenchs:
$ isola ls dnpbenchs
dnpbenchs:sched_pfor_size_n_no_buf_madm_child_stealing_prof_node_1
dnpbenchs:sched_pfor_size_n_no_buf_madm_child_stealing_ult_prof_node_1
dnpbenchs:sched_pfor_size_n_no_buf_madm_greedy_gc_prof_node_1
dnpbenchs:sched_pfor_size_n_no_buf_madm_waitq_gc_prof_node_1
dnpbenchs:sched_pfor_size_n_no_buf_madm_waitq_prof_node_1
dnpbenchs:sched_rrm_size_n_no_buf_madm_child_stealing_prof_node_1
dnpbenchs:sched_rrm_size_n_no_buf_madm_child_stealing_ult_prof_node_1
dnpbenchs:sched_rrm_size_n_no_buf_madm_greedy_gc_prof_node_1
dnpbenchs:sched_rrm_size_n_no_buf_madm_waitq_gc_prof_node_1
dnpbenchs:sched_rrm_size_n_no_buf_madm_waitq_prof_node_1
The results are stored as an SQLite DB file in each isola.
To check the DB file:
$ sqlite3 $(isola where dnpbenchs:sched_pfor_size_n_no_buf_madm_greedy_gc_prof_node_1)/dnpbenchs/result.db
sqlite> select * from dnpbenchs;
...
The raw result files are also available:
$ ls $(isola where dnpbenchs:sched_pfor_size_n_no_buf_madm_greedy_gc_prof_node_1)/dnpbenchs/out
node_1_n_1048576_m_10000_i_1.txt node_1_n_32768_m_10000_i_1.txt
node_1_n_131072_m_10000_i_1.txt node_1_n_4096_m_10000_i_1.txt
node_1_n_16384_m_10000_i_1.txt node_1_n_524288_m_10000_i_1.txt
node_1_n_2097152_m_10000_i_1.txt node_1_n_65536_m_10000_i_1.txt
node_1_n_262144_m_10000_i_1.txt node_1_n_8192_m_10000_i_1.txt
Variables in dnpbenchs/scripts/batch_sched.bash:
REPEAT_OUTSIDE
: the number of repetitions outside each program invocation
REPEAT_INSIDE
: the number of repetitions inside each program invocation
  - Initially set to 3; to get more stable results, please increase this value (e.g., to 100)
DNP_MAX_NODES
: the number of nodes to be used
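For instance, a hypothetical configuration for a more thorough run might be (values are placeholders):
REPEAT_OUTSIDE=3     # program invocations per setting
REPEAT_INSIDE=100    # repetitions within each invocation
DNP_MAX_NODES=16     # number of nodes to use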
First, you need to modify the plotting scripts dnpbenchs/plot/sched_bench_size_n.py and dnpbenchs/plot/sched_bench_prof.py.
Variables to be set:
machine
: the machine name (which you set in envs.sh)
nodes
: the number of nodes you used to run the benchmarks
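A hypothetical setting in the plotting scripts (values are placeholders):
machine = "mycluster"  # must match MACHINE_NAME in envs.sh
nodes = 16             # the number of nodes used to run the benchmarks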
To get the plot, run:
cd dnpbenchs/plot
python3 sched_bench_size_n.py
python3 sched_bench_prof.py
It will collect results from isola dirs and output the plot .html files in dnpbenchs/plot.
You can open them in your browser.
Note: if CPU_SIMD_FREQ is not properly configured, the parallel efficiency will be incorrect.
Files:
dnpbenchs/uts/*
: UTS benchmark code
dnpbenchs/measure_jobs/uts.bash
: measurement config for UTS
dnpbenchs/scripts/batch_uts_submit.bash
: batch script to run measurement jobs in parallel by using a job management system
  - needs to be manually configured to specify which tree sizes and how many nodes to use
  - You need to configure dnpbenchs/scripts/submit.bash by specifying how to submit a job to the job management system in your environment; currently this is mandatory for UTS
  - If you find it burdensome to configure, you can manually check the performance of UTS, as described in the "Play with Benchmarks" section below (a sketch follows this list)
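For reference, such a manual run might look like the following sketch, modeled on the "Play with Benchmarks" commands below (the existence and name of a run_jobs script for UTS is an assumption):
cd dnpbenchs
isola run -b massivethreads-dm:greedy_gc ./run_jobs/uts.bash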
To submit jobs for UTS with MassiveThreads/DM:
cd dnpbenchs
./scripts/batch_uts_submit.bash <your_machine_name>
Some notes:
- All of these UTS implementations were modified from the originals so that experimental conditions match our settings
  - e.g., to run multiple times within a single program invocation
- These experiments do not require Isola
Files in each dir:
make.bash
: script to compile runtimes and UTS
  - needs to be configured to load the environment for compilation
run_uts.bash
: script to run UTS
  - needs to be configured to specify how to invoke the mpirun command
submit.bash
: script to submit a job to the system job manager
  - needs to be configured to specify how jobs should be submitted to the system job manager
submit_batch.bash
: batch script to submit multiple jobs
  - needs to be manually configured to specify which tree sizes and how many nodes to use
  - to get more stable results, increase the REPEATS value
collect_result.bash
: collects the results as result.db
NOTE: the above files do not load the environment from envs.sh; you need to manually configure each of them.
Run the following commands to compile, submit, and collect the results in each directory:
cd <runtime>
./make.bash
./submit_batch.bash
# after completion of all jobs
./collect_result.bash
where <runtime> is either saws, charm, or x10.
You can see the collected results using sqlite3:
$ sqlite3 <runtime>/result.db
sqlite> select * from <runtime>_uts;
...
To get the plot, run:
cd dnpbenchs/plot
python3 uts.py
Variables:
n_runs
: the number of repetitions
  - needs to be set properly, because only the last n_runs runs are used for plotting (to exclude warm-up runs)
get_n_cores_per_node()
function: returns the number of cores per node for each machine
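A hypothetical configuration in dnpbenchs/plot/uts.py (the values, and the exact signature of the function, are assumptions):
n_runs = 10  # only the last n_runs runs are used for plotting

def get_n_cores_per_node(machine):
    # cores per node for each machine name (MACHINE_NAME in envs.sh)
    return {"mycluster": 36}.get(machine, 1)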
Files:
dnpbenchs/lcs.cpp
: LCS benchmark code
dnpbenchs/measure_jobs/lcs.bash
: measurement config for LCS
dnpbenchs/scripts/batch_lcs_submit.bash
: batch script to run measurement jobs in parallel by using a job management system
  - see the description of UTS
To submit jobs:
cd dnpbenchs
./scripts/batch_lcs_submit.bash <your_machine_name>
For waitq_gc_multi and child_stealing_ult_multi (TABLE III), please run the LCS benchmark manually by consulting the "Play with Benchmarks" section below.
To get the plot, run:
cd dnpbenchs/plot
python3 lcs.py
Variables:
n_runs
: the number of repetitions
  - needs to be set properly, because only the last n_runs runs are used for plotting (to exclude warm-up runs)
get_n_cores_per_node()
function: returns the number of cores per node for each machine
The dnpbenchs/run_jobs dir contains script files to run each benchmark program.
Please configure the dnpbenchs/run_jobs/common.bash file similarly to dnpbenchs/measure_jobs/common.bash.
To run the synthetic benchmark with the greedy_gc version of MassiveThreads/DM:
cd dnpbenchs
isola run -b massivethreads-dm:greedy_gc ./run_jobs/sched_bench.bash
The output will look like:
...
## Running on 1 nodes
### N = 131072, M = 10000
Work: 6,553,600,000 ns
Span: 50,000 ns
Ideal: 182,044,444 ns
MADM_COMM_ALLOCATOR_INIT_SIZE = 2097152
MADM_FUTURE_POOL_BUF_SIZE = 131072
=============================================================
[pfor]
# of processes: 36
N (Input size): 131072
M (Leaf execution time in ns): 10000
# of blocks: 5
# of repeats: 10
# of warmup runs: 1000
CPU frequency: 3.300000 GHz
# of loop iterations at leaf: 4125
-------------------------------------------------------------
uth options:
MADM_DEBUG_LEVEL = 0, MADM_DEBUG_LEVEL_RT = 5, MADM_CORES = 16, MADM_SERVER_MOD = 0, MADM_GASNET_POLL_THREAD = 0, MADM_GASNET_SEGMENT_SIZE = 0
MADM_STACK_SIZE = 1048576, MADM_TASKQ_CAPACITY = 1024, MADM_PROFILE = 0, MADM_STEAL_LOG = 0, MADM_ABORTING_STEAL = 1
=============================================================
warmup time: 2158617590 ns
[0] 189445692 ns ( #tasks: 655360 work: 6632153500 ns span: 3335224 ns busy: 97.245012 % )
[1] 189493342 ns ( #tasks: 655360 work: 6628007344 ns span: 161585 ns busy: 97.159780 % )
[2] 189273412 ns ( #tasks: 655360 work: 6625861341 ns span: 268257 ns busy: 97.241182 % )
[3] 189092104 ns ( #tasks: 655360 work: 6627959055 ns span: 1926540 ns busy: 97.365236 % )
[4] 189366113 ns ( #tasks: 655360 work: 6626216833 ns span: 228027 ns busy: 97.198794 % )
[5] 189167448 ns ( #tasks: 655360 work: 6626494767 ns span: 405674 ns busy: 97.304954 % )
[6] 189203665 ns ( #tasks: 655360 work: 6625076636 ns span: 97958 ns busy: 97.265508 % )
[7] 188743016 ns ( #tasks: 655360 work: 6625011176 ns span: 132042 ns busy: 97.501933 % )
[8] 189261201 ns ( #tasks: 655360 work: 6624877476 ns span: 205415 ns busy: 97.233016 % )
[9] 188965016 ns ( #tasks: 655360 work: 6624631946 ns span: 157376 ns busy: 97.381811 % )
Notes:
- uth options are obsolete.
- For each result line:
  - The execution time is shown in nanoseconds
  - #tasks: the number of leaf tasks
  - work: measured total work
  - span: measured span (critical path length)
  - busy: busyness of workers (work / (execution time) / (# of processes))
- M (Leaf execution time in ns) should be (almost) equal to work / #tasks.
  - e.g., for the 0th iteration in the above case, 6632153500 / 655360 = 10120, which is close to M = 10000
  - If not, the environment variable CPU_SIMD_FREQ is incorrectly configured
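As a similar sanity check of the busy formula, take the 0th result line above: 6632153500 / (189445692 × 36) ≈ 0.97245, which matches the reported busy: 97.245012 %.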
You can configure benchmark settings in dnpbenchs/run_jobs/sched_bench.bash; for example, by uncommenting compute_pattern=rrm, you can run the RecPFor benchmark.
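A hypothetical excerpt of dnpbenchs/run_jobs/sched_bench.bash illustrating this (the surrounding contents are assumptions):
compute_pattern=pfor
# compute_pattern=rrm   # uncomment to run RecPFor instead of PFor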
By default, three builds, greedy_gc_trace, waitq_gc_trace, and child_stealing_ult_trace, are available for tracing.
To trace the execution of synthetic benchmarks, you need to set prof=1 in dnpbenchs/run_jobs/sched_bench.bash.
An example to trace execution with greedy_gc_trace:
cd dnpbenchs
isola create -b massivethreads-dm:greedy_gc_trace -o test ./run_jobs/sched_bench.bash
The trace output will be saved in the isola dnpbenchs:test.
Then, you can visualize the timeline (like Fig. 7) by running:
cd dnpbenchs/plot
python3 timeline.py
(it will take a lot of time if the problem size is large)
Additional notes:
- The output mlog.ignore can be viewed as a timeline (Gantt chart) by using the MassiveLogger viewer
- Other events are also available; you can add more events to enabled_kinds in massivethreads-dm/make_common.bash
  - list of events: massivethreads-dm/greedy_gc/comm/include/madm_logger_kind.h
The raw_results dir contains raw data and generated figures used for the paper.
Subdirs:
raw_results/ito-a
: ITO supercomputer (subsystem A) at Kyushu University
raw_results/wisteria-o
: Wisteria/BDEC-01 Odyssey at The University of Tokyo
viewer.py will launch a convenient viewer for the result HTML files:
cd raw_results/ito-a
python3 viewer.py
Each subdir also includes python scripts to generate plots from raw data.
For example, if you want to generate plots for UTS on ITO-A, run:
cd raw_results/ito-a
python3 uts.py