CFD-GO / TCLB

TCLB - Templated MPI+CUDA/CPU Lattice Boltzmann code

Home Page: https://tclb.io

Takeaways from LUMI Pilot program

mdzik opened this issue · comments

TCLB was part of the LUMI (https://lumi-supercomputer.eu/lumis-second-pilot-phase-in-full-swing/) Pilot Program, which is now ending.

Apart from the performance results, there are some issues that might be worth considering. LUMI is a brand-new HPE Cray system with AMD Instinct MI250X cards (128GB HBM2e each).

  1. Performance: not there yet. Scalability up to 1024 GPUs works nicely, but per-GPU performance is unsatisfactory. Some profiling data is available and we could get more; this is a subject for another talk/issue.
  2. rinside: HPE/Cray decided to link R statically, which prevents building rcpp/rinside. HPC centers will resist installing non-HPE versions of R, so users will need to build it themselves. This makes life harder, especially on Cray :/
  3. GPU-to-core/process allocation has to be done in a specific order. This can be handled by a shell script and the ROCR_VISIBLE_DEVICES variable, but TCLB then requires gpu_oversubscribe="True" in the XML to run. I think a configure flag to disable that check would be nice. Imagine you copy an XML from another system and it fails after 2 days in the queue because of that.
  4. More generic in-situ processing would be nice, something like ADIOS. I could work on that, but let me know if you are interested.

As for results, I ran a 0.8e9-node lattice dissolution simulation for AGU :D That is around half of the 12cm experimental core at 30um resolution.

I still have a few days left - if you want to check something, we could do it.

Thanks for sharing this info.

Ad 1: Can you share some performance data? E.g. how does it compare to NVIDIA?

Ad 2: I think for now compiling R yourself is the only option. Thankfully, I have not had problems compiling R on any architecture yet.
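
For generic users, a rough sketch of a user-local R build with a shared libR (which Rcpp/RInside need to link against); the version number and install prefix are placeholders, not taken from the LUMI setup:

# build R from source into the user's home; --enable-R-shlib produces the shared libR that RInside requires
wget https://cran.r-project.org/src/base/R-4/R-4.2.1.tar.gz
tar xf R-4.2.1.tar.gz && cd R-4.2.1
./configure --prefix=$HOME/opt/R --enable-R-shlib --with-x=no
make -j 8 && make install
# put the new R first on PATH and install the packages needed for TCLB's rinside support
export PATH=$HOME/opt/R/bin:$PATH
Rscript -e 'install.packages(c("Rcpp","RInside"), repos="https://cloud.r-project.org")'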

Ad 3: That's interesting. Setting gpu_oversubscribe to true means TCLB sees only one card. Did you bind GPUs or set ROCR_VISIBLE_DEVICES? Can you share your sbatch scripts?

Ad 4: I remember you were doing some ADIOS integration. How does ADIOS relate to, for example, Catalyst? And how does it relate to HDF5? Feel free to start another issue on this subject, as I think it doesn't relate much to HIP/LUMI.

  1. Around 50% of V100 performance. And the V100 run was in double precision while the MI250 run was in float. That was a bit of a mistake, but I don't expect it to be faster in double than in float?
  2. The problem is that it's easy for us. Some instructions on how to get it working with TCLB might be useful for generic users ;)

cat run.slurm

#!/bin/bash -l

# load the TCLB environment (modules, paths)
. ~/tclb_env.sh

BIN="/users/XXXX/TCLB/CLB/auto_porous_media_d3q19_TRT_GinzburgEqOrd1_FlowInZ/main ./RunMe.xml"
# wrap every rank in the GPU selection script below
srun /users/XXXX/select_gpu.sh $BIN

/users/XXX/select_gpu.sh:

#!/bin/bash
# pin each rank to one GCD; this is the device order used on the LUMI nodes
GPUSID="4 5 2 3 6 7 0 1"
GPUSID=(${GPUSID})
if [ ${#GPUSID[@]} -gt 0 ]; then
  # map the node-local rank (SLURM_LOCALID) onto one entry of GPUSID
  export ROCR_VISIBLE_DEVICES=${GPUSID[$((SLURM_LOCALID / ($SLURM_NTASKS_PER_NODE / ${#GPUSID[@]})))]}
fi
echo "Selected GPU: $ROCR_VISIBLE_DEVICES"
exec "$@"
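
For context, a hypothetical submission line for the scripts above; the node count and partition name are assumptions (account and walltime options omitted), and the mapping in select_gpu.sh assumes SLURM_NTASKS_PER_NODE is a multiple of the 8 GCDs per node:

# assumed example: 8 ranks per node, one GCD each; partition name is a placeholder
sbatch --nodes=16 --ntasks-per-node=8 --gpus-per-node=8 --partition=standard-g run.slurm
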
  4. ADIOS could work as a middleman, as far as I can tell. TCLB would pipeline data output to it, and the rest would happen outside TCLB. ParaView has support for it, and you can write some in-pipeline operations if you want. And it can write hdf files, I believe. Currently, ParaView in-situ is a "bit" complicated to set up, but it is a really nice tool. Especially when a single hdf5 output file is around 45G of data and you need ~200G of memory to load it. The question is: is there any interest, besides mine, in it?

Some side notes, I forgot:

These should be turned off by default (or at configure time?). They produce thousands of files/lines for big runs:

  • the MPMD: TCLB: local:108/128 work:108/128 --- connected to: printouts
  • the .xml files saved in the output directory per process

And maybe we could support data dumps in h5 format instead of lots of pri files? Or default to DUMPNAME/slice.pri. Again, lots of files per folder, and it gets complicated to handle when doing restarted simulations. Most of the time you have ~2 days' worth per run.

There is also a numbering error, at least in the pri files - we assumed that 99 GPUs would be enough, and the leading zero is messy to handle in batch ;)

# link the previous run's checkpoints for a restart; ranks 0-9 carry a padded P0<d> prefix, the rest do not
for d in {0..9}; do ln -s ./output/Restart-"$PREV"_Save_P0"$d"_00000000_"$d".pri restart_"$NEXT"_"$d".pri; done
for d in {10..1023}; do ln -s ./output/Restart-"$PREV"_Save_P"$d"_00000000_"$d".pri restart_"$NEXT"_"$d".pri; done

Ad printouts and xml: I agree it's cumbersome with many processes. There are a couple of prints to be cleaned up, and the xml files can be fixed to export just a single one.

Ad dumps: The "pri" files were just the simplest (and fastest) way of doing things. We can switch to h5, but I think it should be an option, as I don't want hdf5 to be a mandatory dependency. As for the file mess, we can come up with some good way to arrange them. You can use the keep parameter in SaveCheckpoint to set how many previous dumps you want to store. It's implemented in a safe way (it writes the new files before deleting the old ones).

Ad numbering: it is true that it was designed with two digits for the rank, but I don't know if changing it now is a good idea. I agree it's a bit messy, but you can use seq -f '%02.0f' 0 200 in bash to generate a sequence with two-digit formatting.
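
To illustrate, a hedged one-loop rewrite of the earlier link commands using that seq formatting; PREV and NEXT are the same placeholders as in the original loops:

# the P field is zero-padded to at least two digits, the trailing rank is not
for p in $(seq -f '%02.0f' 0 1023); do
  d=$((10#$p))   # strip the leading zero for the unpadded suffix
  ln -s ./output/Restart-"$PREV"_Save_P"$p"_00000000_"$d".pri restart_"$NEXT"_"$d".pri
done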

As for performance, @mdzik could you run some tests comparing double-to-double performance between NVIDIA and AMD? The "theory" is that AMD cards are designed for double precision.

AFAIK I was mistaken - the binary was in double precision.