HPCToolkit / hpctoolkit

HPCToolkit performance tools: measurement and analysis components

hpcrun hangs, please assist

mfbarad opened this issue · comments

When I run our executable under hpcrun, the code gets stuck at startup. Our code is dynamically linked and uses MPI. I have tried running it various ways and it still hangs; the simplest version is as follows:

hpcrun app inputfile

Also hanging:
hpcrun -t -e PAPI_TOT_CYC app inputfile
mpiexec -perhost 2 hpcrun -t -e PAPI_TOT_CYC app inputfile

I put a std::cout << "debug" << std::endl; as the first statement in main, and it never shows up. In another terminal, top shows the code using 100% of each core, so it seems to be doing something. Is this just a matter of us not waiting long enough? I'm not sure how long to let it run, since there is no indicator of progress.
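One quick way to tell whether the hung ranks are spinning in user space or blocked in the kernel is to inspect /proc for one of them. A sketch (here "app" is a placeholder for the executable's name):

```shell
# Find one rank's PID (assumes the binary is named "app") and check its state.
pid=$(pgrep -n app)
grep -E '^(State|Threads)' /proc/$pid/status   # R = running/spinning, S/D = sleeping or blocked
cat /proc/$pid/wchan; echo                     # kernel symbol the process waits in, or 0 if runnable
```

A state of R with 100% CPU suggests a spin loop rather than a deadlocked wait.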

When I do "hpcrun ls" it does not hang and seems to produce something usable.

I built hpctoolkit using spack following your install directions. We are on TOSS3 / RHEL7.

We are new to hpctoolkit so likely we are making a simple mistake.
Thanks,
Mike

In general, HPCToolkit supports measuring dynamically-linked MPI applications.

When you say that

hpcrun app inputfile

fails, do you mean an application compiled with OpenMPI that self-launches when run without an MPI launcher? If so, that is a known issue; use an MPI launcher to work around it.

mpiexec -perhost 2 hpcrun -t -e PAPI_TOT_CYC app inputfile

This should work. Are you running the 2022.10 release of HPCToolkit? Does your MPI application use GPUs?

Hi John,

$ hpcrun --version
hpcrun: A member of HPCToolkit, version 2022.10.01-release
git branch: unknown (not a git repo)
spack spec: hpctoolkit@2022.10.01%gcc@10.2.0craycudadebuglevel_zerompiopencl+papi~rocm+viewer build_system=autotools arch=linux-rhel7-x86_64/x4hfzn4
install dir: /swbuild/mbarad/LAVA_GROUP/LAVA_deps/spack/linux-rhel7-x86_64/gcc-10.2.0/hpctoolkit-2022.10.01-x4hfzn4wjy67u36chwneyciibnimekge

This version of the app is CPU only (no GPUs). The MPI is HPE MPT, on NASA's Pleiades supercomputer.

Thanks for your help,
Mike

Is there anything else that I can do to help figure this out? It would be great to get it working. We have a bunch of NASA users who will benefit from this. Thanks

Can you give us a backtrace from a hanging process? Attach to one of your MPI ranks with gdb and then request a backtrace with gdb's backtrace command.

That will give us a sense of what is happening and hopefully help us understand how to fix the problem.
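A non-interactive way to capture this (a sketch; "app" again stands in for your executable's name, and attaching may require ptrace permission on the node):

```shell
# Attach gdb in batch mode to one hung rank and dump backtraces for all threads.
# "app" is a placeholder for your executable's name.
pid=$(pgrep -n app)
gdb -p "$pid" -batch -ex "thread apply all backtrace" > rank-backtrace.txt 2>&1
```

The resulting rank-backtrace.txt can be attached to this issue.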

You might try using a trivial MPI program instead of your real application to see if that also causes the hang.

We have some simple regression tests for this purpose.

git clone https://github.com/hpctoolkit/hpctoolkit-tests
cd hpctoolkit-tests/applications/loop-suite/5.loop-mpi-cputime

make

If you have an mpicc in your path, this will build and attempt to run the binary. You may need to launch the binary yourself on the compute node with

mpiexec -perhost 2 hpcrun -t -e CPUTIME ./loop

If that works, you can also try

mpiexec -perhost 2 hpcrun -t -e cycles ./loop

and

mpiexec -perhost 2 hpcrun -t -e PAPI_TOT_CYC ./loop

I'll note that we have had trouble with HPE MPI before. That led us to write the following document, https://bit.ly/glibc-ldaudit, and to engage with Red Hat to fix a Linux monitoring interface we need that had long been broken.

If you look at the motivation section, you'll see that our introduction describes problems caused by HPE's SGI MPI. That may be related to your trouble.