ecmwf / eckit

A C++ toolkit that supports development of tools and applications at ECMWF.

Home Page: https://confluence.ecmwf.int/display/eckit

MPI problems with srun

mmiesch opened this issue · comments

Many clusters use Slurm to manage MPI processes across multiple nodes. Often one can use mpirun or mpiexec within a Slurm batch script, but sometimes cluster admins may configure the system to use srun instead. We're finding problems in the eckit MPI initialization for jobs launched with srun.

The problem is with the size() and rank() methods of eckit::mpi::Comm: they return 1 and 0, respectively, for all MPI tasks when executables are launched with srun. Strangely, this does not trigger test failures in eckit_test_mpi - the tests pass, presumably because the MPI calls don't return error codes when run with a communicator of size 1 - for example, rank 0 can happily send to and receive from itself. Meanwhile, other tests, such as eckit_test_mpi_addcom, hang.
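For reference, the symptom can be reproduced with just a few lines. Below is a minimal stand-alone sketch (not part of the eckit test suite) that queries the default communicator the same way our applications do; on the affected systems every task launched with srun prints "rank 0 of 1".

#include <iostream>

#include "eckit/mpi/Comm.h"
#include "eckit/runtime/Main.h"

int main(int argc, char** argv) {
    eckit::Main::initialise(argc, argv);

    // Query eckit's default ("world") communicator.
    const eckit::mpi::Comm& comm = eckit::mpi::comm();
    std::cout << "rank " << comm.rank() << " of " << comm.size() << std::endl;

    return 0;
}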

However, you can make a small change to eckit/test/mpi/eckit_test_mpi.cc by adding a third check to the test_rank_size case as follows:

CASE("test_rank_size") {
    EXPECT_NO_THROW(mpi::comm().size());
    EXPECT_NO_THROW(mpi::comm().rank());
    EXPECT(mpi::comm().size() > 1);
}

Then, after compilation, one can test it with each of the following two commands:

mpiexec -n 4 eckit/tests/mpi/eckit_test_mpi 
srun -n 4 eckit/tests/mpi/eckit_test_mpi 

In the first case (mpiexec) all tests pass. In the second case, there is one (and only one!) test failure, namely test_rank_size. If one investigates this further with a debugger or print statements, one finds that indeed, all tasks believe that their rank is 0 and that the comm size is 1. I should add that the result is the same if one passes "world" to comm(): eckit::mpi::comm("world").

Some MPI jobs still appear to run fine (and many MPI tests pass). However, any job that queries eckit::mpi::Comm for its rank and then uses this information for some purpose (such as data partitioning) will likely fail.
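As a concrete (hypothetical) illustration of how this bites, consider a simple contiguous partitioning helper like the one below; the names are made up, but the pattern is typical. When every task reports rank 0 and size 1, each task silently ends up owning the entire index range instead of its own slice, so work is duplicated or the decomposition is corrupted.

#include <algorithm>
#include <cstddef>

#include "eckit/mpi/Comm.h"

// Hypothetical helper: split N elements contiguously across the MPI tasks.
// If rank() and size() wrongly report 0 and 1, every task "owns" all N elements.
void localRange(std::size_t N, std::size_t& begin, std::size_t& end) {
    const eckit::mpi::Comm& comm = eckit::mpi::comm();
    const std::size_t size  = comm.size();
    const std::size_t rank  = comm.rank();
    const std::size_t chunk = (N + size - 1) / size;  // ceiling division
    begin = std::min(rank * chunk, N);
    end   = std::min(begin + chunk, N);
}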

We believe that it is a problem with eckit because a simple "hello world" test program such as this runs fine with both mpirun/mpiexec and srun:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  char nodename[128];
  int commsize;
  int resultlen;
  int iam;
  int ret;

  ret = MPI_Init (&argc, &argv);
  ret = MPI_Comm_size (MPI_COMM_WORLD, &commsize);
  ret = MPI_Comm_rank (MPI_COMM_WORLD, &iam);
  ret = MPI_Get_processor_name (nodename, &resultlen);

  printf ("Hello from rank %d of %d running on %s\n", iam, commsize, nodename);
  ret = MPI_Finalize ();
}

We have seen this issue on two different platforms (Amazon Web Services and the S4 cluster at the University of Wisconsin) and with multiple compiler/MPI combinations: Intel compilers with Intel MPI 17.0.1 and 17.0.6, and GNU 7.4.0 with OpenMPI 3.1.2.

Has anyone run into this problem before?

Can you check first that eckit has MPI enabled in the CMake configuration log? I guess so but just to be sure.

As for how eckit detects that it is running in an MPI environment: at run time it checks for the following environment variables to choose either the true MPI implementation or a serial MPI stubs implementation.
https://github.com/ecmwf/eckit/blob/develop/src/eckit/mpi/Comm.cc#L43-L45
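Schematically, the selection works roughly like the sketch below (a simplified paraphrase of the logic at that link, not a verbatim copy of Comm.cc); the ECKIT_MPI_FORCE override mentioned further down is also shown.

#include <cstdlib>
#include <string>

// Simplified sketch of eckit's run-time back-end selection: pick the "parallel"
// (true MPI) implementation if a known launcher left its mark in the environment,
// otherwise fall back to the serial MPI stubs.
static std::string chooseDefaultBackend() {
    if (const char* forced = ::getenv("ECKIT_MPI_FORCE")) {
        return forced;                           // explicit user override
    }
    if (::getenv("OMPI_COMM_WORLD_SIZE") ||      // OpenMPI mpirun/mpiexec
        ::getenv("ALPS_APP_PE") ||               // Cray ALPS
        ::getenv("PMI_SIZE")) {                  // Intel MPI / PMI
        return "parallel";
    }
    return "serial";                             // no launcher detected: use stubs
}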

I should mention that I have been able to use srun in the past without issues, provided that the environment variable SLURM_EXPORT_ENV=ALL is exported, which is required to propagate environment variables to the srun'd program.

If that doesn't work, you can force eckit to choose the "parallel" (MPI) implementation by setting the environment variable:
export ECKIT_MPI_FORCE=parallel

Thanks @wdeconinck for the helpful information. It turns out that MPI was enabled, and setting SLURM_EXPORT_ENV=ALL did not help. But export ECKIT_MPI_FORCE=parallel does seem to work!

Though this does work for parallel applications, it's not an ideal solution for unit testing with ctest, of course. Setting this variable helps the parallel tests pass, but it breaks all the serial tests. So if you have any other ideas on how one might get around this, please let me know. Thanks again.

Ideally the fix is to figure out which environment variable we can add to the list that eckit already checks (e.g. OMPI_COMM_WORLD_SIZE, ALPS_APP_PE, PMI_SIZE).
To figure it out, one could run:

srun -n 1 env

and see which environment variable srun adds that we can detect MPI with. You can then test it by adding it to the code here https://github.com/ecmwf/eckit/blob/develop/src/eckit/mpi/Comm.cc#L43-L45. If it works, we can then patch eckit.

Another ugly workaround or hack, which I have used and which might satisfy you for now, is to create an srun wrapper called e.g. parallel_run containing the lines:

#!/bin/bash
export ECKIT_MPI_FORCE=parallel
srun "$@"

Then CMake-configure with -DMPIEXEC_EXECUTABLE=<path/to/parallel_run> -DMPIEXEC_NUMPROC_FLAG='-n'

Thanks @wdeconinck - the Slurm output environment variables are listed here, confirmed with the srun -n1 env command you suggested. I think the best bet is likely SLURM_NTASKS.

I will edit the code as suggested and let you know if it works.

Thanks for the workaround suggestion but I think this would have the same problem as before - all serial jobs would fail because they would invoke the Parallel subclass of eckit::mpi::Comm, right?

Thanks again @wdeconinck - your suggested solution works. I could do a formal pull request but it might be easier for you to just apply the patch on your end.

Here is the code that works:

        else if (::getenv("OMPI_COMM_WORLD_SIZE") ||  // OpenMPI
                 ::getenv("ALPS_APP_PE") ||           // Cray PE
                 ::getenv("PMI_SIZE") ||              // Intel
                 ::getenv("SLURM_NTASKS")) {          // slurm srun

Thank you @mmiesch, the modification has been submitted as a PR upstream and should make its way in soon.

Great - Thanks again @wdeconinck for your help.

@tlmquintino This issue can be closed now ( I don't have permissions ).