ecmwf / eckit

A C++ toolkit that supports development of tools and applications at ECMWF.

Home Page: https://confluence.ecmwf.int/display/eckit


eckit::Comm constructor in a container

mmiesch opened this issue

I'm trying to run an eckit application (JEDI) in a Singularity container on an HPC system. In order to run on multiple nodes (864 MPI tasks, 20 tasks/node), I'm using what Singularity describes as the hybrid model, meaning that I use the native (system) MPI but also have MPI installed in the container. The MPI versions are compatible: Intel MPI 17.0.6 outside the container and Intel MPI 17.0.1 inside it. The code in the container is compiled with the Intel 17.0.1 compilers (C++, C, and Fortran). Slurm is the parallel job scheduler.

I don't know if you're familiar with Singularity, but for what it's worth, this is my run command:

```shell
srun --ntasks=864 --cpu_bind=cores --distribution=block:block --verbose \
    singularity exec -e --home=$PWD $JEDICON/jedi-intel17-impi-hpc-dev.sif \
    ${JEDIBIN}/fv3jedi_var.x Config/3dvar_bump.yaml
```

The -e option to singularity exec cleans the environment, so the container does not see any of the environment variables on the host system. I find that this is required in order to successfully run an MPI "hello world" program. If I leave it out, I get errors from Singularity saying that it cannot read the Slurm configuration file.
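For reference, the "hello world" test I'm referring to is essentially the following; it is plain MPI with nothing eckit-specific, so it only exercises the hybrid MPI setup itself:

```cpp
// Minimal MPI hello world: each task reports its rank and the world size.
// With a working hybrid setup this prints 864 distinct ranks.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = -1, size = -1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    std::printf("hello from rank %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}
```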

So here is the problem. JEDI uses eckit to do its MPI initialization. When I include the -e option (required for hello world, as described above), the eckit::Comm object is not properly constructed: it mistakenly instantiates the Serial subclass instead of Parallel, and the application fails because all 864 MPI processes think they are rank 0. If I instead omit the -e option (so all of the host's environment variables are visible to the application), I get the same Slurm configuration error as with hello world, but at least the Parallel constructor is called.
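To make the symptom easy to reproduce outside of JEDI, here is a minimal probe of the kind I have in mind; it assumes eckit's eckit/mpi/Comm.h and eckit/runtime/Main.h headers and a program built and linked against eckit the same way JEDI is:

```cpp
// Sketch of a probe for which eckit::Comm back end gets selected.
// With the Parallel back end every task should report size 864 and a
// distinct rank; with Serial every task reports size 1 and rank 0.
#include <iostream>

#include "eckit/mpi/Comm.h"
#include "eckit/runtime/Main.h"

int main(int argc, char** argv) {
    eckit::Main::initialise(argc, argv);

    const eckit::mpi::Comm& comm = eckit::mpi::comm();
    std::cout << "eckit comm size=" << comm.size()
              << " rank=" << comm.rank() << std::endl;

    return 0;
}
```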

So, in summary, I believe eckit is somehow reading environment variables to determine whether or not the application is parallel - i.e. whether to instantiate the Parallel or Serial subclass of eckit::Comm. My question is: which variables? I know of at least one place where eckit accesses the environment variable SLURM_NTASKS to determine whether or not the job is parallel:

::getenv("SLURM_NTASKS"))) { // slurm srun

But this is not enough. When I use the -e (clean environment) option and explicitly export SLURM_NTASKS into the container, I still get the Serial instantiation of eckit::Comm. I see here that another requirement for instantiating the Parallel class is that the ECKIT_HAVE_MPI macro, set by CMake, is defined:

```cpp
#ifdef ECKIT_HAVE_MPI
```

But this is indeed set - the code was compiled with MPI enabled, as seen, for example, here in the ecbuild.log file:

```
2020-03-14T14:12:06 - fckit - DEBUG - ecbuild_evaluate_dynamic_condition(_fckit_test_mpi_condition): checking condition 'HAVE_ECKIT AND ECKIT_HAVE_MPI' -> TRUE
```
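As an extra sanity check, a throwaway probe like this (hypothetical, compiled inside the same build environment as JEDI so the macro comes from the build system in the same way) would confirm that the define actually reaches the compiler:

```cpp
// Hypothetical compile-time probe for the ECKIT_HAVE_MPI definition.
#include <iostream>

int main() {
#ifdef ECKIT_HAVE_MPI
    std::cout << "ECKIT_HAVE_MPI is defined at compile time" << std::endl;
#else
    std::cout << "ECKIT_HAVE_MPI is NOT defined at compile time" << std::endl;
#endif
    return 0;
}
```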

What other information from the runtime environment does eckit need to determine that the application is parallel? In other words - what more can I do to force eckit to instantiate the Parallel class? Thanks for any insight you can provide.

I should also mention that, as noted in the previous, related issue, I have set these both inside and outside the container (in addition to SLURM_NTASKS):

```shell
export SLURM_EXPORT_ENV=ALL
export ECKIT_MPI_FORCE=parallel
```

Update - I discovered that the problem is not with eckit after all: MPI_Comm_size() and MPI_Comm_rank() are returning 1 and 0 for all tasks, so this is an MPI issue. I'll close this issue for now until I investigate further - sorry for the false alarm.
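For the record, the kind of diagnostic I'm using inside the container is roughly this; it compares what MPI reports against what the launcher exported (SLURM_PROCID and PMI_RANK), to see whether the MPI library is actually attaching to the Slurm/PMI job:

```cpp
// Diagnostic sketch: compare MPI's view of rank/size with the launcher's
// environment. If SLURM_PROCID varies per task but MPI always reports
// rank 0 of 1, the MPI library is not picking up the PMI job information.
#include <mpi.h>
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);

    int rank = -1, size = -1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const char* slurmProcid = std::getenv("SLURM_PROCID");
    const char* pmiRank     = std::getenv("PMI_RANK");

    std::printf("MPI rank %d of %d | SLURM_PROCID=%s PMI_RANK=%s\n",
                rank, size,
                slurmProcid ? slurmProcid : "(unset)",
                pmiRank ? pmiRank : "(unset)");

    MPI_Finalize();
    return 0;
}
```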