eckit::Comm constructor in a container
mmiesch opened this issue · comments
I'm trying to run an eckit application (JEDI) in a Singularity container on an HPC system. To run on multiple nodes (864 MPI tasks, 20 tasks/node), I'm using what Singularity describes as the hybrid model: I use the native (system) MPI, but MPI is also installed in the container. The MPI versions are compatible - I'm using Intel MPI version 17.0.6 outside the container and Intel MPI version 17.0.1 inside. The code in the container is compiled with the Intel 17.0.1 compilers (C++, C, and Fortran). slurm is the parallel job scheduler.
I don't know if you're familiar with Singularity, but, for what it's worth, this is my run command:

```shell
srun --ntasks=864 --cpu_bind=cores --distribution=block:block --verbose \
  singularity exec -e --home=$PWD $JEDICON/jedi-intel17-impi-hpc-dev.sif \
  ${JEDIBIN}/fv3jedi_var.x Config/3dvar_bump.yaml
```
The `-e` option to `singularity exec` cleans the environment, so the container does not see any of the environment variables on the host system. I find that this is required in order to successfully run an MPI "hello world" program. If I leave it out, I get errors from singularity that it cannot read the slurm configuration file.
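For anyone unfamiliar with this behavior, the effect of `-e` can be approximated on any host with `env -i`, which starts a child process with a scrubbed environment. This is only an analogy to illustrate what the container stops seeing, not Singularity itself:

```shell
# Rough analogy to `singularity exec -e`: env -i starts the child process
# with an empty environment, so inherited launcher variables such as
# SLURM_NTASKS disappear inside it.
export SLURM_NTASKS=864
sh -c 'echo "before: SLURM_NTASKS=${SLURM_NTASKS:-unset}"'          # prints: before: SLURM_NTASKS=864
env -i sh -c 'echo "after:  SLURM_NTASKS=${SLURM_NTASKS:-unset}"'   # prints: after:  SLURM_NTASKS=unset
```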
So here is the problem. JEDI uses eckit to do its MPI initialization. When I include the `-e` option (required for hello world, as described above), the `eckit::Comm` object is not properly constructed: it mistakenly instantiates the `Serial` subclass instead of `Parallel`, and the application fails because all 864 MPI processes think that they are rank 0. But if I omit the `-e` option (so all environment variables on the host are available to the application), then I get the same slurm configuration error as with hello world. At least it calls the `Parallel` constructor, though.
So, in summary, I believe eckit is somehow reading environment variables to determine whether or not the application is parallel - i.e. whether to instantiate the `Parallel` or `Serial` subclass of `eckit::Comm`. My question is: which variables? I know of at least one place where eckit checks the environment variable `SLURM_NTASKS` to determine whether or not the job is parallel:
Line 46 in 341babd
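For what it's worth, here is a sketch of the kind of environment-based detection I mean. `SLURM_NTASKS` is the variable confirmed by the linked line; the other names (`OMPI_COMM_WORLD_SIZE`, `PMI_SIZE`) are my guesses at what similar launcher checks could look like, not eckit's actual list:

```shell
# Illustrative only: decide "parallel" vs "serial" from launcher-set
# environment variables, in the spirit of the linked eckit check.
# Variable names other than SLURM_NTASKS are assumptions.
detect_comm() {
  if [ -n "${SLURM_NTASKS:-}" ] || [ -n "${OMPI_COMM_WORLD_SIZE:-}" ] || [ -n "${PMI_SIZE:-}" ]; then
    echo parallel
  else
    echo serial
  fi
}

unset SLURM_NTASKS OMPI_COMM_WORLD_SIZE PMI_SIZE
detect_comm                    # prints: serial   (cleaned environment, as with -e)
SLURM_NTASKS=864 detect_comm   # prints: parallel
```

This is exactly why a cleaned environment would flip the choice to `Serial`: the launcher's variables never reach the process doing the check.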
But this is not enough. When I use `-e` (the clean-environment option) and explicitly export `SLURM_NTASKS` into the container, I still get the `Serial` instantiation of `eckit::Comm`. I see here that another requirement for instantiating the `Parallel` class is that the CMake variable `ECKIT_HAVE_MPI` is set:

Line 26 in 341babd

But this is indeed set - the code was compiled with MPI enabled, as seen, for example, in this line from the `ecbuild.log` file:
```
2020-03-14T14:12:06 - fckit - DEBUG - ecbuild_evaluate_dynamic_condition(_fckit_test_mpi_condition): checking condition 'HAVE_ECKIT AND ECKIT_HAVE_MPI' -> TRUE
```
What other information from the runtime environment does eckit need to determine that the application is parallel? In other words, what more can I do to force eckit to instantiate the `Parallel` class? Thanks for any insight you can provide.
I should also mention that, as noted in the previous related issue, I have set both of these inside and outside the container (in addition to `SLURM_NTASKS`):

```shell
export SLURM_EXPORT_ENV=ALL
export ECKIT_MPI_FORCE=parallel
```
Update - I discovered that the problem is not with eckit after all: `MPI_Comm_size()` and `MPI_Comm_rank()` are returning a size of 1 and a rank of 0 for every task. So this is an MPI issue. I'll close this issue for now until I investigate further - sorry for the false alarm.