NLKNguyen / alpine-mpich

MPI Cluster Automation Solution using Docker, based on Alpine Linux with MPICH (see IEEE paper)

Home Page:https://github.com/NLKNguyen/alpine-mpich


Not all containers of a service are listed in the /etc/opts/hosts file


I have created a service with 16 containers and am running an MPI task from the master node. I noticed that not all of the service containers are taking load. When I opened the /etc/opts/hosts file, which is supposed to list all service containers, I found that most of the time 2-3 containers are missing from it.

I have figured out that the problem is the "netstat -t" command inside get_hosts: it cannot resolve all container names and therefore returns fewer addresses most of the time.

Are you using the Single Host or the Multi Host orchestration? And which version of Docker are you running?

I notice that in the Multi Host solution, some services occasionally become available late, and I have to rerun the commands to get them all up.

Any alternative suggestion to netstat -t is welcome. At some point I'll look into the new Docker (I haven't checked since January but heard some big noise in the summer) to see whether anything has been updated that could provide a better solution to this topic.
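
Until there is a replacement for netstat, the "rerun until everything is up" step can be automated. The following is only a minimal sketch, not part of the project: `wait_for_hosts` is a hypothetical helper that reruns any command printing one host per line until at least the expected number of hosts is reported or the attempts run out.

```shell
#!/bin/sh
# Sketch: rerun a host-listing command until enough hosts show up.
# wait_for_hosts is a hypothetical helper, not part of alpine-mpich.
# Usage: wait_for_hosts EXPECTED ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_for_hosts() {
    expected=$1
    attempts=$2
    delay=$3
    shift 3
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Run the listing command and count its non-empty output lines.
        hosts=$("$@")
        count=$(printf '%s\n' "$hosts" | grep -c . || true)
        if [ "$count" -ge "$expected" ]; then
            printf '%s\n' "$hosts"
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    # Not enough hosts after all attempts; let the caller decide what to do.
    return 1
}
```

For example, `wait_for_hosts 16 10 3 some_listing_command > hosts` would retry up to 10 times, 3 seconds apart, and write the list only once 16 hosts are visible. Here `some_listing_command` stands for whatever produces the host list in your setup.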

I am using multiple hosts, and the Docker version is 1.16.0.
"netstat" is slow and does not pick up all of the container addresses.
I made a local script that prepares the list of hosts and copies the file into the master container with scp before I log in and start the MPI task from inside.
The commands
"docker service ps --no-trunc master-service-name"
"docker service ps --no-trunc worker-service-name"
give all the literals required to prepare the host file.

I did it in Java/Python, but to keep your project as it is, it would be better to use another shell script to populate the file.
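
A shell version of that idea might look roughly like the sketch below. It rests on two assumptions: that Swarm names each task container `<task-name>.<task-id>` (for example `worker.1.<id>`), and that those names resolve on the overlay network the MPI job uses. The helper names `tasks_to_hosts` and `service_hosts` are hypothetical, and the service names are the same placeholders as above.

```shell
#!/bin/sh
# Sketch: build host file lines from `docker service ps` output.
# Assumption: each task container is named <task-name>.<task-id> and that
# name is resolvable from the master container.

# Reads "<task-name> <task-id> <desired-state>" lines on stdin and prints
# one container name per task whose desired state is Running.
tasks_to_hosts() {
    awk '$3 == "Running" { print $1 "." $2 }'
}

# Hypothetical wrapper: list the running task containers of one service.
# Must be run on a Swarm manager node.
service_hosts() {
    docker service ps --no-trunc \
        --format '{{.Name}} {{.ID}} {{.DesiredState}}' "$1" | tasks_to_hosts
}
```

The host file could then be assembled with `{ service_hosts master-service-name; service_hosts worker-service-name; } > hosts` and copied into the master container as described above. Filtering on the desired state skips tasks that Swarm has already shut down and replaced.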

I noticed similar issues while running MPI jobs. Some of the worker nodes occasionally go missing from /etc/opts/hosts. This doesn't cause problems for short MPI jobs, but some longer jobs hang forever.

Any ideas to bring the hanging jobs back?

This might be a similar issue to #4 and netstat.

I've produced a solution using dig based on https://stackoverflow.com/questions/49446165/how-to-get-all-ip-addresses-on-a-docker-network

I'll make a pull request.
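
For readers who want to try the idea before that lands, here is a minimal sketch of a dig-based lookup. It is not the code from the pull request. It assumes the containers run as a Swarm service, so that Docker's embedded DNS answers `tasks.<service>` with one address per task, and that dig is installed in the container. The helper names `filter_ips` and `lookup_service_ips` are hypothetical.

```shell
#!/bin/sh
# Sketch: list the task IPs of a Swarm service through Docker's DNS.
# Assumption: tasks.<service> resolves to one A record per running task.

# Keeps only lines that look like IPv4 addresses, sorted and de-duplicated,
# so stray warnings or comments from dig do not end up in the host file.
filter_ips() {
    grep -E '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$' | sort -u
}

# Hypothetical wrapper: run from a container attached to the service network.
lookup_service_ips() {
    dig +short "tasks.$1" | filter_ips
}
```

Running `lookup_service_ips worker-service-name` inside the master container would print one IP address per line, which can be written to the hosts file in place of the netstat output.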