STEllAR-GROUP / hpx

The C++ Standard Library for Parallelism and Concurrency

Home Page: https://hpx.stellar-group.org

HPX error: "Host not found" when running on Expanse with 128 nodes

JiakunYan opened this issue

Expected Behavior

The program runs successfully.

Actual Behavior

The program fails with repeated errors like these:

hpx::init: hpx::exception caught: resolve: Host not found (authoritative) (while trying to resolve: exp-2-19:0): HPX(network_error)
hpx::init: hpx::exception caught: resolve: Host not found (authoritative) (while trying to resolve: exp-2-19:0): HPX(network_error)
hpx::init: hpx::exception caught: resolve: Host not found (authoritative) (while trying to resolve: exp-5-11:0): HPX(network_error)
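With 64 or 128 nodes the same message repeats many times, so it can help to reduce the output to the set of host names that failed to resolve. The following is a minimal sketch, not part of HPX: the function name `unresolved_hosts` is hypothetical, and the regular expression only assumes the message format shown in the three lines above.

```python
import re
from collections import Counter

# Matches the "(while trying to resolve: <host>:<port>)" fragment of the
# error message. The format is assumed from the lines quoted in this issue.
RESOLVE_RE = re.compile(r"while trying to resolve: ([^\s:)]+):(\d+)\)")


def unresolved_hosts(log_text):
    """Count how often each host name appears in a resolve failure."""
    return Counter(m.group(1) for m in RESOLVE_RE.finditer(log_text))


# The three error lines reported above, used as sample input.
sample = """\
hpx::init: hpx::exception caught: resolve: Host not found (authoritative) (while trying to resolve: exp-2-19:0): HPX(network_error)
hpx::init: hpx::exception caught: resolve: Host not found (authoritative) (while trying to resolve: exp-2-19:0): HPX(network_error)
hpx::init: hpx::exception caught: resolve: Host not found (authoritative) (while trying to resolve: exp-5-11:0): HPX(network_error)
"""

if __name__ == "__main__":
    for host, count in sorted(unresolved_hosts(sample).items()):
        print(host, count)
```

Feeding the full job output through this function would show whether the failures are confined to a few nodes or spread across the whole allocation.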

Steps to Reproduce the Problem

I have only seen this error when running HPX on SDSC Expanse with 64 or 128 nodes. The TCP parcelport is disabled in my build:

$ cat ~/opt/hpx/local/build/CMakeCache.txt | grep TCP
//Enable the TCP based parcelport.
HPX_WITH_PARCELPORT_TCP:BOOL=OFF

The command line I used to launch the program (I kept only the arguments related to HPX):

srun --mpi=pmix octotiger --hpx:ini=hpx.stacks.use_guard_pages=0 --hpx:ini=hpx.parcel.mpi.priority=1000 --hpx:ini=hpx.parcel.mpi.zero_copy_serialization_threshold=4096 --hpx:threads=128 --hpx:ini=hpx.agas.use_caching=0 --hpx:ini=hpx.parcel.zero_copy_receive_optimization=1

Does this also happen with the TCP parcel port being enabled? Also, could you provide a stack backtrace to the point where the exception is thrown?

Yes, it also happens with the TCP parcelport enabled. HPX did not give me a stack backtrace after the exception. It seems the program just hung and was then killed by Slurm.

I tried running the program on two nodes with --hpx:debug-clp. Here is a snippet of the log output (lines from the two processes are interleaved):

SLURM nodelist found (SLURM_STEP_NODELIST): exp-4-[08-09]
SLURM nodelist found (SLURM_STEP_NODELIST): exp-4-[08-09]
batch_name: SLURM
batch_name: SLURM
num_threads: 128
node_num_: 1
num_threads: 128
num_localities: 2
got node list
node_num_: 0
num_localities: 2
extracted: 'exp-4-08'
got node list
extracted: 'exp-4-08'
incrementing agas_node_num
extracted: 'exp-4-09'
incrementing agas_node_num
extracted: 'exp-4-09'
incrementing agas_node_num
incrementing agas_node_num
using AGAS host: 'exp-4-08' (node number 0)
Nodes from nodelist:
using AGAS host: 'exp-4-08' (node number 0)
Nodes from nodelist:
exp-4-08: 1 (10.21.4.8:0)
exp-4-09: 1 (10.21.4.9:0)
exp-4-08: 1 (198.202.103.67:0)
agas host_name: exp-4-08
exp-4-09: 1 (198.202.103.66:0)
agas host_name: exp-4-08
asio host_name: exp-4-09
asio host_name: exp-4-08
host_name: exp-4-09
host_name: exp-4-08
resolved: 'exp-4-09' to: 198.202.103.66
resolved: 'exp-4-08' to: 198.202.103.67
resolved: 'exp-4-08' to: 198.202.103.67
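The log shows each process expanding the Slurm nodelist into host names and then resolving them. To check from a given node which of those names fail to resolve, independently of HPX, something like the sketch below could be run inside the allocation. It is an illustration under stated assumptions: `expand_nodelist` and `try_resolve` are hypothetical helper names, and the expansion only handles the simple `prefix[a-b,c]` form seen in this log, not full Slurm nodelist syntax (`scontrol show hostnames` does the complete expansion).

```python
import os
import re
import socket


def expand_nodelist(nodelist):
    """Expand a simple nodelist such as 'exp-4-[08-09]' into host names.

    Only a single 'prefix[...]' group with comma-separated numbers and
    ranges is handled; anything else is returned unchanged as one name.
    """
    m = re.fullmatch(r"([^\[\],]+)\[([0-9,\-]+)\]", nodelist)
    if m is None:
        return [nodelist]
    prefix, body = m.groups()
    hosts = []
    for part in body.split(","):
        lo, _, hi = part.partition("-")
        hi = hi or lo
        width = len(lo)  # keep zero padding, e.g. '08'
        for i in range(int(lo), int(hi) + 1):
            hosts.append(f"{prefix}{i:0{width}d}")
    return hosts


def try_resolve(host):
    """Return the sorted addresses for host, or an empty list on failure."""
    try:
        infos = socket.getaddrinfo(host, None)
    except OSError:  # socket.gaierror is a subclass of OSError
        return []
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # SLURM_STEP_NODELIST is the variable named in the log above.
    nodelist = os.environ.get("SLURM_STEP_NODELIST", "")
    for host in expand_nodelist(nodelist) if nodelist else []:
        addrs = try_resolve(host)
        print(host, "->", ", ".join(addrs) if addrs else "NOT RESOLVED")
```

Running this once per node would show whether the failing names are unresolvable everywhere or only on some nodes.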