ct-clmsn / zmq-collectives-rs

SPMD (HPC) collective communication algorithms for Rust using zeromq


This library implements an SPMD (single program, multiple data) model and collective communication algorithms (Robert van de Geijn's Binomial Tree) in Rust using 0MQ. Each collective operation over N compute hosts completes in on the order of log2(N) communication steps.
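To make the log2(N) behavior concrete, here is a minimal sketch of one common formulation of a binomial-tree broadcast rooted at rank 0. It only computes which rank would send to which rank in each round; no networking is involved. The function name `binomial_broadcast_schedule` is hypothetical and is not part of this library's API, and the library's actual schedule may differ.

```rust
/// Returns, for each round, the (sender, receiver) pairs of a binomial-tree
/// broadcast rooted at rank 0 over `nranks` ranks.
/// Hypothetical helper for illustration; not the library's API.
fn binomial_broadcast_schedule(nranks: usize) -> Vec<Vec<(usize, usize)>> {
    let mut rounds = Vec::new();
    // `stride` is the number of ranks that already hold the data.
    let mut stride: usize = 1;
    while stride < nranks {
        let mut sends = Vec::new();
        // Every rank that already holds the data forwards it to the rank
        // `stride` positions away, if that rank exists.
        for sender in 0..stride {
            let receiver = sender + stride;
            if receiver < nranks {
                sends.push((sender, receiver));
            }
        }
        rounds.push(sends);
        // The set of ranks holding the data doubles every round.
        stride *= 2;
    }
    rounds
}
```

Because the number of ranks holding the data doubles each round, 8 ranks need 3 rounds and 5 ranks also need 3 rounds (the ceiling of log2(5)).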

Collective communication algorithms are used in HPC (high performance computing) / Supercomputing libraries and runtime systems such as MPI and OpenSHMEM.

Documentation for this library can be found on its wiki.

Algorithms Implemented

  • Broadcast
  • Reduction
  • Scatter
  • Gather
  • Barrier
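As an illustration of how a reduction can follow the same tree shape, the sketch below simulates a binomial-tree sum reduction toward rank 0 entirely in memory, with one vector slot standing in for each rank's local value. The name `binomial_reduce_sum` is hypothetical, and this is an assumption about the general technique rather than a copy of the library's implementation.

```rust
/// Simulates a binomial-tree sum reduction toward rank 0.
/// `values[r]` stands in for the local value held by rank `r`.
/// Returns the reduced value and the number of rounds used, or `None`
/// when there are no ranks. Hypothetical helper; not the library's API.
fn binomial_reduce_sum(mut values: Vec<i64>) -> Option<(i64, usize)> {
    let n = values.len();
    if n == 0 {
        return None;
    }
    let mut rounds: usize = 0;
    let mut stride: usize = 1;
    while stride < n {
        // In this round, each rank that is a multiple of 2 * stride receives
        // the partial sum from the rank `stride` positions above it.
        let mut receiver = 0;
        while receiver + stride < n {
            values[receiver] += values[receiver + stride];
            receiver += 2 * stride;
        }
        stride *= 2;
        rounds += 1;
    }
    Some((values[0], rounds))
}
```

In a distributed run each addition would be preceded by a message from the sending rank; the round count is what gives the logarithmic cost.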

Configuring Distributed Program Execution

This library requires the use of environment variables to configure distributed runs of SPMD applications. Each of the following environment variables needs to be supplied to correctly run programs:

  • ZMQ_COLLECTIVES_NRANKS
  • ZMQ_COLLECTIVES_RANK
  • ZMQ_COLLECTIVES_ADDRESSES

ZMQ_COLLECTIVES_NRANKS - unsigned integer value indicating how many processes (instances or copies of the program) are running.

ZMQ_COLLECTIVES_RANK - unsigned integer value identifying which process instance this program represents. This is analogous to a user-provided thread id. The value must be at least 0 and strictly less than ZMQ_COLLECTIVES_NRANKS.

ZMQ_COLLECTIVES_ADDRESSES - should contain a ','-delimited list of IP addresses and ports. The list length should equal the integer value of ZMQ_COLLECTIVES_NRANKS. An example for a 2-rank application named app is below:

ZMQ_COLLECTIVES_NRANKS=2 ZMQ_COLLECTIVES_RANK=0 ZMQ_COLLECTIVES_ADDRESSES=127.0.0.1:5555,127.0.0.1:5556 ./app

ZMQ_COLLECTIVES_NRANKS=2 ZMQ_COLLECTIVES_RANK=1 ZMQ_COLLECTIVES_ADDRESSES=127.0.0.1:5555,127.0.0.1:5556 ./app

In this example, Rank 0 maps to 127.0.0.1:5555 and Rank 1 maps to 127.0.0.1:5556.
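The constraints described above (rank less than the rank count, one address per rank) can be checked before any sockets are opened. The following is a minimal sketch of such a check using only the standard library. The names `RunConfig`, `parse_config`, and `config_from_env` are hypothetical and are not taken from this library; how the library itself parses these variables may differ.

```rust
use std::env;

/// Hypothetical container for the three settings; not the library's type.
#[derive(Debug, PartialEq)]
struct RunConfig {
    nranks: usize,
    rank: usize,
    addresses: Vec<String>,
}

/// Validates the raw values of the three environment variables.
fn parse_config(nranks: &str, rank: &str, addresses: &str) -> Result<RunConfig, String> {
    let nranks = nranks
        .trim()
        .parse::<usize>()
        .map_err(|e| format!("invalid ZMQ_COLLECTIVES_NRANKS: {}", e))?;
    let rank = rank
        .trim()
        .parse::<usize>()
        .map_err(|e| format!("invalid ZMQ_COLLECTIVES_RANK: {}", e))?;
    // The rank must identify one of the nranks processes.
    if rank >= nranks {
        return Err(format!("rank {} must be less than nranks {}", rank, nranks));
    }
    // Rank i maps to the i-th entry of the comma-delimited list.
    let addresses: Vec<String> = addresses
        .split(',')
        .map(|a| a.trim().to_string())
        .collect();
    if addresses.len() != nranks {
        return Err(format!(
            "expected {} addresses, found {}",
            nranks,
            addresses.len()
        ));
    }
    Ok(RunConfig { nranks, rank, addresses })
}

/// Reads the three variables from the process environment.
fn config_from_env() -> Result<RunConfig, String> {
    let get = |name: &str| env::var(name).map_err(|e| format!("{}: {}", name, e));
    parse_config(
        &get("ZMQ_COLLECTIVES_NRANKS")?,
        &get("ZMQ_COLLECTIVES_RANK")?,
        &get("ZMQ_COLLECTIVES_ADDRESSES")?,
    )
}
```

With the values from the second command line above, `parse_config` yields rank 1 and `addresses[1]` is 127.0.0.1:5556, matching the mapping described in the example.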

HPC batch scheduling systems like Slurm, TORQUE, PBS, etc. provide mechanisms to automatically define these environment variables when jobs are submitted.
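As one possible way to bridge a scheduler's variables to the ones this library expects, the sketch below maps Slurm's per-task variables SLURM_PROCID and SLURM_NTASKS onto ZMQ_COLLECTIVES_RANK and ZMQ_COLLECTIVES_NRANKS. This mapping is an assumption made for illustration and is not described by this library; the function name `rank_vars_from_slurm` is hypothetical, and ZMQ_COLLECTIVES_ADDRESSES still has to be supplied separately because it depends on the hosts and ports chosen for the job.

```rust
/// Derives the rank and rank-count settings from Slurm's task variables.
/// `lookup` abstracts the environment so the mapping can be exercised
/// without a scheduler. Hypothetical helper; not the library's API.
fn rank_vars_from_slurm<F>(lookup: F) -> Option<Vec<(String, String)>>
where
    F: Fn(&str) -> Option<String>,
{
    // SLURM_PROCID is the task's index within the job step.
    let rank = lookup("SLURM_PROCID")?;
    // SLURM_NTASKS is the total number of tasks.
    let nranks = lookup("SLURM_NTASKS")?;
    Some(vec![
        ("ZMQ_COLLECTIVES_RANK".to_string(), rank),
        ("ZMQ_COLLECTIVES_NRANKS".to_string(), nranks),
    ])
}
```

Inside a job step the lookup could be `|name| std::env::var(name).ok()`, and the returned pairs could then be exported before the SPMD program starts.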

Notes

0MQ uses sockets (which are file descriptors) to handle communication and asynchrony control. GNU/Linux enforces a configurable default limit (~2063) on the number of file descriptors/sockets a user process is allowed to open during execution; the limit in effect for a shell can be checked with ulimit -n. The TcpBackend uses 2 file descriptors/sockets. In 0MQ terms these sockets are of type ZMQ_ROUTER.
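A program can inspect its own file descriptor limit at startup. The sketch below assumes a Linux system where /proc/self/limits is available and contains a "Max open files" row whose first column is the soft limit. The function names are hypothetical, and this check is not part of the library.

```rust
use std::fs;

/// Extracts the soft "Max open files" limit from text in the layout of
/// /proc/self/limits. Returns `None` if the row is missing or the value
/// is not a number (for example "unlimited").
fn parse_soft_open_files_limit(limits: &str) -> Option<u64> {
    limits
        .lines()
        // Keep only the remainder of the "Max open files" row.
        .find_map(|line| line.strip_prefix("Max open files"))
        // The first whitespace-separated column is the soft limit.
        .and_then(|rest| rest.split_whitespace().next())
        .and_then(|soft| soft.parse::<u64>().ok())
}

/// Reads the current process's soft limit (Linux only).
fn soft_open_files_limit() -> Option<u64> {
    let text = fs::read_to_string("/proc/self/limits").ok()?;
    parse_soft_open_files_limit(&text)
}
```

Comparing the returned value against the number of sockets a process expects to open gives an early warning before 0MQ fails to create a socket.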

TCP is a "chatty" protocol: it requires round trips between clients and servers during data transmission to ensure data is communicated correctly. This makes it less than ideal for jobs requiring high performance. However, TCP is provided in 0MQ, is universally accessible (it is a commodity protocol), and makes for a reasonable place to plant a flag for providing an implementation.

This library requires libzmq. LD_LIBRARY_PATH and PKG_CONFIG_PATH need to point to the directories in which libzmq is installed. As an example, let's say a user has installed libzmq into a directory named by the environment variable:

$LIBZMQ_INSTALL_PREFIX_PATH

libzmq.a or libzmq.so would be installed in the directory: $LIBZMQ_INSTALL_PREFIX_PATH/lib

libzmq.pc can be found in the directory: $LIBZMQ_INSTALL_PREFIX_PATH/lib/pkgconfig

License

Boost Software License, Version 1.0

Date

03MAY2021

Author

Christopher Taylor
