gossip: Efficient Communication Primitives for Multi-GPU Systems

Gossip supports scatter, gather, and all-to-all communication. To execute one of the communication primitives, a transfer plan is needed. Use the provided scripts to generate optimized plans for your specific NVLink topology. The plans directory contains optimized plans for typical 4-GPU configurations (P100 and V100) as well as for the 8-GPU DGX-1 Volta. If no transfer plan is provided, gossip falls back to the default strategy of direct transfers between GPUs.

Gossip was presented at ICPP '19.

Using gossip

To use gossip, clone this repository and check out the submodule hpc_helpers by calling git submodule update --init include/hpc_helpers. Include the header gossip.cuh in your project; it provides all communication primitives. To parse transfer plans, use the plan parser, which can be compiled as a separate unit as in the example Makefile.
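
For orientation, here is a minimal sketch of the intended call sequence, modeled on the example execute.cu. The identifiers below (parse_plan, gossip::context_t, gossip::all2all_t, execAsync) follow the example code, but the exact signatures may differ between versions, so treat this as an assumption and consult execute.cu for the authoritative usage:

    #include "include/gossip.cuh"        // all communication primitives
    #include "include/plan_parser.hpp"   // plan parser (compiled separately)

    int main() {
        const int num_gpus = 4;

        // Parse an optimized transfer plan for your NVLink topology
        // (path is illustrative; see the plans directory).
        auto transfer_plan = parse_plan("plans/all2all_plan.json");

        // The context manages the participating devices and their streams.
        gossip::context_t context(num_gpus);

        // Build the all-to-all primitive from the context and the plan.
        gossip::all2all_t all2all(context, transfer_plan);

        // ... allocate per-GPU source, destination, and temporary buffers,
        // run the multisplit, then launch the transfers, e.g.:
        // all2all.execAsync(srcs, srcs_lens, dsts, dsts_lens,
        //                   bufs, bufs_lens, table);

        context.sync_all_streams();  // wait for all transfers to finish
    }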

Examples

The example execute.cu executes gossip's communication primitives on uniformly distributed random numbers. The data is first split into a number of chunks corresponding to the number of GPUs (multisplit). The chunk sizes are displayed as a partition table (row = source GPU, column = target GPU). Then the data is transferred between the GPUs according to the provided transfer plan. Finally, the example validates that all data reached the correct destination.
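
As a conceptual illustration of the partition table (plain C++, not gossip code): entry table[src][dst] holds the number of elements that GPU src must send to GPU dst after the multisplit; with uniformly distributed keys, each row is roughly balanced:

    #include <array>
    #include <cstddef>
    #include <cstdio>

    int main() {
        constexpr int num_gpus = 4;
        constexpr std::size_t elems_per_gpu = 1u << 20;

        // table[src][dst]: elements GPU `src` sends to GPU `dst`.
        std::array<std::array<std::size_t, num_gpus>, num_gpus> table{};

        // Uniform keys split roughly evenly across the target GPUs.
        for (int src = 0; src < num_gpus; ++src)
            for (int dst = 0; dst < num_gpus; ++dst)
                table[src][dst] = elems_per_gpu / num_gpus;

        // Print the table (row = source GPU, column = target GPU).
        for (int src = 0; src < num_gpus; ++src) {
            for (int dst = 0; dst < num_gpus; ++dst)
                std::printf("%8zu", table[src][dst]);
            std::printf("\n");
        }
    }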

The example simulate.cu runs the multi-GPU example above in simulation on a single GPU.

Build example

Compile the example using the provided Makefile by calling git submodule update --init && make.

Requirements:

  • CUDA >= 9.2
  • GNU g++ >= 5.5 compatible with your CUDA version
  • Python >= 3.0 including
    • Matplotlib
    • NumPy

Run example

./execute (all2all|all2all_async) <transfer plan> [--size <size>] [--memory-factor <factor>]

./execute scatter_gather <scatter plan> <gather plan> [--size <size>] [--memory-factor <factor>]

Use ./simulate instead of ./execute if you want to simulate the example on a single GPU.

Mandatory:

  • Choose all2all (double-buffered), all2all_async, or scatter_gather mode
  • Provide path(s) to transfer plan(s) (one for all2all, two for scatter+gather)

Optional:

  • Choose the data size: 2^<size> 64-bit elements per GPU (default: 28, i.e. 2^28 × 8 B = 2 GiB per GPU)
  • Choose the memory factor, which over-allocates buffers to account for random transfer sizes (default: 1.5)
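
For example, an all-to-all with 2^27 elements per GPU (the plan path is hypothetical; pick a plan from the plans directory that matches your topology):

./execute all2all plans/all2all_plan.json --size 27 --memory-factor 1.5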

Benchmark

For benchmark scripts and results see the benchmark directory.

License

MIT License

