minsii/armci-mpi

                    ARMCI on MPI-RMA Implementation Notes
                       James Dinan <dinan@mcs.anl.gov>

===============================================================================
Introduction
===============================================================================

This project provides a full, high performance, portable implementation of the
ARMCI runtime system using MPI's remote memory access (RMA) functionality.

===============================================================================
Installing Only ARMCI-MPI
===============================================================================

ARMCI-MPI uses autoconf and must be configured before compiling:

 $ ./configure

Many configure options are provided, run "configure --help" for details.  After
configuring the source tree, the code can be built and installed by running:

 $ make && make install

The quality of MPI-RMA implementations varies.  As of August, 2011 the
following MPI implementations are known to work correctly with ARMCI-MPI:

 + MVAPICH2 1.6
 + MPICH2
 + Cray MPI on Cray XE6
 + IBM MPI on BG/P (set ARMCI_STRIDED_METHOD=IOV and ARMCI_IOV_METHOD=BATCHED)
 + OpenMPI 1.5.4 (set ARMCI_STRIDED_METHOD=IOV and ARMCI_IOV_METHOD=BATCHED)

The following MPI implementations are known to fail with ARMCI-MPI:

 - MVAPICH2 prior to 1.6

===============================================================================
Installing Global Arrays with ARMCI-MPI
===============================================================================

ARMCI-MPI has been tested with GA 5.0.2.  To build GA with ARMCI-MPI, rename
this directory to "armci" and substitute it for the "armci" directory in the GA
distribution.  Configure and build GA as usual; no special flags are required.

===============================================================================
The ARMCI-MPI Test Suite
===============================================================================

ARMCI-MPI includes a set of testing and benchmark programs located under tests/
and benchmarks/.  These programs can be compiled and run via:

$ make check MPIEXEC="mpiexec -n 4"

The MPIEXEC variable is optional and is used to override the default MPI launch
command.  If you want only to build the test suite, the following target can be
used:

$ make checkprogs

===============================================================================
ARMCI-MPI Errata
===============================================================================

Direct access to local buffers:

 * Because of MPI's semantics, you are not allowed to access shared memory
   directly, it must be through put/get.  Alternatively you can use the 
   new ARMCI_Access_begin/end() functions.
   
Progress semantics:

 * On some MPI implementations and networks you may need to enable implicit
   progress.  In many cases this is done through an environment variable.  For
   MPICH2: set MPICH_ASYNC_PROGRESS; for MVAPICH2 recompile with
   --enable-async-progress and set MPICH_ASYNC_PROGRESS; set DCMF_INTERRUPTS=1
   for MPICH2-BG; etc.

===============================================================================
Environment Variables:
===============================================================================

Boolean environment variables are enabled when set to a value beginning with
't', 'T', 'y', 'Y', or '1'; any other value is interpreted as false.

 -------------------
: Debugging Options :
 -------------------

ARMCI_VERBOSE (boolean)

  Enable extra status output from ARMCI-MPI.

ARMCI_DEBUG_ALLOC (boolean)

  Turn on extra shared allocation debugging.

ARMCI_FLUSH_BARRIERS (boolean)

  Enable/disable extra communication flushing in ARMCI_Barrier.  Extra flushes
  are present to help make unsafe DLA safer.

 ---------------------
: Performance Options :
 ---------------------

ARMCI_CACHE_RANK_TRANSLATION (boolean)

  Create a table to more quickly translate between absolute and group ranks.

 --------------------------
: Noncollective Groups     :
 --------------------------

ARMCI_NONCOLLECTIVE_GROUPS (boolean)

  Enable noncollective ARMCI group formation; group creation is collective on
  the output group rather than the parent group.

 --------------------------
: Shared Buffer Protection :
 --------------------------

ARMCI_SHR_BUF_METHOD = { COPY (default), NOGUARD }

  ARMCI policy for managing shared origin buffers in communication operations:
  lock the buffer (unsafe, but fast), copy the buffer (safe), or don't guard
  the buffer - assume that the system is cache coherent and MPI supports
  unlocked load/store.

 --------------------
: I/O Vector Options :
 --------------------

ARMCI_IOV_METHOD = { AUTO (default), CONSRV, BATCHED, DIRECT }

  Select the IO vector communication strategy: automatic; a "conservative"
  implementation that does lock/unlock around each operation; an implementation
  that issues batches of operations within a single lock/unlock epoch; and a
  direct implementation that generates datatypes for the origin and target and
  issues a single operation using them.

ARMCI_IOV_CHECKS (boolean)

  Enable (expensive) IOV safety/debugging checks (not recommended for
  performance runs).

ARMCI_IOV_BATCHED_LIMIT = { 0 (default), 1, ... }

  Set the maximum number of one-sided operations per epoch for the BATCHED IOV
  method.  Zero (default) is unlimited.
  
 -----------------
: Strided Options :
 -----------------

ARMCI_STRIDED_METHOD = { DIRECT (default), IOV }

  Select the method for processing strided operations.
minsii / armci-mpi

About

Languages