ARMCI on MPI-RMA Implementation Notes
James Dinan <dinan@mcs.anl.gov>

===============================================================================
 Introduction
===============================================================================

This project provides a full, high-performance, portable implementation of the
ARMCI runtime system using MPI's remote memory access (RMA) functionality.

===============================================================================
 Installing Only ARMCI-MPI
===============================================================================

ARMCI-MPI uses autoconf and must be configured before compiling:

 $ ./configure

Many configure options are provided; run "configure --help" for details. After
configuring the source tree, the code can be built and installed by running:

 $ make && make install

The quality of MPI-RMA implementations varies. As of August 2011, the following
MPI implementations are known to work correctly with ARMCI-MPI:

 + MVAPICH2 1.6
 + MPICH2
 + Cray MPI on Cray XE6
 + IBM MPI on BG/P (set ARMCI_STRIDED_METHOD=IOV and ARMCI_IOV_METHOD=BATCHED)
 + OpenMPI 1.5.4 (set ARMCI_STRIDED_METHOD=IOV and ARMCI_IOV_METHOD=BATCHED)

The following MPI implementations are known to fail with ARMCI-MPI:

 - MVAPICH2 prior to 1.6

===============================================================================
 Installing Global Arrays with ARMCI-MPI
===============================================================================

ARMCI-MPI has been tested with GA 5.0.2. To build GA with ARMCI-MPI, rename
this directory to "armci" and substitute it for the "armci" directory in the GA
distribution. Configure and build GA as usual; no special flags are required.

===============================================================================
 The ARMCI-MPI Test Suite
===============================================================================

ARMCI-MPI includes a set of testing and benchmark programs located under tests/
and benchmarks/.
These programs can be compiled and run via:

 $ make check MPIEXEC="mpiexec -n 4"

The MPIEXEC variable is optional and is used to override the default MPI launch
command. If you only want to build the test suite, the following target can be
used:

 $ make checkprogs

===============================================================================
 ARMCI-MPI Errata
===============================================================================

Direct access to local buffers:

 * Because of MPI's semantics, you are not allowed to access shared memory
   directly; all access must go through put/get. Alternatively, you can use
   the new ARMCI_Access_begin/end() functions.

Progress semantics:

 * On some MPI implementations and networks you may need to enable implicit
   progress. In many cases this is done through an environment variable. For
   MPICH2, set MPICH_ASYNC_PROGRESS; for MVAPICH2, recompile with
   --enable-async-progress and set MPICH_ASYNC_PROGRESS; for MPICH2-BG, set
   DCMF_INTERRUPTS=1; etc.

===============================================================================
 Environment Variables:
===============================================================================

Boolean environment variables are enabled when set to a value beginning with
't', 'T', 'y', 'Y', or '1'; any other value is interpreted as false.

 -------------------
: Debugging Options :
 -------------------

ARMCI_VERBOSE (boolean)

  Enable extra status output from ARMCI-MPI.

ARMCI_DEBUG_ALLOC (boolean)

  Turn on extra shared allocation debugging.

ARMCI_FLUSH_BARRIERS (boolean)

  Enable/disable extra communication flushing in ARMCI_Barrier. Extra flushes
  are present to help make unsafe DLA safer.

 ---------------------
: Performance Options :
 ---------------------

ARMCI_CACHE_RANK_TRANSLATION (boolean)

  Create a table to more quickly translate between absolute and group ranks.
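
The rule for boolean environment variables given at the top of this section
can be sketched in shell as follows. This is only an illustration of the
documented rule, not ARMCI-MPI's own parsing code; the helper name armci_bool
is hypothetical. Note that, under this rule, values such as "on" or "enabled"
count as false because they do not begin with 't', 'T', 'y', 'Y', or '1'.

```shell
# Hypothetical helper (not part of ARMCI-MPI): prints "true" when the value
# begins with t, T, y, Y, or 1, and "false" for anything else.
armci_bool() {
    case "$1" in
        [tTyY1]*) echo true ;;
        *)        echo false ;;
    esac
}

# Try a few candidate values for a boolean variable such as ARMCI_VERBOSE.
for value in yes True 1 on 0 no; do
    echo "$value -> $(armci_bool "$value")"
done
```

How an unset variable is treated is not covered by the rule above, so the
sketch does not attempt to model it.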
 ----------------------
: Noncollective Groups :
 ----------------------

ARMCI_NONCOLLECTIVE_GROUPS (boolean)

  Enable noncollective ARMCI group formation; group creation is collective on
  the output group rather than the parent group.

 --------------------------
: Shared Buffer Protection :
 --------------------------

ARMCI_SHR_BUF_METHOD = { COPY (default), NOGUARD }

  ARMCI policy for managing shared origin buffers in communication operations:
  lock the buffer (unsafe, but fast), copy the buffer (safe), or don't guard
  the buffer - assume that the system is cache coherent and MPI supports
  unlocked load/store.

 --------------------
: I/O Vector Options :
 --------------------

ARMCI_IOV_METHOD = { AUTO (default), CONSRV, BATCHED, DIRECT }

  Select the IO vector communication strategy: automatic; a "conservative"
  implementation that does lock/unlock around each operation; an
  implementation that issues batches of operations within a single lock/unlock
  epoch; and a direct implementation that generates datatypes for the origin
  and target and issues a single operation using them.

ARMCI_IOV_CHECKS (boolean)

  Enable (expensive) IOV safety/debugging checks (not recommended for
  performance runs).

ARMCI_IOV_BATCHED_LIMIT = { 0 (default), 1, ... }

  Set the maximum number of one-sided operations per epoch for the BATCHED IOV
  method. Zero (default) is unlimited.

 -----------------
: Strided Options :
 -----------------

ARMCI_STRIDED_METHOD = { DIRECT (default), IOV }

  Select the method for processing strided operations.
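
As an illustration of how these variables combine, the following sketch sets
the values that the installation notes list for OpenMPI 1.5.4 and IBM MPI on
BG/P. The launch line is hypothetical: ./my_app is a placeholder program name,
so that line is left commented out. Whether exported variables reach every
rank depends on the MPI launcher in use.

```shell
# Settings listed in these notes for OpenMPI 1.5.4 and IBM MPI on BG/P.
export ARMCI_STRIDED_METHOD=IOV
export ARMCI_IOV_METHOD=BATCHED

# Optional: cap the number of one-sided operations per epoch for the BATCHED
# method. Zero is the documented default and means unlimited.
export ARMCI_IOV_BATCHED_LIMIT=0

# Hypothetical launch line; ./my_app is a placeholder, not a real program.
# mpiexec -n 4 ./my_app

# Show the ARMCI settings currently present in the environment.
env | grep '^ARMCI_' | sort
```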