ronojoy / nbody-simulation

Parallel N-Body Simulation with Barnes-Hut Approximation


The N-body problem is fundamental to many scientific fields and is concerned with the simulation of a system of particles under the influence of forces over time. Performing such simulations with a large number of particles, as is typical in many real-world problems, requires enormous computational time. They can, however, be significantly sped up by parallelizing the problem and by using approximation algorithms to reduce the number of computations involved, making such simulations viable for practical purposes.

This repository contains code that demonstrates the effectiveness of parallelization using MPI and the Barnes-Hut algorithm as an approximation technique on the N-body problem.

Introduction

The N-body problem is fundamental to many scientific fields of study and is concerned with the simulation of a system of particles under the influence of forces over time. The nature of these forces depends on the problem being solved, with typical examples being the gravitational forces experienced by a system of astronomical bodies, or the electrostatic Coulomb forces experienced by electrically charged particles such as protons and electrons in an atom, or atoms in a molecule. Many real-world phenomena are modeled and simulated as N-body problems: the former is an example of its application in astrophysics for simulating planetary systems or stars in a galaxy, and the latter in molecular dynamics for problems such as protein folding. Other interesting applications include vortex flows in fluid dynamics and radiosity computation in computer graphics.

While solutions to 1-body problems are trivial, 2-body problems can be expressed as differential equations and solved using Kepler's laws of planetary motion. For problems involving three or more bodies, the solutions are either too complex or no closed-form solution is known. As a result, such systems are modeled and studied using computational methods and simulations. Simulations carried out on systems with a large number of bodies over long periods of time involve a huge number of repetitive calculations, and they must complete within reasonable time frames for the results to be useful. This paper is concerned with the achievable speedups on the N-body problem. It begins with a more formal definition of the gravitational N-body problem, followed by an analysis of its sequential complexity. Various techniques for improving this complexity are discussed next, with the main focus on the Barnes-Hut algorithm. This is followed by a discussion of parallelizing the N-body problem. Finally, the various experiments carried out are presented and the results are discussed at length.

Defining the Problem

The gravitational N-body problem begins with a set of $N$ particles with initial positions $P_i = (x_i, y_i, z_i)$ in three-dimensional space, where $i \in \{1, 2, 3, \ldots, N\}$, moving with initial velocities $V_i = (v_{ix}, v_{iy}, v_{iz})$. Each of these particles is under the influence of a gravitational force $F_i = (f_{ix}, f_{iy}, f_{iz})$ due to their masses $m_i$, according to Newton's inverse-square law of gravitation. The component of the force in the $x$ direction is given by:

$$f_{ix} = G \sum_{j \neq i} \frac{m_i m_j}{d^2(i,j)} \cdot \frac{x_j - x_i}{d(i,j)}$$

where $G$ is the gravitational constant and $d(i,j)$ is the distance between particles $i$ and $j$. Similar components of force apply in the $y$ and $z$ directions. The acceleration of a particle, $A_i = (a_{ix}, a_{iy}, a_{iz})$, as a result of the force experienced in the $x$ direction is given by:

$$a_{ix} = \frac{f_{ix}}{m_i}$$

This results in a new velocity $v'_{ix}$ after time $\delta t$:

$$v'_{ix} = v_{ix} + a_{ix}\,\delta t$$

The N-body simulation then continues for a predetermined time period $T$, integrating in discrete time steps of $\delta t$. In order to achieve accurate results, the time interval between integration steps must be short.
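As a concrete illustration, the following C sketch advances a single particle by one time step using these equations. The Particle struct and function name are illustrative assumptions of this sketch, not code from this repository; the position update (not written out above) follows the same explicit integration pattern.

/* One explicit integration step for a single particle, following the
   equations above: a = f/m, then v' = v + a*dt, then the position is
   advanced with the new velocity. (Illustrative sketch only.) */
typedef struct {
    double m;              /* mass */
    double x, y, z;        /* position */
    double vx, vy, vz;     /* velocity */
    double fx, fy, fz;     /* force accumulated during this step */
} Particle;

void euler_step(Particle *p, double dt) {
    double ax = p->fx / p->m;  /* a = f/m, per component */
    double ay = p->fy / p->m;
    double az = p->fz / p->m;
    p->vx += ax * dt;          /* v' = v + a*dt */
    p->vy += ay * dt;
    p->vz += az * dt;
    p->x += p->vx * dt;        /* advance position with the new velocity */
    p->y += p->vy * dt;
    p->z += p->vz * dt;
}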

A Naive Algorithm

A straightforward algorithm for the N-body problem computes, at each time step, the forces experienced by each of the $N$ bodies as a result of the gravitational attraction of the other $N-1$ bodies. The complexity of this algorithm for one iteration is $\Theta(N^2)$. Further, since there are $N$ bodies in the system, the spatial (memory) complexity of the algorithm is $\Theta(N)$. For practical applications, however, where $N$ is large, the quadratic time complexity quickly becomes intractable and better approaches are required.

for i ← 0 to N − 1 do
  for j ← 0 to N − 1 do
    if i = j then
      continue
    end if
    compute_force_on_i()
  end for
  compute_velocity_on_i()
  update_position_on_i()
end for
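Translated into C, the naive loop might look as follows. This is a hedged sketch reusing the Particle struct from the earlier snippet; the value of G and the absence of a softening term are assumptions of this sketch, not repository code.

#include <math.h>

#define G 6.674e-11  /* gravitational constant (SI units; an assumption) */

/* Naive O(N^2) force accumulation: every other particle j contributes
   to the force on particle i according to the inverse-square law. */
void compute_forces(Particle *p, int n) {
    for (int i = 0; i < n; i++) {
        p[i].fx = p[i].fy = p[i].fz = 0.0;
        for (int j = 0; j < n; j++) {
            if (i == j)
                continue;
            double dx = p[j].x - p[i].x;
            double dy = p[j].y - p[i].y;
            double dz = p[j].z - p[i].z;
            double d = sqrt(dx * dx + dy * dy + dz * dz);
            double f = G * p[i].m * p[j].m / (d * d);
            p[i].fx += f * dx / d;  /* component form of the force equation */
            p[i].fy += f * dy / d;
            p[i].fz += f * dz / d;
        }
    }
    /* velocities and positions are then updated as in euler_step() above */
}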

Parallelizing the N-Body Problem

Parallelizing the N-body simulation on a cluster can give significant speedups, depending on the number of processors and the interconnection network used. An intuitive approach to parallelization is to partition the problem into $P$ groups, where $P$ is the number of processors, each processor being responsible for $N/P$ particles. Each processor computes the forces experienced by the particles in its group and updates their positions. At the end of each iteration, the processors exchange the new positions of their particles with the other processors, so that all processors have a complete, updated view of the system. While parallel execution can speed up the simulation, it does not improve the complexity of the algorithm; for large problem sizes this can still be a problem.
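One way to realize this position exchange with MPI is an all-gather over the position array, as in the sketch below. Packing positions as three doubles per particle and assuming that $P$ divides $N$ evenly are choices of this sketch, not necessarily of the repository.

#include <mpi.h>

/* Each rank updates its own N/P particles, then gathers the updated
   positions of everyone else's particles, so that all ranks share a
   complete view of the system before the next iteration. */
void exchange_positions(double *positions, int n, MPI_Comm comm) {
    int nprocs;
    MPI_Comm_size(comm, &nprocs);
    int local = n / nprocs;  /* particles per rank; assumes P divides N */
    /* MPI_IN_PLACE: each rank's own slice of `positions` already holds
       its updated coordinates; the all-gather fills in the rest. */
    MPI_Allgather(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                  positions, 3 * local, MPI_DOUBLE, comm);
}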

Improving time complexity

The algorithm described so far is a simple particle-particle approach in which the mutual forces between every pair of particles in the system are computed. A number of other interesting algorithms for N-body simulations exist that can give an improved time complexity. Some of the popular methods use a divide-and-conquer approach in which the system is spatially decomposed into a hierarchy of clusters in the form of a tree, with each node describing the mass and cubic volume of a cluster in the system. These algorithms are also referred to as hierarchical or tree (code) methods. Two widely used hierarchical algorithms in real-world applications are the Barnes-Hut algorithm and Greengard's Fast Multipole Method.

Both algorithms give significant improvements, with Barnes-Hut achieving $O(N \log_2 N)$ time complexity and the Fast Multipole Method (FMM) achieving $O(N)$ complexity for uniform distributions. An efficient implementation of the FMM, however, is significantly more complex for a three-dimensional system and in general requires a larger number of iterations before it shows comparatively better speedups. The code presented here is primarily concerned with achievable speedups through parallelization, and thus focuses on the Barnes-Hut algorithm for its simplicity, its effectiveness in reducing the overall running time, and its wider usage in real-world applications.

The Barnes-Hut algorithm

The use of the Barnes-Hut algorithm for force calculation in the N-body problem proceeds in two phases. The first phase constructs a hierarchical octree representing the three-dimensional space by recursively subdividing the root cell (containing all particles in the system) into eight cubic subcells of equal size, and then recursing on each of the subcells until every subcell contains at most one particle. The leaves of the tree represent cells containing one particle, while each parent node stores the total mass and center of mass of the subtree below it.
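A possible shape for such a tree node in C is sketched below; the field names are assumptions for illustration, not the repository's actual data structure.

/* Sketch of a Barnes-Hut octree cell. Internal nodes carry the total
   mass and center of mass of all particles in the subtree below them;
   leaves hold a single particle. */
typedef struct Cell {
    double cx, cy, cz;        /* geometric center of the cell */
    double half;              /* half the side length of the cell */
    double mass;              /* total mass of contained particles */
    double comx, comy, comz;  /* center of mass */
    struct Cell *child[8];    /* subcells; NULL where empty */
    Particle *body;           /* the particle, if this cell is a leaf */
} Cell;

/* Index of the subcell (octant) that contains point (x, y, z). */
int octant(const Cell *c, double x, double y, double z) {
    return (x > c->cx) | ((y > c->cy) << 1) | ((z > c->cz) << 2);
}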

The second phase computes the forces experienced by each of the particles by traversing the tree. This phase exploits the fact that the interaction between a particle and a distant cluster of particles can be approximated by a single combined force between the particle and the cluster as a whole. The calculation then reduces to using the total mass and center of mass of the cluster.

The force calculation for each particle starts at the root cell. If the distance between the particle and the center of mass of the cluster is $D$, and the side length of the cell is $l$, the interaction between the particle and the cell (cluster) as a whole is calculated if $l/D < \theta$, where $\theta$ is a constant accuracy parameter, typically $\leq 1.0$. If not, the current cell is resolved into its eight subcells, and each of them is examined recursively.
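The traversal can be sketched in C as below, building on the Particle and Cell structures and the G constant from the earlier snippets; the value of THETA and the exact distance handling are illustrative assumptions.

#include <math.h>

#define THETA 0.5  /* accuracy parameter; typical values are <= 1.0 */

/* Recursive Barnes-Hut force walk for one particle. A sufficiently
   distant cell (l/D < theta) is treated as a single body located at
   its center of mass; otherwise its eight subcells are examined. */
void tree_force(const Cell *c, Particle *p) {
    if (c == NULL || c->body == p)
        return;  /* empty subcell, or the particle itself */
    double dx = c->comx - p->x;
    double dy = c->comy - p->y;
    double dz = c->comz - p->z;
    double d = sqrt(dx * dx + dy * dy + dz * dz);
    double l = 2.0 * c->half;  /* side length of the cell */
    if (c->body != NULL || l / d < THETA) {
        /* leaf, or far enough away: one interaction with the whole cell */
        double f = G * p->m * c->mass / (d * d);
        p->fx += f * dx / d;
        p->fy += f * dy / d;
        p->fz += f * dz / d;
    } else {
        for (int k = 0; k < 8; k++)
            tree_force(c->child[k], p);  /* resolve into subcells */
    }
}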

Once the computation of forces experienced by all particles is complete, the N-body algorithm proceeds to compute the new velocities and positions, discarding the tree. A new tree is then generated for each subsequent iteration. In the parallel N-body algorithm described in this paper, where each processor is responsible for $N/P$ particles, the tree for the entire system is generated on each processor; the interactions, however, are computed only for the particles belonging to each processor. Constructing the tree requires $O(N \log_2 N)$ time, and computing all the forces using the tree takes similar time, depending on the value of $\theta$. The overall time complexity of the Barnes-Hut algorithm is thus $O(N \log_2 N)$.

Strategies for optimal solutions

This project examines the use of the popular Barnes-Hut algorithm as a feasible alternative to the particle-particle method for large values of $N$, by generating the octree for the entire system on each of the processors in parallel, for every iteration. One problem with this approach is that an uneven distribution of particles in the system results in an unbalanced tree, and thus an imbalance in the load distribution, with some processors performing more calculations than others. Depending on the particle distribution this can have a significant impact on the overall running time. However, it is possible to further optimize the parallel computation time by employing a number of other strategies.

A well-known approach described by J. Salmon uses the method of orthogonal recursive bisection (ORB) to recursively decompose the system into sub-volumes, similar to the approach described above. However, the decomposition is chosen such that an equal number of particles lie on either side of each division. The decomposition into sub-volumes continues for as many levels as required, producing a well-balanced tree. Each processor is then assigned one or more sub-volumes, balancing the computational load on the cluster.
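A minimal sketch of such a bisection is shown below, reusing the Particle struct from earlier. For brevity it always splits on the x axis and sorts to find the median; a full ORB implementation would cycle through (or pick the longest) axis and use a linear-time median selection rather than a sort.

#include <stdlib.h>

static int cmp_x(const void *a, const void *b) {
    double d = ((const Particle *)a)->x - ((const Particle *)b)->x;
    return (d > 0) - (d < 0);
}

/* Recursively bisect particles[lo..hi) so that each half contains an
   equal number of particles, until groups are small enough to hand to
   a single processor. (Illustrative sketch, not repository code.) */
void orb_split(Particle *p, int lo, int hi, int max_group) {
    int n = hi - lo;
    if (n <= max_group)
        return;  /* group is small enough to assign to one processor */
    qsort(p + lo, n, sizeof(Particle), cmp_x);
    int mid = lo + n / 2;  /* equal particle counts on either side */
    orb_split(p, lo, mid, max_group);
    orb_split(p, mid, hi, max_group);
}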

A further optimization exploits the fact that particles move very small distances in each iteration (for small values of $\delta t$). By not discarding the entire tree in every iteration, and instead regenerating only those parts of the tree whose corresponding particles have moved outside their cells, significant computation time can be saved when $N$ is large.

A third optimization parallelizes the tree generation itself, such that each node of the cluster generates a partial tree containing the particles belonging to it. Once generated, the nodes in the cluster exchange their partial trees with each other, so that each node ends up with a complete tree representing the system.

Spatially bounded and unbounded systems

Simulations of N-body problems may assume a closed system, where the particles move within the boundaries of a spatially bounded volume. Alternatively, the system can be unbounded, allowing particles to move continually further apart. While the former is more useful in fields such as molecular and fluid dynamics, the latter better reflects simulations of celestial bodies. Closed systems may need to make certain assumptions about the collision of a particle with a boundary. This paper examines the more interesting case of spatially bounded systems, and assumes that collisions with the boundary are elastic (i.e., kinetic energy is conserved): the magnitude of the particle's momentum is preserved while its normal component is reversed. A collision with the boundary along the $x$ axis thus results in:

$$m_i v'_{ix} = -m_i v_{ix}$$

where $v'_{ix}$ is the velocity of the particle in the opposite direction.
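In code, the reflection amounts to flipping the corresponding velocity component whenever a particle crosses a wall. The cubic domain $[0, L]^3$ in this sketch is an assumption:

/* Elastic reflection at the walls of an assumed cubic domain [0, L]^3.
   Reversing a velocity component preserves its magnitude, and hence
   the particle's kinetic energy. */
void reflect(Particle *p, double L) {
    if (p->x < 0.0 || p->x > L) p->vx = -p->vx;
    if (p->y < 0.0 || p->y > L) p->vy = -p->vy;
    if (p->z < 0.0 || p->z > L) p->vz = -p->vz;
}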

Particle collisions and 3D collision detection

While simulations of stellar systems can be assumed to be collisionless, collisions are an important phenomenon and are often modeled in molecular and fluid dynamics. The simulation described in this paper is a collision-based system for two reasons. First, many real-world simulations, including stellar systems, model the effects of collisions. Second, collisions between particles with a radius help prevent the interaction forces from reaching unreasonably high values, which would otherwise result in very high particle velocities.

The algorithm for collision detection takes $O(N^2)$ time, with each particle checked for possible intersection with every other particle in the system. Although a brute-force approach to collision detection is described here, it is relatively easy to use the octree from the Barnes-Hut algorithm to significantly reduce the number of particles that must be checked for collisions.

Two particles $P_1$ and $P_2$, with radii $R_1$ and $R_2$, intersect if:

$$(P_{2x} - P_{1x})^2 + (P_{2y} - P_{1y})^2 + (P_{2z} - P_{1z})^2 < (R_1 + R_2)^2$$

for i ← 0 to N − 2 do
  for j ← i + 1 to N − 1 do
    check_collision()
  end for
end for
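In C, the test and the pair loop might look as follows; the per-particle radius array and the resolve_collision() handler are hypothetical names introduced for this sketch.

/* Sphere-sphere intersection from the inequality above; comparing
   squared distances avoids the square root. */
int collides(const Particle *a, const Particle *b, double ra, double rb) {
    double dx = b->x - a->x;
    double dy = b->y - a->y;
    double dz = b->z - a->z;
    double r = ra + rb;
    return dx * dx + dy * dy + dz * dz < r * r;
}

void resolve_collision(Particle *a, Particle *b);  /* hypothetical handler */

/* Brute-force O(N^2) pass: each unordered pair is checked exactly once. */
void detect_collisions(Particle *p, const double *radius, int n) {
    for (int i = 0; i < n - 1; i++)
        for (int j = i + 1; j < n; j++)
            if (collides(&p[i], &p[j], radius[i], radius[j]))
                resolve_collision(&p[i], &p[j]);
}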

Parallel Platform

The parallel computing platform used for the simulation is a distributed-memory system, where processors communicate using message passing. Two widely used message-passing libraries for parallel programming are MPI (Message Passing Interface) and PVM (Parallel Virtual Machine). The solution described in this paper is based on MPI. While both libraries are suitable for writing portable parallel programs (and support heterogeneous systems), the choice was largely based on general familiarity, simplicity, and the number of readily available implementations.

The parallel programming model used is Single Program Multiple Data (SPMD), where all processors execute the same N-body simulation program. A distinction is made between the master (root) process, which coordinates the distribution of data for computation, and the slave processes, all of which (including the master) execute the simulation on their designated sets of particles. These distinctions are made using conditional checks within the program.
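A minimal SPMD skeleton of this structure is sketched below (illustrative; not the repository's actual main()):

#include <mpi.h>

/* All ranks run the same program. Rank 0 acts as the master that sets
   up and distributes the initial data; every rank, the master included,
   then simulates its own N/P particles. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {
        /* master: read or generate the initial particles, then broadcast */
    }
    /* every rank: iterate force computation, integration, and the
       position exchange described above for its slice of particles */

    MPI_Finalize();
    return 0;
}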
