material-point-method gpu gpu-acceleration particle-in-cell gpgpu gpgpu-computing gpgpu-sim simulation particle grid

Further development of https://github.com/MeyerFabian/snow, focusing on performance optimization on the GPU.

The code is currently hard to read because of the many tests and preprocessor directives. I am working on improving it. The written text of my master's thesis can be found here: Thesis.

Presentation can be found at: Pres.

Video of my bachelor's thesis: YouTube.

Overview

Implemented the MPM transfers using OpenGL compute shaders for physically based simulations of continuum material.
Tested PIC as well as APIC transfers (Jiang et al., 2016; Jiang et al., 2015).
Designed a shader generator for OpenGL to allow for various permutations of GPGPU compute programs.
Used test-driven development to monitor numerical precision and performance metrics with NVIDIA Nsight and OpenGL timer queries.
Implemented SVD from McAdams et al., 2011.
Tested different data layouts (SoA vs. AoS) using the reflection provided by magic_get. (I would recommend reflection macros instead, though.)
Applied binning and counting sort as preprocessing to improve memory coalescing and caching behavior. See Rama C. Hoetzlein, Fast Fixed-Radius Nearest Neighbors.
Applied stream compaction of active cell regions as a further preprocessing step.
Tested batching, which groups particles into fixed-size batches and accumulates their data at once. The trade-off is higher register pressure.
Accelerated the governing transfers by utilizing shared memory, which makes them independent of data order and gives up to 10x speedup over a naive GPU implementation.
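The binning and counting sort step mentioned above can be sketched on the CPU as follows. This is a minimal illustration, not the repository's shader code: the function name countingSortByCell and its parameters are hypothetical, and a GPU version would typically build the histogram with atomic counters and replace the serial prefix sum with a parallel scan.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not from the repository): returns particle ids ordered
// by the linearized grid cell they fall into.
// cellOfParticle[p] is the cell index of particle p and must be < cellCount.
std::vector<std::uint32_t> countingSortByCell(
    const std::vector<std::uint32_t>& cellOfParticle, std::uint32_t cellCount)
{
    // 1) Histogram: count the particles in each cell (shifted by one slot).
    std::vector<std::uint32_t> offsets(cellCount + 1, 0);
    for (std::uint32_t cell : cellOfParticle) {
        assert(cell < cellCount);
        ++offsets[cell + 1];
    }
    // 2) Prefix sum: afterwards offsets[c] is the first sorted slot of cell c.
    for (std::uint32_t c = 0; c < cellCount; ++c) {
        offsets[c + 1] += offsets[c];
    }
    // 3) Scatter: write each particle id into the next free slot of its cell.
    std::vector<std::uint32_t> cursor(offsets.begin(), offsets.end() - 1);
    std::vector<std::uint32_t> sorted(cellOfParticle.size());
    for (std::uint32_t p = 0; p < cellOfParticle.size(); ++p) {
        sorted[cursor[cellOfParticle[p]]++] = p;
    }
    return sorted;
}
```

After sorting, particles that share a cell sit next to each other in memory, which is what allows neighboring threads to read neighboring addresses.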

Comparison

Also take a look at GPUMPM, which was being developed at the same time and additionally uses warp operations to further speed up the material point method.

Step	This	Gao et al. 2018
Sort	Count/histogram for each variable	Count/histogram for selected variables
Filtering domain	Filter operation	Sparse voxel grid structure
Transfers	Shared memory only	Warp-shuffle operations

Performance

Method	Time (μs)	Speedup	VRAM	L2	SM
global	44,442	-	4.6%	34.4%	7.7%
snow	45,342	0.98x	25.1%	42.4%	11.7%
snow sorted	23,007	1.97x	43.8%	59.0%	23.9%
global sorted	20,484	2.21x	7.0%	44.0%	16.1%
P2G-pull	4,747	9.55x	3.7%	6.7%	39.4%
P2G-atomic*	3,148	14.40x	5.3%	6.7%	65.0%
P2G-sync*	2,595	17.47x	5.9%	7.6%	67.0%

P2G transfers of one million uniformly positioned particles with random velocities in [-1.0; 1.0] in a 128x128x128 grid. The particles form a rotated (unsorted) cube with four particles per cell. Block size is (8,4,4). Methods marked with a star (*) are executed with batching = 4.
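To make the grid and block sizes from the caption concrete, the sketch below maps a grid cell to the block that owns it and to its position inside that block. Only the two sizes come from the caption. The type UVec3, the function names, and the x-fastest linearization are assumptions made for illustration and may differ from the layout used in the shaders.

```cpp
#include <cassert>
#include <cstdint>

struct UVec3 { std::uint32_t x, y, z; };

constexpr UVec3 kGridDim{128, 128, 128};  // cells per axis (from the caption)
constexpr UVec3 kBlockDim{8, 4, 4};       // cells per block (from the caption)

static_assert(kGridDim.x % kBlockDim.x == 0 &&
              kGridDim.y % kBlockDim.y == 0 &&
              kGridDim.z % kBlockDim.z == 0,
              "grid must be an integer number of blocks along each axis");

// 128/8, 128/4, 128/4 = 16, 32, 32 blocks per axis.
constexpr UVec3 kBlocksPerAxis{kGridDim.x / kBlockDim.x,
                               kGridDim.y / kBlockDim.y,
                               kGridDim.z / kBlockDim.z};

inline std::uint32_t blockCount() {
    return kBlocksPerAxis.x * kBlocksPerAxis.y * kBlocksPerAxis.z;
}

// Linear index of the block that contains `cell` (x varies fastest; assumed).
inline std::uint32_t blockIndex(UVec3 cell) {
    assert(cell.x < kGridDim.x && cell.y < kGridDim.y && cell.z < kGridDim.z);
    const UVec3 b{cell.x / kBlockDim.x, cell.y / kBlockDim.y, cell.z / kBlockDim.z};
    return b.x + kBlocksPerAxis.x * (b.y + kBlocksPerAxis.y * b.z);
}

// Linear index of `cell` inside its block, in the range 0..127.
inline std::uint32_t localIndex(UVec3 cell) {
    const UVec3 l{cell.x % kBlockDim.x, cell.y % kBlockDim.y, cell.z % kBlockDim.z};
    return l.x + kBlockDim.x * (l.y + kBlockDim.y * l.z);
}
```

With these sizes the grid splits into 16 x 32 x 32 = 16,384 blocks of 128 cells each.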

Abstract

The material point method allows for physically based simulations. It has found its way into computer graphics and has expanded rapidly since then. Its hybrid use of Lagrangian particles as persistent storage and a uniform Eulerian background grid makes it possible to solve various partial differential equations with ease.

The material point method suffers from high execution times and is thus only viable for hero shots. The method is, however, highly parallelizable. This thesis therefore proposes how to accelerate the material point method using GPGPU techniques. At the core of the method are the grid and particle transfers that interpolate between the two structures. These transfers are executed multiple times per physical time step. Preprocessing steps are worthwhile if their computing time is outweighed by the time they save in the transfers.
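As a rough illustration of what such a transfer does, here is a minimal 1D particle-to-grid (P2G) scatter in the PIC style. It is a simplified sketch and not the thesis implementation: it uses linear hat weights over two nodes for brevity, whereas MPM codes commonly use higher-order B-spline kernels in 3D, and the types Particle and GridNode as well as the function particleToGrid are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Particle { double x; double mass; double velocity; };
struct GridNode { double mass = 0.0; double momentum = 0.0; };

// Scatters particle mass and momentum to the two nearest grid nodes.
// Nodes sit at positions i * h for i in [0, nodeCount); h is the grid spacing.
std::vector<GridNode> particleToGrid(const std::vector<Particle>& particles,
                                     std::size_t nodeCount, double h)
{
    std::vector<GridNode> grid(nodeCount);
    for (const Particle& p : particles) {
        const double s = p.x / h;                     // position in grid units
        assert(s >= 0.0);
        const auto base = static_cast<std::size_t>(std::floor(s));
        assert(base + 1 < nodeCount);
        const double w1 = s - static_cast<double>(base);  // weight of right node
        const double w0 = 1.0 - w1;                       // weight of left node
        grid[base].mass         += w0 * p.mass;
        grid[base].momentum     += w0 * p.mass * p.velocity;
        grid[base + 1].mass     += w1 * p.mass;
        grid[base + 1].momentum += w1 * p.mass * p.velocity;
    }
    return grid;
}
```

Because the weights of each particle sum to one, total mass and momentum are preserved by the scatter. On a GPU, several particles may add to the same node concurrently, so a scatter like this needs atomics or a conflict-free ordering of the writes.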

Deep sorting with counting sort increases coalescing and L2 cache hit rates. Binning divides the grid into blocks for shared memory filtering techniques. None of the operations rely on a fixed bin size. As another preprocessing step, only grid blocks that contain particles are executed.
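The last step, executing only blocks that contain particles, is a stream compaction. A minimal CPU sketch under stated assumptions: the per-block particle counts are already available from binning, and the function name compactActiveBlocks is hypothetical. A GPU version would typically replace the serial scan with a parallel prefix sum.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Returns the indices of all blocks that hold at least one particle,
// in ascending order. particlesPerBlock[b] is the particle count of block b.
std::vector<std::uint32_t> compactActiveBlocks(
    const std::vector<std::uint32_t>& particlesPerBlock)
{
    const std::size_t n = particlesPerBlock.size();
    assert(n <= std::numeric_limits<std::uint32_t>::max());

    // Exclusive scan over the "block is active" flags:
    // slot[b] is the output position of block b if it is active.
    std::vector<std::uint32_t> slot(n + 1, 0);
    for (std::size_t b = 0; b < n; ++b) {
        slot[b + 1] = slot[b] + (particlesPerBlock[b] > 0 ? 1u : 0u);
    }

    // Scatter the ids of active blocks into a dense list.
    std::vector<std::uint32_t> active(slot[n]);
    for (std::size_t b = 0; b < n; ++b) {
        if (particlesPerBlock[b] > 0) {
            active[slot[b]] = static_cast<std::uint32_t>(b);
        }
    }
    return active;
}
```

The dense list can then drive the dispatch, so that work is launched only for active.size() blocks instead of every block in the grid.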

Project

Ready for Windows: Visual Studio (2017 tested) and NMake (compile_commands activated). Theoretically portable to Unix systems (no dependency restrictions).

Dependencies

C++17

GLEW (tested with 2.1.0, built from source)

GLFW (tested with 3.2.1, built from source)

ASSIMP (tested with 4.1.0, built from source)

GLM (tested with 0.9.9.0, header only)

A GPU with compute shader support (introduced with OpenGL 4.3)

Included Dependencies

stb_image

voxelizer (A precomputed voxelization of the Stanford-Bunny is already included in resources/model/)

magic_get

Possible Improvements

BufferDataInterface should rely on composition instead of inheritance, or go down the ECS route.
Tests should rely more on polymorphism.
Test out warp operations.
Start shared memory G2P transfers with threads assigned to particles; threads that correspond to no particle terminate immediately. Use a reduction technique (mapReduce.cpp) to count how many particles are in a grid block. Assign

About

Code for my M.Sc. thesis "GPU Acceleration of the Material Point Method"


Languages

C++ 85.0%, GLSL 8.2%, C 5.5%, Common Lisp 0.8%, CMake 0.5%, Shell 0.0%