trungtle / Project1-CUDA-Flocking

An introduction to CUDA programming by way of a Boids Flocking simulation

University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 1 - Flocking

  • Trung Le
  • Windows 10 Home, i7-4790 CPU @ 3.60GHz, 12GB RAM, GTX 980 Ti (Personal desktop)

Flocking

Description

This project implements three different flocking algorithms in CUDA: a brute-force (naive) neighbor search, a scattered uniform grid, and a coherent uniform grid.
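
As a rough sketch of the brute-force version (all names and tuning constants below are assumptions for illustration, not necessarily the project's exact code), every boid checks every other boid against the three flocking rules:

```cuda
#include <glm/glm.hpp>

// Illustrative tuning constants (assumed values, not necessarily the project's).
#define rule1Distance 5.0f   // cohesion radius
#define rule2Distance 3.0f   // separation radius
#define rule3Distance 5.0f   // alignment radius
#define rule1Scale 0.01f
#define rule2Scale 0.1f
#define rule3Scale 0.1f
#define maxSpeed 1.0f

// One thread per boid; O(N^2) overall because each thread scans all N boids.
__global__ void kernUpdateVelocityBruteForce(int N, const glm::vec3 *pos,
                                             const glm::vec3 *vel1, glm::vec3 *vel2) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= N) return;

  glm::vec3 center(0.0f), separate(0.0f), align(0.0f);
  int rule1Count = 0, rule3Count = 0;

  for (int j = 0; j < N; j++) {
    if (j == i) continue;
    float d = glm::distance(pos[j], pos[i]);
    if (d < rule1Distance) { center += pos[j]; rule1Count++; }   // cohesion
    if (d < rule2Distance) { separate -= (pos[j] - pos[i]); }    // separation
    if (d < rule3Distance) { align += vel1[j]; rule3Count++; }   // alignment
  }

  glm::vec3 dv(0.0f);
  if (rule1Count > 0) dv += (center / (float)rule1Count - pos[i]) * rule1Scale;
  dv += separate * rule2Scale;
  if (rule3Count > 0) dv += (align / (float)rule3Count) * rule3Scale;

  glm::vec3 newVel = vel1[i] + dv;
  if (glm::length(newVel) > maxSpeed) newVel = glm::normalize(newVel) * maxSpeed;
  vel2[i] = newVel;  // write to the second buffer so other threads still read old velocities
}
```

The grid-based versions keep the same rule logic but replace the inner loop over all N boids with a loop over nearby grid cells only.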


---- General information for CUDA device ----

  • Device name: GeForce GTX 980 Ti
  • Compute capability: 5.2
  • Compute mode: Default
  • Clock rate: 1076000 kHz (1.076 GHz)
  • Integrated: 0
  • Device copy overlap: Enabled
  • Kernel execution timeout: Enabled

---- Memory information for CUDA device ----

  • Total global memory: 6442450944 bytes (6 GB)
  • Total constant memory: 65536 bytes (64 KB)
  • Multiprocessor count: 22
  • Shared memory per multiprocessor: 98304 bytes (96 KB)
  • Registers per multiprocessor: 65536
  • Max threads per multiprocessor: 2048
  • Max grid dimensions: [2147483647, 65535, 65535]
  • Max threads per block: 1024
  • Max registers per block: 65536
  • Max thread dimensions: [1024, 1024, 64]
  • Threads per block: 512
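
For reference, a report like the one above can be generated with a minimal cudaGetDeviceProperties query along these lines (a sketch; field selection and formatting are mine, not the project's exact logging code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // query device 0

  printf("Device name: %s\n", prop.name);
  printf("Compute capability: %d.%d\n", prop.major, prop.minor);
  printf("Clock rate: %d kHz\n", prop.clockRate);
  printf("Total global memory: %zu bytes\n", prop.totalGlobalMem);
  printf("Total constant memory: %zu bytes\n", prop.totalConstMem);
  printf("Multiprocessor count: %d\n", prop.multiProcessorCount);
  printf("Shared memory per multiprocessor: %zu bytes\n", prop.sharedMemPerMultiprocessor);
  printf("Max threads per multiprocessor: %d\n", prop.maxThreadsPerMultiProcessor);
  printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
  printf("Max registers per block: %d\n", prop.regsPerBlock);
  printf("Max grid dimensions: [%d, %d, %d]\n",
         prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
  printf("Max thread dimensions: [%d, %d, %d]\n",
         prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
  return 0;
}
```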

Performance analysis

To run the performance analysis, I wrapped the simulation step with CUDA timing and turned off visualization.
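
Concretely, the per-frame timing can be done with CUDA events around the simulation step, roughly like this (a sketch; stepSimulation is a stand-in name for whichever step function the build calls):

```cuda
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
stepSimulation(dt);            // one simulation step's kernel launches (stand-in name)
cudaEventRecord(stop);
cudaEventSynchronize(stop);    // block until the GPU has finished the step

float ms = 0.0f;
cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds

cudaEventDestroy(start);
cudaEventDestroy(stop);
```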

The boid counts tested are: 1000, 5000, 50000, 100000, 150000. The block sizes tested are: 1, 128, 512 (maxed at 512 threads per block on my machine).

Table showing simulation time per frame (in ms) vs. number of boids. Block size is 128.

Boid count   Coherent grid   Scattered uniform grid   Naive
5000         0.2             0.2                      4.3
50000        1.2             1.5                      166
100000       1.8             2.6                      654
500000       12.5            20.6                     (crash)
1000000      52              74                       (crash)


For each implementation, performance drops sharply as the number of boids increases. This is because computing each boid's new velocity requires looping through a list of potential neighbors that can affect it. The biggest optimization in each successive implementation is improving how efficiently we iterate through these neighbors: partitioning the boids into a uniform grid, and then rearranging their data so that each cell's boids are contiguous (coherent) in memory.
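
A sketch of that neighbor iteration (with illustrative names; it assumes boids have been sorted by grid cell, and that gridCellStart/gridCellEnd record each cell's index range with -1 marking empty cells):

```cuda
#include <glm/glm.hpp>

// Map a 3D cell coordinate to a flat cell index.
__device__ int gridIndex3Dto1D(int x, int y, int z, int gridResolution) {
  return x + y * gridResolution + z * gridResolution * gridResolution;
}

// Visit only the boids in the 3x3x3 block of cells around this boid's cell,
// instead of all N boids (illustrative sketch, not the project's exact kernel).
__device__ void visitNeighborCandidates(glm::vec3 myPos, glm::vec3 gridMin,
                                        float inverseCellWidth, int gridResolution,
                                        const int *gridCellStart, const int *gridCellEnd) {
  glm::ivec3 cell = glm::ivec3((myPos - gridMin) * inverseCellWidth);  // which cell am I in?

  for (int z = cell.z - 1; z <= cell.z + 1; z++) {
    for (int y = cell.y - 1; y <= cell.y + 1; y++) {
      for (int x = cell.x - 1; x <= cell.x + 1; x++) {
        if (x < 0 || y < 0 || z < 0 ||
            x >= gridResolution || y >= gridResolution || z >= gridResolution) continue;
        int c = gridIndex3Dto1D(x, y, z, gridResolution);
        if (gridCellStart[c] < 0) continue;  // empty cell
        for (int i = gridCellStart[c]; i <= gridCellEnd[c]; i++) {
          // i indexes a candidate neighbor: apply the three flocking rules here.
        }
      }
    }
  }
}
```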

Changing the block size doesn't seem to increase performance overall. However, I did find an interesting pattern: at every block size that is a power of 2, performance is optimal. I tested this with 1,000,000 boids in the coherent grid (~46 ms max) and the scattered uniform grid (~74 ms max); a likely explanation is sketched after the table below.

Table showing block size vs. simulation time per frame (in ms) at 1,000,000 boids.

Block size   Coherent grid   Scattered uniform grid
32           45              74
35           74              113
64           44              73
100          54              87
128          46              74
256          45              73
400          54              85
512          46              73
600          68              105
1024         46              72
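
One plausible reason for this pattern is the 32-thread warp: the GPU schedules whole warps, so any block size that is a multiple of 32 (which includes every power of two from 32 up) fills its warps completely, while sizes like 35, 100, 400, or 600 leave part of the last warp idle in every block. A minimal launch-configuration sketch (blockSize, the kernel name, and its arguments are illustrative):

```cuda
const int blockSize = 128;  // threads per block; 128 = 4 full warps of 32 threads
dim3 fullBlocksPerGrid((numBoids + blockSize - 1) / blockSize);  // ceil(numBoids / blockSize)

// A block of 35 threads still occupies 2 warps (64 hardware thread slots),
// so 29 of those slots do no work in every block; multiples of 32 avoid that waste.
kernUpdateVelNeighborSearchCoherent<<<fullBlocksPerGrid, blockSize>>>(
    numBoids, dev_pos, dev_vel1, dev_vel2);
```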

Comparing the coherent and scattered uniform grids, I saw a large performance improvement with the coherent implementation as the number of boids increases. From my performance analysis, at 1,000,000 boids and a kernel block size of 128, the coherent uniform grid is ~20 ms faster per frame than the scattered uniform grid. This is a big improvement! It comes from taking advantage of spatial coherence: each neighboring boid's position and velocity can be read sequentially from memory.
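
The difference shows up in the inner loop over a cell's boids (a sketch with assumed array names): the scattered grid chases an index indirection per neighbor and its position/velocity reads jump around global memory, while the coherent grid reshuffles the data into cell order beforehand so consecutive reads touch contiguous memory and can coalesce.

```cuda
// Scattered uniform grid: grid order -> original boid index, then a scattered read.
for (int i = gridCellStart[c]; i <= gridCellEnd[c]; i++) {
  int b = particleArrayIndices[i];        // extra indirection per candidate neighbor
  glm::vec3 neighborPos = dev_pos[b];     // non-contiguous global memory accesses
  glm::vec3 neighborVel = dev_vel[b];
  // ... apply the flocking rules ...
}

// Coherent uniform grid: positions/velocities were reordered to match grid order,
// so consecutive i touches consecutive memory.
for (int i = gridCellStart[c]; i <= gridCellEnd[c]; i++) {
  glm::vec3 neighborPos = dev_coherentPos[i];
  glm::vec3 neighborVel = dev_coherentVel[i];
  // ... apply the flocking rules ...
}
```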

Note: I also updated the build's architecture flag to -arch=sm_52 to make it compatible with my machine (compute capability 5.2).
