The Game of Life, also known simply as Life, is a cellular automaton devised by the British mathematician John Horton Conway in 1970. The game is a zero-player game, meaning that its evolution is determined by its initial state, requiring no further input. One interacts with the Game of Life by creating an initial configuration and observing how it evolves, or, for advanced players, by creating patterns with particular properties.
The universe of the Game of Life is an infinite, two-dimensional orthogonal grid of square cells, each of which is in one of two possible states, alive or dead. Every cell interacts with its eight neighbours, which are the cells that are horizontally, vertically, or diagonally adjacent. At each step in time, the following transitions occur:
- Any live cell with fewer than two live neighbors dies, as if by underpopulation.
- Any live cell with two or three live neighbors lives on to the next generation.
- Any live cell with more than three live neighbors dies, as if by overpopulation.
- Any dead cell with exactly three live neighbors becomes a live cell, as if by reproduction.
The initial pattern constitutes the seed of the system. The first generation is created by applying the above rules simultaneously to every cell in the seed; births and deaths occur simultaneously, and the discrete moment at which this happens is sometimes called a tick. Each generation is a pure function of the preceding one. The rules continue to be applied repeatedly to create further generations.
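The four rules above can be sketched in plain C99 as follows. This is only an illustration on a small finite grid whose outside is treated as dead, not the project's OpenCL kernel; the names `next_state`, `count_neighbors` and `life_step` are hypothetical.

```c
/* Next state of one cell given its current state (0 = dead, 1 = alive)
   and its number of live neighbors (0..8). */
static int next_state(int alive, int live_neighbors)
{
    if (alive)
        return live_neighbors == 2 || live_neighbors == 3; /* survival */
    return live_neighbors == 3;                            /* reproduction */
}

/* Count the live neighbors of (r, c) on a finite rows x cols grid.
   Assumption: cells outside the grid count as dead. */
static int count_neighbors(const unsigned char *grid, int rows, int cols,
                           int r, int c)
{
    int n = 0;
    for (int dr = -1; dr <= 1; dr++)
        for (int dc = -1; dc <= 1; dc++) {
            int rr = r + dr, cc = c + dc;
            if ((dr || dc) && rr >= 0 && rr < rows && cc >= 0 && cc < cols)
                n += grid[rr * cols + cc];
        }
    return n;
}

/* One tick: read only from src and write only to dst, so that births and
   deaths happen simultaneously for every cell. */
static void life_step(const unsigned char *src, unsigned char *dst,
                      int rows, int cols)
{
    for (int r = 0; r < rows; r++)
        for (int c = 0; c < cols; c++)
            dst[r * cols + c] = (unsigned char)next_state(
                src[r * cols + c], count_neighbors(src, rows, cols, r, c));
}
```

Using two separate buffers is what makes each generation a pure function of the preceding one: no cell ever sees a neighbor that has already been updated.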
Many different types of patterns occur in the Game of Life, and they are classified according to their behaviour. This implementation uses the following patterns:
- The Gosper glider gun produces its first glider on the 15th generation, and another glider every 30th generation from then on.
- Diehard is a pattern that eventually disappears, rather than stabilizing, after 130 generations, which is conjectured to be maximal for patterns with seven or fewer cells.
- Acorn takes 5206 generations to generate 633 cells, including 13 escaped gliders.
Compile the project:
cd gameoflife_opencl
make
Run the Game of Life with specific options:
./main seed rows cols generations lws [i] [p]
The seed option specifies which configuration to use:
- g - Gosper
- a - Acorn
- d - DieHard
The lws option must be read as lws^2^ (the value is squared to obtain the local work size). The i option works around a bug found on the Intel Graphics Gen6 graphics card, and the p option disables grid visualization.
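One way to read the lws^2^ remark is that the value is used on both axes of a two-dimensional work-group. The sketch below assumes exactly that, and also assumes the global size is rounded up to a multiple of lws, which is a common OpenCL idiom but is not confirmed by this README. The helpers `round_up` and `work_sizes` are hypothetical.

```c
#include <stddef.h>

/* Round n up to the nearest multiple of m (m must be > 0). */
static size_t round_up(size_t n, size_t m)
{
    return (n + m - 1) / m * m;
}

/* Fill 2D local and global work sizes for a rows x cols grid.
   Assumption: lws is applied per axis, so a work-group holds
   lws * lws work-items. */
static void work_sizes(size_t rows, size_t cols, size_t lws,
                       size_t local[2], size_t global[2])
{
    local[0] = lws;
    local[1] = lws;
    global[0] = round_up(cols, lws);
    global[1] = round_up(rows, lws);
}
```

With a 2000x2000 grid and lws 6, for example, this gives work-groups of 36 work-items and a global size of 2004 per axis, so a kernel launched this way would need a bounds check for the padding work-items.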
This implementation of the Game of Life is written in ANSI C for the host (CPU) code and in OpenCL for the device (GPU) code. Four kernels have been implemented:
- init kernel for initialization of grid with specified seed.
- where_expand kernel calculates on which sides the grid will have to expand.
- expand kernel creates a new grid where sides indicated by where_expand are expanded.
- generation kernel is the core of the project and applies the rules to advance the grid to the next generation.
The where_expand kernel scans the edges of the matrix, checks for each side whether it contains at least one live cell, and writes the result of the scan into a buffer. This buffer is passed as a parameter to the expand kernel, which creates a new matrix with the rows and columns needed for the expansion. The values of the previous matrix are preserved, while the cells in the additional rows and columns are initialized to 0.
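The edge scan can be sketched on the host in plain C99. This is not the OpenCL kernel: the flag order (`TOP`, `BOTTOM`, `LEFT`, `RIGHT`), the function names, and the choice of adding exactly one row or column per flagged side are assumptions made for illustration.

```c
enum { TOP = 0, BOTTOM = 1, LEFT = 2, RIGHT = 3 };

/* Set flags[side] to 1 if at least one live cell lies on that edge. */
static void where_expand(const unsigned char *grid, int rows, int cols,
                         int flags[4])
{
    flags[TOP] = flags[BOTTOM] = flags[LEFT] = flags[RIGHT] = 0;
    for (int c = 0; c < cols; c++) {
        if (grid[c]) flags[TOP] = 1;
        if (grid[(rows - 1) * cols + c]) flags[BOTTOM] = 1;
    }
    for (int r = 0; r < rows; r++) {
        if (grid[r * cols]) flags[LEFT] = 1;
        if (grid[r * cols + cols - 1]) flags[RIGHT] = 1;
    }
}

/* Dimensions of the expanded grid.
   Assumption: one extra row or column per flagged side. */
static void expanded_size(int rows, int cols, const int flags[4],
                          int *new_rows, int *new_cols)
{
    *new_rows = rows + flags[TOP] + flags[BOTTOM];
    *new_cols = cols + flags[LEFT] + flags[RIGHT];
}
```

Expanding only the sides that are actually touched keeps the grid as small as the pattern allows while still emulating an unbounded universe.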
To speed up the expansion, the whole new matrix is first initialized to zero, and clEnqueueCopyBufferRect then copies the old, smaller matrix into the corresponding sub-rectangle of the new one. However, on the Intel HD Graphics 520 graphics card clEnqueueCopyBufferRect triggers a bug when allocating memory and causes a segmentation fault. To work around this problem, the i option initializes the new matrix with a method that writes the values into it one by one instead of calling clEnqueueCopyBufferRect.
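The effect of the zero-fill followed by a rectangular copy can be illustrated with a host-side analogue in plain C. The sketch below does not call OpenCL; it uses `calloc` and a row-by-row `memcpy` to show where the old cells land in the new matrix. The function name `expand_grid` and its offset parameters are hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/* Allocate a zeroed new_rows x new_cols grid and copy the old rows x cols
   grid into it with its top-left corner at (row_off, col_off).
   Returns NULL if the old grid does not fit or allocation fails.
   The caller owns the returned buffer and must free() it. */
static unsigned char *expand_grid(const unsigned char *old, int rows, int cols,
                                  int new_rows, int new_cols,
                                  int row_off, int col_off)
{
    if (row_off < 0 || col_off < 0 ||
        row_off + rows > new_rows || col_off + cols > new_cols)
        return NULL;

    unsigned char *grid = calloc((size_t)new_rows * (size_t)new_cols, 1);
    if (!grid)
        return NULL;

    /* Each source row is contiguous, but destination rows are new_cols apart. */
    for (int r = 0; r < rows; r++)
        memcpy(grid + (size_t)(r + row_off) * (size_t)new_cols + (size_t)col_off,
               old + (size_t)r * (size_t)cols,
               (size_t)cols);
    return grid;
}
```

A row offset of 1 corresponds to an expansion on the top side and a column offset of 1 to an expansion on the left side; expansions on the bottom or right only enlarge the new dimensions.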
The generation kernel is the core of the algorithm, as it computes the transition from one generation to the next.
The automaton works locally on each cell: at every step of the loop it examines the cell and its 8 neighbors and decides which state the cell should assume.
There are 2 different versions of the automaton, each one being more suited for different hardware:
- A global memory implementation, without any particular optimizations (target: newer GPUs with hardware caching, devices without a local memory like CPUs)
- A local memory caching implementation, theoretically more optimized (target: older GPUs without hardware caching)
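The idea behind the local memory version can be emulated on the CPU: each work-group first stages its tile of cells plus a one-cell halo into a small buffer, and the neighbor counts are then read from that buffer only. The sketch below is a sequential C99 emulation under that assumption, not the project's kernel; `TILE` stands in for lws, and the small array plays the role of OpenCL local memory.

```c
#define TILE 4 /* hypothetical work-group edge, standing in for lws */

/* Read a cell, treating coordinates outside the grid as dead. */
static unsigned char cell_at(const unsigned char *g, int rows, int cols,
                             int r, int c)
{
    return (r >= 0 && r < rows && c >= 0 && c < cols) ? g[r * cols + c] : 0;
}

/* One generation computed tile by tile. */
static void life_step_tiled(const unsigned char *src, unsigned char *dst,
                            int rows, int cols)
{
    unsigned char tile[TILE + 2][TILE + 2];

    for (int r0 = 0; r0 < rows; r0 += TILE)
        for (int c0 = 0; c0 < cols; c0 += TILE) {
            /* Stage: copy the tile and its halo from "global" memory. */
            for (int i = 0; i < TILE + 2; i++)
                for (int j = 0; j < TILE + 2; j++)
                    tile[i][j] = cell_at(src, rows, cols,
                                         r0 + i - 1, c0 + j - 1);

            /* Compute: one iteration per work-item, skipping padding cells. */
            for (int i = 1; i <= TILE && r0 + i - 1 < rows; i++)
                for (int j = 1; j <= TILE && c0 + j - 1 < cols; j++) {
                    int n = 0;
                    for (int di = -1; di <= 1; di++)
                        for (int dj = -1; dj <= 1; dj++)
                            if (di || dj)
                                n += tile[i + di][j + dj];
                    dst[(r0 + i - 1) * cols + (c0 + j - 1)] =
                        (unsigned char)(n == 3 || (tile[i][j] && n == 2));
                }
        }
}
```

Each cell is read from global memory about once per tile instead of up to nine times, which is why this layout tends to help on devices without a hardware cache and matters less on devices that have one.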
The performance tests were run on two very different graphics cards: the Nvidia GT940M and the Intel HD Graphics 520. The tests used a 2000x2000 matrix, and throughput was calculated as follows:
//Generation Kernel
(10.0*memsize)/runtime_ns(generation_evt)
//Expand Kernel
(2.0*memsize)/runtime_ns(generation_evt)
//Where Expand Kernel
(2.0*memsize)/runtime_ns(generation_evt)
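Since one byte per nanosecond equals one GB/s, the formulas above yield GB/s directly when memsize is in bytes. The helper below is a minimal sketch of that arithmetic; `throughput_gbs` is a hypothetical name, and the runtime is passed as a plain number rather than read from an OpenCL event. The multiplier is presumably the number of memory accesses per cell (for instance 9 reads and 1 write for the generation kernel), but that reading is an assumption.

```c
#include <stddef.h>

/* Throughput in GB/s: (accesses_per_cell * memsize bytes) / runtime in ns.
   Returns 0.0 for a zero runtime to avoid dividing by zero. */
static double throughput_gbs(double accesses_per_cell, size_t memsize,
                             unsigned long long runtime_ns)
{
    if (runtime_ns == 0)
        return 0.0;
    return accesses_per_cell * (double)memsize / (double)runtime_ns;
}
```

For example, a 4,000,000-byte buffer processed in 4 ms (4,000,000 ns) with a multiplier of 10 gives 10 GB/s.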
Both video cards have pros and cons. The NVIDIA card is faster on the generation kernel (the local memory version improves with a high lws) and on the expand kernel that uses clEnqueueCopyBufferRect, while the Intel card is faster on the where_expand kernel and on the expand kernel that uses the workaround for the bug. In terms of throughput, the NVIDIA card is less efficient than the Intel card on all kernels except the expand kernel that uses clEnqueueCopyBufferRect. With local memory, the generation kernel performs best with lws 2 or with the default lws.
LWS | Nvidia | Intel |
---|---|---|
No LWS | 3ms | 15ms |
2 | 5ms | 7ms |
4 | 7ms | 6ms |
6 | 5ms | 13ms |
8 | 3ms | 6ms |
16 | 3ms | 8ms |
LWS | Nvidia | Intel |
---|---|---|
No LWS | 15GB/s | 33GB/s |
2 | 9GB/s | 29GB/s |
4 | 9GB/s | 32GB/s |
6 | 9GB/s | 32GB/s |
8 | 20GB/s | 29GB/s |
16 | 12GB/s | 33GB/s |
LWS | Nvidia | Intel |
---|---|---|
No LWS | 3ms | 15ms |
2 | 39ms | 39ms |
4 | 10ms | 11ms |
6 | 5ms | 8ms |
8 | 3ms | 7ms |
16 | 3ms | 5ms |
LWS | Nvidia | Intel |
---|---|---|
No LWS | 3GB/s | 3GB/s |
2 | 3GB/s | 4GB/s |
4 | 9GB/s | 14GB/s |
6 | 15GB/s | 24GB/s |
8 | 18GB/s | 39GB/s |
16 | 18GB/s | 39GB/s |
LWS | Nvidia | Intel |
---|---|---|
No LWS | 2.9ms | 0.4ms |
2 | 3ms | 0.4ms |
4 | 3ms | 0.4ms |
6 | 2.8ms | 0.4ms |
8 | 2.6ms | 0.4ms |
16 | 2.9ms | 0.4ms |
LWS | Nvidia | Intel |
---|---|---|
No LWS | 10GB/s | 65GB/s |
2 | 11GB/s | 65GB/s |
4 | 11GB/s | 64GB/s |
6 | 11GB/s | 65GB/s |
8 | 10GB/s | 65GB/s |
16 | 10GB/s | 64GB/s |
LWS | Nvidia | Intel |
---|---|---|
No LWS | 18ms | 3ms |
2 | 2ms | 3.8ms |
4 | 1.7ms | 3.5ms |
6 | 2.1ms | 3.5ms |
8 | 3ms | 3.7ms |
16 | 2.7ms | 3.7ms |
LWS | Nvidia | Intel |
---|---|---|
No LWS | 17GB/s | 8GB/s |
2 | 16GB/s | 8GB/s |
4 | 17GB/s | 9GB/s |
6 | 15GB/s | 9GB/s |
8 | 17GB/s | 8GB/s |
16 | 16GB/s | 8GB/s |
LWS | Nvidia | Intel |
---|---|---|
No LWS | 3ms | 1.7ms |
2 | 3.2ms | 1.9ms |
4 | 3ms | 2ms |
6 | 3.2ms | 1.8ms |
8 | 3.5ms | 1.8ms |
16 | 3ms | 1.9ms |
LWS | Nvidia | Intel |
---|---|---|
No LWS | 10GB/s | 17GB/s |
2 | 11GB/s | 16GB/s |
4 | 10GB/s | 16GB/s |
6 | 10GB/s | 17GB/s |
8 | 19GB/s | 17GB/s |
16 | 10GB/s | 16GB/s |