kocatepedogu / cellular-automaton-processor

A simple processor with a grid of cores that can only interact with their immediate neighbors

Cellular Automaton Processor

The Cellular Automaton Processor is a very simple application-specific processor designed for tasks that can be modelled as cellular automata. It consists of a large number of cores that can only interact with their immediate neighbors. The capabilities of individual cores are limited to simple math and logic operations. Even though cellular automata can be Turing-complete, the processor is not intended for general-purpose computation. It is suited to the numerical solution of partial differential equations and to various cellular-automaton-based programs.

Examples


Two-dimensional wave equation solution.

The simulation starts with two nearby wave sources that produce interfering waves. A register length of 32 bits (16 bits for the integer part and 16 bits for the fractional part) was the minimum that produced numerically stable results. The timestep was chosen to be 1/128.


Two-dimensional heat equation solution.

Initially, a circle at high temperature is surrounded by a very cold environment. The screenshot was taken a few seconds after the start. The simulation can run with 16-bit registers, using 8 bits for the fractional part. Even 12-bit registers produce reasonable-looking results, but these become inaccurate after a few minutes.
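The heat equation fits the neighbor registers well, because each cell only needs its own value and the four values it can read through x-, x+, y- and y+. The Python sketch below is not the assembly program from the examples directory. It assumes the standard explicit finite-difference update, uses floats instead of fixed-point registers, and introduces a hypothetical coefficient `alpha` purely for illustration.

```python
def heat_step(u, alpha=0.125):
    """One explicit finite-difference step of the 2D heat equation on a torus.

    u is a list of rows (u[y][x]). alpha is a hypothetical coefficient that
    stands for diffusivity * dt / dx**2; it is an assumption of this sketch.
    """
    h, w = len(u), len(u[0])
    new = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # The four values a core would read through x-, x+, y-, y+.
            xm = u[y][(x - 1) % w]
            xp = u[y][(x + 1) % w]
            ym = u[(y - 1) % h][x]
            yp = u[(y + 1) % h][x]
            new[y][x] = u[y][x] + alpha * (xm + xp + ym + yp - 4 * u[y][x])
    return new


grid = [[0.0] * 5 for _ in range(5)]
grid[2][2] = 16.0          # a single hot cell in a cold environment
grid = heat_step(grid)
print(grid[2][2], grid[2][1], sum(map(sum, grid)))
```

The hot cell cools while its four neighbors warm up, and the total over the toroidal grid stays the same because nothing leaves through a boundary.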


Glider pattern in Conway's Game of Life. The space is a toroidal array, so the pattern repeats itself forever.

This is the simplest example and can be run with 6-bit registers on a 25x25 grid. It also works with 4-bit registers if the grid size is reduced to 8x8. It is currently the only example that has been tested on an actual FPGA (Basys3 with F4PGA). The total number of cores in the simulations is 25x25=625.
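The "repeats itself forever" behavior on a toroidal array can be checked with a small reference model. The Python sketch below is only an illustration, not the assembly program. It counts all eight neighbors directly and does not model how a core that sees only x-, x+, y- and y+ would gather the diagonal values. The 8x8 grid size matches the reduced configuration mentioned above.

```python
def life_step(cells, n):
    """Advance Conway's Game of Life by one generation on an n x n torus.

    cells is a set of (x, y) coordinates of live cells.
    """
    counts = {}
    for (x, y) in cells:
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                if dx or dy:
                    key = ((x + dx) % n, (y + dy) % n)
                    counts[key] = counts.get(key, 0) + 1
    # A cell is alive in the next generation with exactly 3 live neighbors,
    # or with 2 live neighbors if it is already alive.
    return {c for c, k in counts.items() if k == 3 or (k == 2 and c in cells)}


N = 8
glider = {(1, 0), (2, 1), (0, 2), (1, 2), (2, 2)}
state = glider
for _ in range(4 * N):
    state = life_step(state, N)
print(state == glider)
```

A glider moves one cell diagonally every four generations, so after 4*N generations on an N x N torus it is back where it started.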

The above screenshots are taken from Verilator simulations. The assembly language sources of the example programs can be found in the examples directory. C++ and Verilog sources for the simulations are available under targets/verilator.

Running Examples

A C++ compiler, Python 3 and Verilator are needed.

To compile the Verilator simulation:

make simulation

To run examples:

./obj_dir/Vtop game-of-life
./obj_dir/Vtop wave-equation
./obj_dir/Vtop heat-equation
./obj_dir/Vtop recursion

Instruction Format (16 Bits)

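R-Type

| Opcode | Target Register | First Register | Second Register |
|--------|-----------------|----------------|-----------------|
| 4 Bits | 4 Bits          | 4 Bits         | 4 Bits          |

I-Type

| Opcode | Target Register | Immediate |
|--------|-----------------|-----------|
| 4 Bits | 4 Bits          | 8 Bits    |

B-Type

| Opcode | Condition Register | Relative Branch Address (Word) |
|--------|--------------------|--------------------------------|
| 4 Bits | 4 Bits             | 8 Bits                         |

J-Type

| Opcode | Jump Address (Word) |
|--------|---------------------|
| 4 Bits | 12 Bits             |

E-Type

| Opcode | Function Code | First Register | Second Register |
|--------|---------------|----------------|-----------------|
| 4 Bits | 4 Bits        | 4 Bits         | 4 Bits          |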
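To make the field widths concrete, here is a minimal Python sketch that packs fields into a 16-bit word. It assumes that fields are placed in the order listed above, with the opcode in the most significant bits. That bit order is an assumption of this sketch and is not confirmed by this README. The helper names (`pack`, `r_type` and so on) are hypothetical, and the B-type is left out because the encoding of the relative address is not described here. Opcode and register numbers come from the tables in the Registers and Instruction Set Architecture sections.

```python
def pack(*fields):
    """Pack (value, width) pairs into one word, first field in the top bits."""
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def r_type(opcode, target, first, second):
    return pack((opcode, 4), (target, 4), (first, 4), (second, 4))


def i_type(opcode, target, immediate):
    return pack((opcode, 4), (target, 4), (immediate, 8))


def j_type(opcode, address):
    return pack((opcode, 4), (address, 12))


def e_type(function, first, second):
    # Opcode 15 selects the extension instructions.
    return pack((15, 4), (function, 4), (first, 4), (second, 4))


# add r1,r2,r3 : opcode 2, registers 1, 2, 3
print(f"{r_type(2, 1, 2, 3):04x}")
# li r1,100 : opcode 0, register 1, immediate 100
print(f"{i_type(0, 1, 100):04x}")
# fix r1,r2 : function code 0, registers 1, 2
print(f"{e_type(0, 1, 2):04x}")
```

Every format adds up to exactly 16 bits, so each helper returns a value that fits in one instruction word.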

Registers

| Value | Name | Description                                         |
|-------|------|-----------------------------------------------------|
| 0     | rs   | Neighbor cores can read this through x-, x+, y-, y+ |
| 1     | r1   | General purpose register                            |
| 2     | r2   | General purpose register                            |
| 3     | r3   | General purpose register                            |
| 4     | r4   | General purpose register                            |
| 5     | r5   | General purpose register                            |
| 6     | r6   | General purpose register                            |
| 7     | r7   | General purpose register                            |
| 8     | r8   | General purpose register                            |


These are general purpose registers often used for math and logic operations.

| Value | Name | Description                       |
|-------|------|-----------------------------------|
| 9     | zero | Reading always yields zero        |
| 10    | x    | x coordinate of the core itself   |
| 11    | y    | y coordinate of the core itself   |
| 12    | x-   | State of the core on the left     |
| 13    | x+   | State of the core on the right    |
| 14    | y-   | State of the core below           |
| 15    | y+   | State of the core above           |


The registers in the second table behave differently from the previous ones. The registers 'zero', 'x', 'y', 'x-', 'x+', 'y-' and 'y+' are read-only, and trying to write to them is undefined behavior.

| Value | Name      | Description                            |
|-------|-----------|----------------------------------------|
| 9     | video     | Writing displays value                 |
| 10    | precision | Number of bits for the fractional part |


The registers 'video' and 'precision' are write-only, and trying to read their existing values is undefined behavior.

Writing to the 'video' register displays the given value at the x and y coordinates of the core.

Writing to the 'precision' register sets the number of bits used for storing fractional part of real numbers in the fixed point format. The instructions that operate on fixed point numbers (fmul, fix, unfix) use this internally stored register in computations.

Design

There is a single address bus and a single data bus for the instruction memory, so only one instruction can be fetched per cycle. To achieve parallel execution, all cores must execute the same instruction at the same time. This is similar to GPUs, but the entire grid of cores behaves as a single warp/wavefront, so individual cores cannot truly branch on their own. To support branching when it is absolutely needed, both the if and else portions of a conditional statement are executed while the cores that did not take a given portion stay inactive, which costs performance. In short, conditional branches that depend on cell indices are supported but should be avoided whenever possible. There is no branch penalty when all cores follow the same path, since there is no pipelining.

There is no random access data memory. The entire data memory is limited to the registers in the cores, and in a single cycle a core can only access its own registers and the shared registers of its immediate neighbors. If a particular core needs more memory, it can reach the registers of distant cores over multiple cycles, so the cost of a memory access increases linearly with distance. This method also keeps the processor busy shifting grid contents rather than doing computation. Adding a large data RAM to the design is still not preferred, as the result would be more like a GPU than a cellular automaton processor.
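The linear cost of reaching a distant core can be illustrated with a short Python model. This is only a sketch: it treats one "shift" as every core copying the shared register of its left neighbor, and it counts shifts rather than real instructions or cycles. The function names are hypothetical.

```python
def shift_right(rs):
    """Every core copies the shared register of its left neighbor (x-)."""
    return [[row[(x - 1) % len(row)] for x in range(len(row))] for row in rs]


def read_from_left(rs, distance):
    """Bring the value of the core `distance` columns to the left to every core.

    Returns the shifted grid and the number of lockstep shift steps used.
    """
    steps = 0
    for _ in range(distance):
        rs = shift_right(rs)
        steps += 1
    return rs, steps


# Each cell starts with a value that encodes its own position.
grid = [[10 * y + x for x in range(6)] for y in range(3)]
shifted, steps = read_from_left(grid, 3)
print(shifted[1][4], steps)
```

All cores take part in every shift, which is why a long-distance read occupies the whole grid and not just the core that wanted the value.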

A small call stack memory is present in the global control unit. This allows programs to use function calls and recursion when needed. The current maximum recursion depth is 32 calls. Programs cannot directly read or write this stack memory through the ISA, so it cannot be used to store function arguments or local variables.

The grid of cores is a toroidal array (for example, the leftmost cores read the rightmost cores' values when they access x-). This is useful for two reasons. First, the example programs (especially the PDE solvers) work with this arrangement without any special care for boundaries, which would otherwise require linear extrapolation and branch divergence. Second, the processor can be used to compute on a 1D grid. The only operation needed is to shift the grid down once and then access x- in the first cell of each row. This yields the last value of the row above, connecting all rows into a one-dimensional chain.
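The one-dimensional trick described above can be sketched in Python as follows. The sketch assumes that rows are indexed so that row y + 1 is "above" row y and that "shifting down" means every core copies the value of the core above it. Both are assumptions made for illustration, and the function names are hypothetical.

```python
def shift_down(rs):
    """Every core copies the shared register of the core above it (y+).

    Rows are indexed so that row y + 1 is "above" row y; this orientation
    is an assumption of the sketch.
    """
    h = len(rs)
    return [list(rs[(y + 1) % h]) for y in range(h)]


def read_x_minus(rs, x, y):
    """Value that the core at (x, y) sees in x- on the toroidal grid."""
    return rs[y][(x - 1) % len(rs[y])]


W, H = 4, 3
original = [[f"r{y}c{x}" for x in range(W)] for y in range(H)]
shifted = shift_down(original)

# After the shift, the first cell of each row reads x- and wraps around
# to the last column, which now holds the last value of the row above.
previous = [read_x_minus(shifted, 0, y) for y in range(H)]
print(previous)
```

Because of the wrap-around in both directions, the topmost row links back to the bottom row, so the chain of rows forms a ring.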

Instruction Set Architecture

There are currently 17 instructions. Only 16 distinct opcodes are available, so the last opcode value, 15, selects the extension instructions (E-type), which are distinguished by their function code.

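| Opcode | Mnemonic | Name                   | Type | Example           |
|--------|----------|------------------------|------|-------------------|
| 0      | li       | Load Immediate         | I    | li r1,100         |
| 1      | unl      | Unless                 | B    | unl r1,elseLabel  |
| 2      | add      | Add                    | R    | add r1,r2,r3      |
| 3      | sub      | Subtract               | R    | sub r1,r2,r3      |
| 4      | and      | Logical AND            | R    | and r1,r2,r3      |
| 5      | or       | Logical OR             | R    | or r1,r2,r3       |
| 6      | nor      | Logical NOR            | R    | nor r1,r2,r3      |
| 7      | seq      | Set if equal           | R    | seq r1,r2,zero    |
| 8      | slt      | Set if less than       | R    | slt r1,r2,r3      |
| 9      | mul      | Multiply               | R    | mul r1,r1,r2      |
| 10     | shr      | Arithmetic Right Shift | R    | shr r1,r1,r2      |
| 11     | fmul     | Fixed point multiply   | R    | fmul r1,r2,r3     |
| 12     | j        | Jump                   | J    | j label           |
| 13     | call     | Function call          | J    | call func         |
| 14     | ret      | Function return        | J    | ret 0             |
| 15     | -        | Extension              | E    | -                 |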

Extension Instructions

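| Function Code | Mnemonic | Name                           | Example     |
|---------------|----------|--------------------------------|-------------|
| 0             | fix      | Convert integer to fixed point | fix r1,r2   |
| 1             | unfix    | Convert fixed point to integer | unfix r1,r2 |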

Most instructions are trivial. There are four instructions that require some explanation.

The 'unl' (unless) instruction is used for conditional branches. It is a B-type instruction that jumps to the given label if the condition register is zero. It is intended for implementing the if statements of high-level languages. For example, the following code

video[threadIdx.y][threadIdx.x] = 0;

int cond1 = (threadIdx.x == 3);
int cond2 = (threadIdx.y == 4);
int condition = cond1 && cond2;
if (condition) {
  video[threadIdx.y][threadIdx.x] = 100;
}

would be compiled into

li video,0       # video[threadIdx.y][threadIdx.x] = 0;

li r1,3
seq r1,x,r1      # int cond1 = (threadIdx.x == 3);

li r2,4 
seq r2,y,r2      # int cond2 = (threadIdx.y == 4);

and r1,r1,r2     # int condition = cond1 && cond2;
unl r1,else      # if (condition) {
  li video,100   #   video[threadIdx.y][threadIdx.x] = 100;
else:            # }

Since r1 is true only for the core at (x=3, y=4), all other cores wait until that core has executed the instruction in the if body. All cores then continue with the rest of the program. If r1 were false for all cores, no wait would occur, and execution would continue directly at the 'else' label.
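The lockstep behavior of this example can be modeled in a few lines of Python. This is a rough sketch rather than a description of the hardware: the 8x8 grid size is hypothetical, and the "inactive" cores are modeled simply by leaving them out of the list of cores that execute the if body.

```python
W, H = 8, 8   # hypothetical grid size for this sketch


def run_example():
    video = [[None] * W for _ in range(H)]
    cores = [(x, y) for y in range(H) for x in range(W)]

    # li video,0 : every core executes this
    for x, y in cores:
        video[y][x] = 0

    # li / seq / li / seq / and : each core computes its own condition in r1
    r1 = {(x, y): int(x == 3) & int(y == 4) for x, y in cores}

    # unl r1,else : cores whose r1 is zero stay inactive until 'else'
    active = [c for c in cores if r1[c] != 0]
    body_executed = bool(active)   # the body is skipped only if no core needs it
    if body_executed:
        for x, y in active:        # li video,100 on the active cores only
            video[y][x] = 100

    # else: all cores are active again from here on
    return video, body_executed


video, body_executed = run_example()
print(video[4][3], body_executed)
```

Only one core writes 100, but the time spent in the if body is paid by the whole grid.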

The 'fmul' instruction interprets values given in the source registers as fixed point numbers, multiplies them and writes the product to the target register in the same fixed point format. It respects the value of the internal 'precision' register which determines the number of bits used for the fractional part.

The 'fix' instruction converts the integer value given in the second register to a fixed point value and writes the result to the first register. It uses the value of the internal 'precision' register.

The 'unfix' instruction converts the fixed point value in the second register to an integer value and writes the result to the first register. The fractional part is lost, so executing 'fix' after 'unfix' may not reproduce the original value. This instruction also uses the value of the internal 'precision' register.
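The three fixed point instructions can be summarized with a small Python model. This is a sketch under stated assumptions: values are plain Python integers, so register width and overflow are not modeled, and the fractional bits are assumed to be dropped by an arithmetic right shift. The hardware's exact rounding behavior is not described here.

```python
def fix(value, precision):
    """Integer to fixed point: scale by 2**precision."""
    return value << precision


def unfix(value, precision):
    """Fixed point to integer: the fractional bits are discarded."""
    return value >> precision


def fmul(a, b, precision):
    """Fixed point multiply: the raw product carries 2 * precision
    fractional bits, so it is shifted back by precision."""
    return (a * b) >> precision


P = 8                                  # value written to 'precision'
a = fix(3, P) + (1 << (P - 1))         # 3.5 in fixed point
b = fix(2, P)                          # 2.0 in fixed point

product = fmul(a, b, P)                # 7.0 in fixed point
print(product == fix(7, P))

# The fractional part is lost by unfix, so fix does not restore it.
print(unfix(a, P), fix(unfix(a, P), P) == a)
```

With P = 8, the value 3.5 is stored as 896, and the round trip through 'unfix' and 'fix' gives 768, which is 3.0.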

LICENSE

The assembler source and example programs are released under the GNU General Public License v3.0 or any later version.

The HDL and C++ sources are released under the Strongly Reciprocal CERN Open Hardware License Version 2 (CERN-OHL-S-2.0).
