AlgEng - Matrix Multiplication

This C++ project is build as a part of the Algorithm Engineering course at Aarhus University. It includes different includes implementations of different alrogithms for multiplying matrices.

Algorithms

The project contains the following algorithms.

Algorithm	Description
Naive	This algorithm is simply three simple for-loops iterating through the two matrices.
Naive:x	This algorithm is equal to Naive but usees OpenMP to parallelize
Naive:T	This algorithm takes advantage of transposing B in A * B before multiplying the two to lower cache misses from $n^3$ to $\frac{n^3}{cache-line-size}$.
Naive:T:x	This algorithm is equal to Naive:T but uses OpenMP to parallelize
Obl	This algorithm uses a cache oblivious approach to lower the cache misses by keeping on recursing on smaller sub problems.
Obl:x	This algorithm uses the same recursion as Obl but stops when the problem size is below a threshold of size x where it switches to the Naive algorihm.
Obl:T:x	This algorithm combines the benefits of Naive:T and Obl:x by transposing B and recursing until some threshold x and then switching to Naive:T.
Tile:x	This algorithm uses a tile based approach where it divides A and B into tiles that fits into cache and then calculates every sub problem using Naive. This algorithm is thus a cache aware algorithm.
Tile:T:x	This algorithm uses the same algorithm as Tile:x but uses Naive:T for the sub problems.

Building

To build the project make a folder called output in the root of the project and run cmake and make to build it. Note this project depends on Intels PCM library as well as PAPI. Both are listed under dependencies.

The build will generate four executables:

testing: Used to run unittests of all the implementations.
datagen: Used to generate data for the benchmark programs.
benchmark: Benchmarks algorithms using the PAPI library
benchmark2: Benchmarks algorithms using the PCM library

Testing

The testing executable takes no arguments and runs all tests in the test folder

> ./testing

Datagen

The datagen executable takes four arguments and generates a datafile for two matrices A of size m x n and B og size n x p filled with random numbers between -1000 and 1000.

> ./datagen <m> <n> <p> <output-file>

Benchmark

The benchmark executable takes different arguments and benchmarks all the algorithms using PAPI.

Arguments:

-r:<refresh>: is an unsigned int telling how many refresh iterations to run before measuring. Default 2.
-l:<loop>: is an unsigned int telling how many iterations to measure and average over. Default 5.
-i:<input-file>: specifies a file generated by datagen to benchmark algorithms against. Multiple files can be specified.
-o:<output-file>: specifies the prefix of the resulting output files.

Example:

> sudo taskset -c 0 nice -n -20 ./benchmark -i:data/42_42_42.data -r:5 -l:10 -o:my_run

Note that all benchmark result data will be located in the output folder in the root of the project.

Benchmark2

The benchmark executable takes different arguments and benchmarks all the algorithms using PCM.

Arguments are the same as benchmark but with an additional

-c:<core>: specifies the core to measure events from (should be the same as set in taskset for linux).

Example:

> sudo taskset -c 0 nice -n -20 ./benchmark -i:data/42_42_42.data -r:5 -l:10 -o:my_run -c:0

Note that all benchmark result data will be located in the output folder in the root of the project.

Extra tools

The project includes print_result_cols.py that can be used to print a single column from the result files in the following way.

Arguments:

<col-idx>: 0 based index of the col to print from every file
<data-file>: One or more files to print data from.

> python print_result_cols.py 0 my_run.nai1.data my_run.obl.160.data

Will print the first column of my_run.nai1.data and my_run.obl.160.data in a table like style.

Dependencies

Catch test framework: Catch is included in the project and should work out of the box.
Intel Performance Counter Monitoring (PCM): For PCM to work it has to be cloned from its repository and build. Afterwards it can be linked in the CMakeLists.txt file in this project. Note the the project is setup to work with Intel® Core™ i7-5600U Processor and not guaranteed to work with others.
Performance API (PAPI): PAPI works only on Linux and it has to be installed such that /usr/local/lib/libpapi.a is present at this exact location. Installation guide can be found on the webpage.

fhvilshoj / alg_eng_matrix