matrix-multiplication intel-intrinsics naive-algorithm strassen-multiplication

Matrix Multiplication Optimization

Project done as part of LLNL HPC Cluster Engineer Academy

This project analysized matrix multiplication algorithms in an HPC environment. Analysis was helped by Intel MSRs, enabled by LLNL msr-safe kernel module and interface. Intel Intrinsics SSE 4.1 instruction libraries were used to help optimize.

Environment

Experiment was developed for and executed on Intel Xeon E5-2670 @ 2.6 Hz. Node had two sockets, each with a cpu with 8 cores and 2 threads per core.

Generating Test Matrices

Standard test matrices are made and written to files by R to be used as inputs and known correct answers for checking correctness of algorithm. To generate test matrices, type:

Rscript generate_test_cases.R 256

where 256 is the size of the matrix

Naive Matrix Multiply

To make matrix multiplication without Intel Intrinsics libraries, type:

make naive

To run matrix multiplication without Intel Intrinsics, type:

make runnaive N=256 THREADS=8

where N is the size of the matrix NxN and THREADS is the number of threads to run with

To make naive algorithm with Intel Intrinsics libraries, type:

make naive_intrinsics

To run Intel Intrinsics naive algorithm, type:

make runnaive_intrinsics N=256 THREADS=8

where N is the size of the matrix NxN and THREADS is the number of threads to run with

To make naive algorithm with single-precision floating point numbers instead of double-precision, type:

make naive_float

To run naive algorithm with single-precision floating points, type:

make runnaive_float N=256 THREADS=8

where N is the size of the matrix NxN and THREADS is the number of threads to run with. This may result in errors due to a lack of precision.

Strassen Matrix Multiply

To make strassen multiplication, type:

make strassen

To run the strassen multiplication algorithm, type:

make runstrassen N=256

where N is the size of the matrix NxN.

Naive Analysis

On the initial unoptimized matrix multiplication, as the matrix size increases, the instructions per cycle decreases dramaticaly

At the same time, as the matrix size increase, the cache misses per instruction increases.

Therefore, it seems like as the matrix size increases, the function becomes increasingly more memory bound

To solve this, read and saved 6 values ahead in order to have faster and less reads to RAM, instead pulling these values into cache. This dramatically speed up the time.

Additionally, I did experiments with Intel Intrinsics SSE instructions. Because of the size of the cache, it seemed like more than just 6 values could be saved for fast access. In order to get around whether or not the compiler would unroll the loops, used generate_code.py file to generate multiply functions.

Focusing on the beginning of the graph, we can see the the most consistently faster amount of entries saved is 8, which seems significantly smaller than the size of cache, which merits some more investigation.

Strassen Analysis

Strassen's matrix multiplication algorithm is a recursive matrix multiplication algorithm which trades less multiplications for more addition and subtraction operations.

Initially, I had the base case at N=1. Compared to the naive algorithm, the strassen lagged behind the naive until the matrices got bigger than N=4096 at which point the unoptimized single threaded strassen was faster than the single threaded naive.

With both axes taken to log base 2

To optimize, I took the base case out to N=2 and used Intel Intrinsics to the base multiplication and used Intrinsics to make the intermediate matrices, which dramatically lowered the execution time.

With time natural log

About

Summer internship 2020, LLNL HPCCEA. Used Intel MSRs to evaluate and optimize different matrix multiplication algorithms.

matrix-multiplication intel-intrinsics naive-algorithm strassen-multiplication

Languages

Language:C 96.8%Language:Python 2.0%Language:Shell 0.5%Language:Makefile 0.5%Language:C++ 0.1%Language:R 0.1%