preraks116 / spp-benchmarking


Software Programming for Performance

Know Your Computer

1.

Command used on Manjaro: cat /proc/cpuinfo

Link: https://ark.intel.com/content/www/us/en/ark/products/191075/intel-core-i59300h-processor-8m-cache-up-to-4-10-ghz.html

  • Model: Intel(R) Core(TM) i5-9300H CPU
  • Generation: 9
  • Frequency Range: 2.40GHz - 4.10GHz
  • Number of Cores: 4
  • Hyper Threading Availability: Yes
  • SIMD ISA: Intel® AVX2
  • Cache Size: 8192 KB
  • Maximum Main Memory Bandwidth: 41.8 GB/s

2.

To theoretically calculate peak FLOPS per core, the formula is as follows:

Per Core

FLOPS = cycles_per_second * flops_per_cycle
cycles_per_second = 4.10 GHz = 4.10 * 10^9
flops_per_cycle = 32 (for the Coffee Lake microarchitecture)
FLOPS = 131.2 * 10^9 = 131.2 GFLOPS

Per Processor

no_of_cores = 4
FLOPS_per_processor = 131.2 * 4 = 524.8 GFLOPS

Using the Whetstone benchmark program

Command

./whetstone 1000000

Output

Loops: 1000000, Iterations: 1, Duration: 15 sec.
C Converted Double Precision Whetstones: 6666.7 MIPS

Command

./whetstone 10000000

Output

Loops: 10000000, Iterations: 1, Duration: 103 sec.
C Converted Double Precision Whetstones: 9708.7 MIPS

Running another benchmark gives the following results

cd Flops/version3
bash compile_linux_gcc.sh
./2013-Haswell

The output is stored in flops_benchmark_output.txt

Writing a Benchmark to Calculate FLOPS

Approach 1

To measure FLOPS, we first need code that performs floating point operations, and then we measure its execution time.

void func1(){
    // perform floating point instructions
    double x = 1.5;
    double y = 2.5;
    // total number of floating point instructions performed = 200000000
    for (int i = 0; i < 100000000; i++) {
        x = x + y;
        y = x * y;
    }
}
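
A minimal timing harness around this function might look like the following sketch (the actual benchmark.c in the repo may differ; returning the result is one way to keep the optimizer from deleting the loop):

```c
#include <stdio.h>
#include <time.h>

// func1 from above, returning y so the loop is not optimized away
static double func1(void) {
    double x = 1.5, y = 2.5;
    for (int i = 0; i < 100000000; i++) {
        x = x + y;
        y = x * y;
    }
    return y;
}

int main(void) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    double result = func1();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    // 200000000 floating point operations were performed in `secs` seconds
    printf("result: %f\ntime: %f s\nFLOPS: %f GFLOPS\n",
           result, secs, 200000000.0 / secs / 1e9);
    return 0;
}
```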

Command

cd benchmark
gcc benchmark.c -O1
./a.out 

Output

----------------------------------------
func1
time: 0.509031 s
----------------------------------------
FLOPS: 3.929034 GFLOPS

As we can see, this benchmark is heavily unoptimized, for the following reasons:

  • no kind of parallelism is being used
  • it can be made more efficient through parallelization with OpenMP, or through vectorization

Approach 2: Vectorization

We can add the -O3 flag during compilation to vectorize our code, which speeds up execution. (Judging from the output below, this run performs 2*10^7 floating point operations, i.e. the loop count was reduced to 10000000, as in the next approach.)

Flags: -g -O3 -lm

Command

make
./benchmark

Output

time: 20.139000 ms
FLOPS: 0.993098 GFLOPS

Approach 3: Parallelization

We can optimize this code even more by parallelizing it. Since the operations are very simple, and we are only measuring FLOPS rather than the accuracy of the result, we can simply ask the compiler to ignore the loop dependencies entirely.

Flags: -g -O3 -mfma -fopenmp

#pragma omp parallel for
for (int i = 0; i < 10000000; i++) {
	x = x + y;
	y = x * y;
}

Output

time: 3.535000 ms
FLOPS: 22.657709 GFLOPS

The FLOPS that we are getting here is still quite low, most probably because each iteration of the loop depends on the previous one and every iteration writes to the same location.

Approach 4: Using vectors

Instead of scalar variables, we can perform the floating point operations on arrays. The function is defined as follows:

    #pragma omp parallel for
    for (long i = 0; i < 10000000; i++)
    {
        for (int j = 0; j < 8; j++)
        {
            Y[j] = a * X[j] + Y[j];
        }
    }

Output

time: 108.014000 ms
FLOPS: 185.161183 GFLOPS

As we can see, this method gives a significant increase in GFlops: approximately 8.2x over the previous approach (185.16 / 22.66).

3.

  • Main memory size: 15.5 GB
  • Memory type: DDR4

Stream Benchmark

Benchmarking using STREAM submodule

Compiler: icc

Command used

cd stream-benchmark/STREAM
make stream.icc
./stream.omp.AVX2.80M.20x.icc

Output

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 80000000 (elements), Offset = 0 (elements)
Memory per array = 610.4 MiB (= 0.6 GiB).
Total memory required = 1831.1 MiB (= 1.8 GiB).
Each kernel will be executed 20 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 56318 microseconds.
   (= 56318 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           28061.6     0.045808     0.045614     0.046849
Scale:          28122.0     0.045844     0.045516     0.049271
Add:            31367.5     0.061685     0.061210     0.062947
Triad:          31334.8     0.061489     0.061274     0.062979
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

Compiler: gcc

Command used

cd stream-benchmark/STREAM
make stream_c.exe
./stream_c.exe

Output

-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
 will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 8
Number of Threads counted = 8
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 7344 microseconds.
   (= 7344 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function    Best Rate MB/s  Avg time     Min time     Max time
Copy:           25576.0     0.006524     0.006256     0.008151
Scale:          18857.2     0.008594     0.008485     0.008924
Add:            21613.6     0.011240     0.011104     0.011505
Triad:          21177.1     0.011542     0.011333     0.012044
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------

Therefore, the main memory bandwidth measured by the STREAM benchmark comes out to roughly 25-31 GB/s, depending on the kernel and compiler.

To write our own memory benchmark, we have to perform memory accesses in a loop and calculate the number of bytes that are accessed in the time that it takes to execute the function.

Approach 1

Function

    for(int i = 0; i < 10000000; i++)
    {
        sum += X[i] - Y[i];
    }

Output

time: 24.430000 ms
sum: -118.511756
Memory Bandwidth: 6.549325

The number of bytes accessed would be $8 * \text{vector length} * 2$, since two vectors of doubles are each read once.
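
Putting it together, the bandwidth figure is just these bytes divided by the measured time; a self-contained sketch of the whole measurement (array sizes and initial values are assumptions):

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 10000000

int main(void) {
    double *X = malloc(N * sizeof(double));
    double *Y = malloc(N * sizeof(double));
    for (long i = 0; i < N; i++) { X[i] = 1.0; Y[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    double sum = 0.0;
    for (long i = 0; i < N; i++)
        sum += X[i] - Y[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    // two arrays of 8-byte doubles, each read once: 8 * N * 2 bytes total
    printf("sum: %f\nMemory Bandwidth: %f GB/s\n", sum, (8.0 * N * 2) / secs / 1e9);
    free(X);
    free(Y);
    return 0;
}
```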

But this is still unoptimized as this does not make use of vectorization or parallelization.

Approach 2: Vectorization

Flags: -g -O3 -lm

Output

time: 12.633000 ms
sum: -118.511756
Memory Bandwidth: 12.665242

Using this approach, the memory bandwidth we get has almost doubled and the execution time has halved, giving a speedup of approximately 2x.

Approach 3: Parallelization

Function

    #pragma omp parallel for
    for(int i = 0; i < 10000000; i++)
    {
        sum += X[i] - Y[i];
    }

Flags: -g -O3 -lm -mfma -fopenmp

Output

time: 7.127000 ms
sum: 430.665919
Memory Bandwidth: 22.449839

As we can see, using parallelization we get nearly double the bandwidth of the previous approach, and an overall speedup of approximately 3.4x over the initial approach.
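
Note that the printed sum differs from the earlier approaches: the unsynchronized sum += in the parallel loop is a data race. A race-free sketch using OpenMP's reduction clause (not the version that produced the numbers above) would be:

```c
// each thread accumulates a private partial sum, which OpenMP
// combines into `sum` when the parallel loop finishes
double sum = 0.0;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < 10000000; i++)
{
    sum += X[i] - Y[i];
}
```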

4.

Secondary Storage Device: HDD

  • Size: 1 TB
  • Average Read Rate: 222.4 MB/s
  • Average Write Rate: 69.8 MB/s
  • Average Access Time: 4.04 ms

Know your Cluster

  • ADA Peak FLOPs: 70.66 TFLOPS
  • Abacus Peak FLOPs: 14 TFLOPS

BLAS Problems

BLAS Level 1

BLAS

Command

cd blas-problems/q1
make
./q1 1
./q1 2
./q1 3
./q1 4
./q1 

The first method of writing the Level 1 functions is implemented in blas-problems/q1/q1.c.
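
For reference, naive single-precision versions of these Level 1 kernels look roughly like the following (a sketch; the function names are illustrative and the double-precision variants are analogous). The flop counts used later follow directly from these loops: N for SCAL, 2N for DOT and AXPY.

```c
// SCAL: x = alpha * x                      (1 flop per element -> N flops)
void my_sscal(int n, float alpha, float *x) {
    for (int i = 0; i < n; i++)
        x[i] *= alpha;
}

// DOT: sum over x[i] * y[i]                (multiply + add -> 2N flops)
float my_sdot(int n, const float *x, const float *y) {
    float sum = 0.0f;
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

// AXPY: y = alpha * x + y                  (multiply + add -> 2N flops)
void my_saxpy(int n, float alpha, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```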

Output (The time given here is in ms)

Approach 1: Without Optimizations

Compiler: icc

Flags: -g -axCORE-AVX2

Execution Time:

| Vector Length | sSCAL | dSCAL | sDOT | dDOT | sAXPY | dAXPY |
| --- | --- | --- | --- | --- | --- | --- |
| 20000000 | 5.683000 | 11.999000 | 18.366000 | 36.138000 | 31.115000 | 60.476000 |
| 40000000 | 11.785000 | 25.601000 | 35.906000 | 70.732000 | 59.858000 | 121.335000 |
| 60000000 | 20.011000 | 37.013000 | 52.773000 | 102.953000 | 91.307000 | 178.677000 |
| 80000000 | 25.380000 | 48.444000 | 70.426000 | 137.163000 | 120.422000 | 238.981000 |
| 100000000 | 30.644000 | 59.337000 | 86.987000 | 172.691000 | 149.039000 | 298.794000 |

GFlops:

| Vector Length | sSCAL | dSCAL | sDOT | dDOT | sAXPY | dAXPY |
| --- | --- | --- | --- | --- | --- | --- |
| 20000000 | 3.5193 | 1.6668 | 2.1779 | 1.1069 | 1.2856 | 0.6614 |
| 40000000 | 3.3941 | 1.5624 | 2.2280 | 1.1310 | 1.3365 | 0.6593 |
| 60000000 | 2.9984 | 1.6211 | 2.2739 | 1.1656 | 1.3142 | 0.6716 |
| 80000000 | 3.1521 | 1.6514 | 2.2719 | 1.1665 | 1.3287 | 0.6695 |
| 100000000 | 3.1521 | 1.6853 | 2.2992 | 1.1581 | 1.3419 | 0.6694 |

Compiler: gcc

Flags: -g -lm

Execution Time:

| Vector Length | sSCAL | dSCAL | sDOT | dDOT | sAXPY | dAXPY |
| --- | --- | --- | --- | --- | --- | --- |
| 20000000 | 46.826000 | 49.500000 | 61.455000 | 77.696000 | 78.032000 | 103.776000 |
| 40000000 | 93.332000 | 95.730000 | 121.595000 | 153.765000 | 158.216000 | 209.573000 |
| 60000000 | 138.300000 | 142.435000 | 179.921000 | 231.348000 | 230.364000 | 310.813000 |
| 80000000 | 183.993000 | 187.902000 | 237.832000 | 308.195000 | 306.761000 | 413.144000 |
| 100000000 | 226.088000 | 235.211000 | 299.763000 | 385.151000 | 383.734000 | 517.293000 |

GFlops:

| Vector Length | sSCAL | dSCAL | sDOT | dDOT | sAXPY | dAXPY |
| --- | --- | --- | --- | --- | --- | --- |
| 20000000 | 0.4271 | 0.4040 | 0.6509 | 0.5148 | 0.5126 | 0.3854 |
| 40000000 | 0.4286 | 0.4178 | 0.6579 | 0.5203 | 0.5056 | 0.3817 |
| 60000000 | 0.4338 | 0.4212 | 0.6670 | 0.5187 | 0.5209 | 0.3861 |
| 80000000 | 0.4348 | 0.4258 | 0.6727 | 0.5192 | 0.5216 | 0.3873 |
| 100000000 | 0.4423 | 0.4252 | 0.6672 | 0.5193 | 0.5212 | 0.3866 |

GCC VS ICC


Approach 2: With Optimizations (flags)

Compiler: icc

Flags: -g -O3 -axCORE-AVX2

Execution Time:

| Vector Length | sSCAL | dSCAL | sDOT | dDOT | sAXPY | dAXPY |
| --- | --- | --- | --- | --- | --- | --- |
| 20000000 | 5.869000 | 12.336000 | 18.123000 | 36.464000 | 30.512000 | 60.612000 |
| 40000000 | 12.054000 | 27.428000 | 36.330000 | 70.216000 | 60.362000 | 120.673000 |
| 60000000 | 20.224000 | 36.869000 | 52.101000 | 102.929000 | 89.885000 | 179.584000 |
| 80000000 | 26.866000 | 49.693000 | 69.725000 | 137.384000 | 119.536000 | 239.356000 |
| 100000000 | 32.710000 | 61.508000 | 86.079000 | 171.000000 | 149.342000 | 298.043000 |

GFlops:

| Vector Length | sSCAL | dSCAL | sDOT | dDOT | sAXPY | dAXPY |
| --- | --- | --- | --- | --- | --- | --- |
| 20000000 | 3.4077 | 3.4077 | 2.2071 | 1.097 | 1.311 | 0.6599 |
| 40000000 | 3.3184 | 3.3184 | 2.202 | 1.1393 | 1.3253 | 0.6629 |
| 60000000 | 2.9667 | 2.9668 | 2.3032 | 1.1659 | 1.335 | 0.6682 |
| 80000000 | 2.9777 | 2.9777 | 2.2947 | 1.1646 | 1.3385 | 0.6685 |
| 100000000 | 3.057 | 3.0572 | 2.3234 | 1.1696 | 1.3392 | 0.671 |

Compiler: gcc

Flags: -g -O3 -lm

Execution Time:

| Vector Length | sSCAL | dSCAL | sDOT | dDOT | sAXPY | dAXPY |
| --- | --- | --- | --- | --- | --- | --- |
| 20000000 | 6.377000 | 13.561000 | 34.853000 | 50.526000 | 32.928000 | 64.173000 |
| 40000000 | 12.148000 | 26.763000 | 69.398000 | 100.064000 | 61.586000 | 124.626000 |
| 60000000 | 19.758000 | 38.489000 | 103.804000 | 150.921000 | 93.030000 | 185.809000 |
| 80000000 | 30.590000 | 50.709000 | 138.987000 | 200.757000 | 125.574000 | 249.614000 |
| 100000000 | 33.822000 | 65.588000 | 174.219000 | 250.510000 | 154.668000 | 311.347000 |

GFlops:

| Vector Length | sSCAL | dSCAL | sDOT | dDOT | sAXPY | dAXPY |
| --- | --- | --- | --- | --- | --- | --- |
| 20000000 | 3.1363 | 1.4748 | 1.1477 | 0.7917 | 1.2148 | 0.6233 |
| 40000000 | 3.2927 | 1.4946 | 1.1528 | 0.7995 | 1.299 | 0.6419 |
| 60000000 | 3.0367 | 1.5589 | 1.156 | 0.7951 | 1.2899 | 0.6458 |
| 80000000 | 2.6152 | 1.5776 | 1.1512 | 0.797 | 1.2741 | 0.641 |
| 100000000 | 2.9567 | 1.5247 | 1.148 | 0.7984 | 1.2931 | 0.6424 |

GCC VS ICC

BLIS

BLIS has been installed at ~/blis on my system, and the makefiles have been updated accordingly to link all the required files.
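
BLIS ships a BLAS/CBLAS compatibility layer, so (assuming it was built with CBLAS support enabled) the same kernels can be exercised through the standard CBLAS interface and linked against the libblis under ~/blis; a minimal sketch:

```c
#include <stdio.h>
#include <cblas.h>  // header provided by BLIS's CBLAS compatibility layer

int main(void) {
    float x[4] = {1, 2, 3, 4};
    float y[4] = {4, 3, 2, 1};

    cblas_sscal(4, 0.5f, x, 1);           // x = 0.5 * x
    cblas_saxpy(4, 2.0f, x, 1, y, 1);     // y = 2 * x + y
    float d = cblas_sdot(4, x, 1, y, 1);  // dot(x, y)

    printf("dot = %f\n", d);
    return 0;
}
```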

Command

cd blas-problems/blis/q1
make
./q1.x 1
./q1.x 2
./q1.x 3
./q1.x 4
./q1.x

Output

Execution Time:

| Vector Length | sSCAL | dSCAL | sDOT | dDOT | sAXPY | dAXPY |
| --- | --- | --- | --- | --- | --- | --- |
| 20000000 | 5.714000 | 12.136000 | 9.010000 | 14.820000 | 10.373000 | 21.235000 |
| 40000000 | 11.926000 | 30.753000 | 14.807000 | 31.321000 | 18.133000 | 39.392000 |
| 60000000 | 20.825000 | 39.367000 | 21.812000 | 41.456000 | 28.326000 | 55.455000 |
| 80000000 | 26.899000 | 50.724000 | 28.183000 | 54.002000 | 36.504000 | 71.568000 |
| 100000000 | 32.963000 | 64.747000 | 35.517000 | 67.153000 | 45.323000 | 88.239000 |

GFlops:

| Vector Length | sSCAL | dSCAL | sDOT | dDOT | sAXPY | dAXPY |
| --- | --- | --- | --- | --- | --- | --- |
| 20000000 | 3.5002 | 1.648 | 4.4395 | 2.6991 | 3.8562 | 1.8837 |
| 40000000 | 3.354 | 1.3007 | 5.4029 | 2.5542 | 4.4118 | 2.0309 |
| 60000000 | 2.8812 | 1.5241 | 5.5016 | 2.8946 | 4.2364 | 2.1639 |
| 80000000 | 2.9741 | 1.5772 | 5.6772 | 2.9629 | 4.3831 | 2.2356 |
| 100000000 | 3.0337 | 1.5445 | 5.6311 | 2.9783 | 4.4128 | 2.2666 |

GCC VS BLIS


ICC VS BLIS


Operational Intensity

$$ OI = \frac{\text{number of operations}}{\text{number of bytes}} $$

  1. sSCAL

    The number of bytes involved in the operation will be 4 multiplied by the length of the vector, because the data type is float. For scaling, one operation is applied per element of the vector, making the total number of operations equal to N.

    $$ \therefore \ OI = \frac{N}{N*4} = 0.25 $$

  2. dSCAL

    The number of bytes involved in the operation will be 8 multiplied by the length of the vector, because the data type is double. For scaling, one operation is applied per element of the vector, making the total number of operations equal to N. $$ \therefore \ OI = \frac{N}{N*8} = 0.125 $$

  3. sDOT

    The number of operations in the dot product will be 2*N, as each pair of corresponding elements is multiplied and the product is added to an accumulator. The total number of bytes involved will be 2*N*4, as two vectors are needed for the dot product and each number is of size 4 bytes (float). $$ OI = \frac{2N}{2N*4} = 0.25 $$

  4. dDOT

    The number of operations in the dot product will be 2*N, as each pair of corresponding elements is multiplied and the product is added to an accumulator. The total number of bytes involved will be 2*N*8, as two vectors are needed for the dot product and each number is of size 8 bytes (double). $$ OI = \frac{2N}{2N*8} = 0.125 $$

  5. sAXPY

    The number of operations in AXPY will be 2*N, as each element of X is multiplied by the scalar and the product is added to the corresponding element of Y. The total number of bytes involved will be 2*N*4, as two vectors are needed for AXPY and each number is of size 4 bytes (float). $$ OI = \frac{2N}{2N*4} = 0.25 $$

  6. dAXPY

    The number of operations in AXPY will be 2*N, as each element of X is multiplied by the scalar and the product is added to the corresponding element of Y. The total number of bytes involved will be 2*N*8, as two vectors are needed for AXPY and each number is of size 8 bytes (double). $$ OI = \frac{2N}{2N*8} = 0.125 $$

Baseline and Best Execution Times (in ms)

| Function | Baseline Execution Time | Best Execution Time |
| --- | --- | --- |
| sSCAL | 226.088000 | 30.644000 |
| dSCAL | 235.211000 | 59.337000 |
| sDOT | 299.763000 | 86.079000 |
| dDOT | 385.151000 | 171.000000 |
| sAXPY | 383.734000 | 149.039000 |
| dAXPY | 517.293000 | 298.043000 |

Speedup

Using the data for vector length = 100000000 $$ \text{Speedup} = \frac{\text{Baseline Execution Time}}{\text{Best Execution Time}} $$

  1. sSCAL $$ \text{Speedup} = \frac{226.088}{30.644} = 7.37888 $$

  2. dSCAL $$ \text{Speedup} = \frac{235.211}{59.337} = 3.9639 $$

  3. sDOT $$ \text{Speedup} = \frac{299.763}{86.079} = 3.48241 $$

  4. dDOT $$ \text{Speedup} = \frac{385.151}{171} = 2.252345 $$

  5. sAXPY $$ \text{Speedup} = \frac{383.734}{149.039} = 2.574722 $$

  6. dAXPY $$ \text{Speedup} = \frac{517.293}{298.043} = 1.735632 $$

Baseline and Optimized GFlops

Using the data for vector length = 100000000 $$ \text{GFlops} = \frac{\text{Number of floating point operations}}{\text{Execution Time (in ms)}*10^6} $$

For sSCAL and dSCAL, $$ \text{GFlops} = \frac{N}{Time} $$ and for the rest of the functions $$ \text{GFlops} = \frac{2*N}{Time} $$

| | sSCAL | dSCAL | sDOT | dDOT | sAXPY | dAXPY |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline GFlops | 0.4423 | 0.4251 | 0.66719 | 0.51927 | 0.52119 | 0.38662 |
| Optimized GFlops | 3.152 | 3.0572 | 5.6311 | 2.9783 | 4.4128 | 2.2666 |

Is the problem memory bound or compute bound?

The problem is compute (CPU) bound, as the memory bandwidth that we are getting is much lower than the achievable bandwidth.

Memory BandWidth

$$ \text{Memory Bandwidth} = \frac{\text{Number of bytes accessed}}{\text{time}} $$

For sSCAL, $$ \text{Memory Bandwidth} = \frac{4N}{\text{Time}} $$

For dSCAL, $$ \text{Memory Bandwidth} = \frac{8N}{\text{Time}} $$

For sDOT, $$ \text{Memory Bandwidth} = \frac{2 \times 4N}{\text{Time}} $$

For dDOT, $$ \text{Memory Bandwidth} = \frac{2 \times 8N}{\text{Time}} $$

For sAXPY, $$ \text{Memory Bandwidth} = \frac{2 \times 4N}{\text{Time}} $$

For dAXPY, $$ \text{Memory Bandwidth} = \frac{2 \times 8N}{\text{Time}} $$

The table below uses the data for vector length = 100000000 from the BLIS implementation, since it is the most optimized.

| Function | Memory Bandwidth (GB/s) |
| --- | --- |
| sSCAL | 12.1348 |
| dSCAL | 12.3557 |
| sDOT | 22.5244 |
| dDOT | 23.8261 |
| sAXPY | 17.651 |
| dAXPY | 18.1325 |
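
As a check, the sSCAL entry follows directly from the BLIS timing above (32.963 ms for N = 10^8 floats):

$$ \text{Memory Bandwidth}_{sSCAL} = \frac{4N}{\text{Time}} = \frac{4 \times 10^8 \text{ bytes}}{0.032963 \text{ s}} \approx 12.13 \text{ GB/s} $$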

BLAS Level 2

The vector is assumed to be a column vector, which makes the function easier to write: we can directly multiply the matrix with the vector without having to calculate the transpose of the matrix.

BLAS

Command

cd blas-problems/q2/cblas_sgemv
make
./q2 1
./q2 2
./q2 3
./q2 4
./q2 5

and

cd blas-problems/q2/cblas_dgemv
make
./q2 1
./q2 2
./q2 3
./q2 4
./q2 5

Approach 1: Without Optimizations

Compiler: icc

Flags: -g -axCORE-AVX2

Execution Time:

| Rows | Columns | sGEMV | dGEMV |
| --- | --- | --- | --- |
| 2000 | 4000 | 1.045000 | 2.003000 |
| 4000 | 6000 | 3.712000 | 6.952000 |
| 6000 | 8000 | 8.011000 | 15.216000 |
| 8000 | 10000 | 13.690000 | 26.755000 |
| 10000 | 12000 | 21.215000 | 41.746000 |

GFlops:

| Rows | Columns | sGEMV | dGEMV |
| --- | --- | --- | --- |
| 2000 | 4000 | 22.9722 | 11.985 |
| 4000 | 6000 | 19.3998 | 10.3585 |
| 6000 | 8000 | 17.9775 | 9.4649 |
| 8000 | 10000 | 17.5328 | 8.9712 |
| 10000 | 12000 | 16.9705 | 8.6243 |

Compiler: gcc

Flags: -g -lm

Execution Time:

Note: I am very skeptical of the data for gcc here.

| Rows | Columns | sGEMV | dGEMV |
| --- | --- | --- | --- |
| 2000 | 4000 | 24.295000 | 25.143000 |
| 4000 | 6000 | 74.686000 | 75.594000 |
| 6000 | 8000 | 149.277000 | 155.780000 |
| 8000 | 10000 | 249.542000 | 257.111000 |
| 10000 | 12000 | 369.120000 | 381.355000 |

GFlops:

| Rows | Columns | sGEMV | dGEMV |
| --- | --- | --- | --- |
| 2000 | 4000 | 0.9879 | 0.9546 |
| 4000 | 6000 | 0.9641 | 0.9525 |
| 6000 | 8000 | 0.9647 | 0.9244 |
| 8000 | 10000 | 0.9618 | 0.9335 |
| 10000 | 12000 | 0.9753 | 0.944 |


Approach 2: Vectorization using -O3

Compiler: icc

Flags: -g -axCORE-AVX2 -O3

Execution Time:

| Rows | Columns | sGEMV | dGEMV |
| --- | --- | --- | --- |
| 2000 | 4000 | 1.104000 | 1.823000 |
| 4000 | 6000 | 3.272000 | 6.651000 |
| 6000 | 8000 | 7.040000 | 14.259000 |
| 8000 | 10000 | 12.097000 | 24.133000 |
| 10000 | 12000 | 18.616000 | 36.436000 |

GFlops:

| Rows | Columns | sGEMV | dGEMV |
| --- | --- | --- | --- |
| 2000 | 4000 | 21.7409 | 13.1662 |
| 4000 | 6000 | 22.0061 | 10.826 |
| 6000 | 8000 | 20.4554 | 10.0993 |
| 8000 | 10000 | 19.8403 | 9.9452 |
| 10000 | 12000 | 19.3387 | 9.8806 |

Compiler: gcc

Flags: -g -O3 -lm

Execution Time:

| Rows | Columns | sGEMV | dGEMV |
| --- | --- | --- | --- |
| 2000 | 4000 | 8.149000 | 8.225000 |
| 4000 | 6000 | 24.035000 | 25.281000 |
| 6000 | 8000 | 48.436000 | 51.744000 |
| 8000 | 10000 | 82.096000 | 86.864000 |
| 10000 | 12000 | 124.460000 | 130.238000 |

GFlops:

| Rows | Columns | sGEMV | dGEMV |
| --- | --- | --- | --- |
| 2000 | 4000 | 2.9454 | 2.9182 |
| 4000 | 6000 | 2.9958 | 2.8481 |
| 6000 | 8000 | 2.9731 | 2.783 |
| 8000 | 10000 | 2.9235 | 2.763 |
| 10000 | 12000 | 2.8926 | 2.7642 |


Approach 3: Parallelization using OpenMP

To facilitate parallelization, a temp variable is defined separately so that Y is not accessed in the inner for loop; this removes the loop dependency and allows the outer loop to be parallelized.

#pragma omp parallel for
for(int i = 0; i < M; i++)
{
    double temp = 0.0;
    Y[i*incY] = beta * Y[i*incY];
    for(int j = 0; j < N; j++)
    {
        temp += alpha * A[i*lda+j] * X[j*incX];
    }
    Y[i*incY] += temp;
}

Compiler: icc

Flags: -g -axCORE-AVX2 -O3 -qopenmp

Execution Time:

| Rows | Columns | sGEMV | dGEMV |
| --- | --- | --- | --- |
| 2000 | 4000 | 1.872000 | 2.402000 |
| 4000 | 6000 | 3.450000 | 5.746000 |
| 6000 | 8000 | 6.038000 | 11.031000 |
| 8000 | 10000 | 10.227000 | 19.049000 |
| 10000 | 12000 | 14.853000 | 27.858000 |

GFlops:

| Rows | Columns | sGEMV | dGEMV |
| --- | --- | --- | --- |
| 2000 | 4000 | 12.8216 | 9.9925 |
| 4000 | 6000 | 20.8707 | 12.5312 |
| 6000 | 8000 | 23.85 | 13.0547 |
| 8000 | 10000 | 23.4681 | 12.5995 |
| 10000 | 12000 | 24.2382 | 12.923 |

Compiler: gcc

Flags: -g -O3 -lm -fopenmp

Execution Time:

| Rows | Columns | sGEMV | dGEMV |
| --- | --- | --- | --- |
| 2000 | 4000 | 1.208000 | 1.530000 |
| 4000 | 6000 | 3.467000 | 4.771000 |
| 6000 | 8000 | 6.805000 | 10.778000 |
| 8000 | 10000 | 11.866000 | 17.777000 |
| 10000 | 12000 | 17.092000 | 27.520000 |

GFlops:

| Rows | Columns | sGEMV | dGEMV |
| --- | --- | --- | --- |
| 2000 | 4000 | 19.8692 | 15.6876 |
| 4000 | 6000 | 20.7684 | 15.092 |
| 6000 | 8000 | 21.1618 | 13.3611 |
| 8000 | 10000 | 20.2265 | 13.501 |
| 10000 | 12000 | 21.0631 | 13.0818 |


BLIS

Command

cd blas-problems/blis/q2
make
./q1.x 1
./q1.x 2
./q1.x 3
./q1.x 4
./q1.x 5

Execution Time:

| Rows | Columns | sGEMV | dGEMV |
| --- | --- | --- | --- |
| 2000 | 4000 | 1.384000 | 2.449000 |
| 4000 | 6000 | 4.267000 | 7.365000 |
| 6000 | 8000 | 7.819000 | 14.861000 |
| 8000 | 10000 | 12.805000 | 27.344000 |
| 10000 | 12000 | 19.069000 | 37.570000 |

GFlops:

| Rows | Columns | sGEMV | dGEMV |
| --- | --- | --- | --- |
| 2000 | 4000 | 17.3425 | 9.8007 |
| 4000 | 6000 | 16.8746 | 9.7765 |
| 6000 | 8000 | 18.4174 | 9.6902 |
| 8000 | 10000 | 18.7433 | 8.7774 |
| 10000 | 12000 | 18.8793 | 9.5824 |

sGEMV Performance Graph

Both the baseline and the vectorized gcc versions performed poorly compared to the rest, which made the other lines in the graph indiscernible, so those were not plotted.


dGEMV Performance

The baseline gcc version performed poorly compared to the rest, which made the other lines in the graph indiscernible, so it was not plotted.


Baseline and Best Execution Times (in ms)

Using the data for matrix dimensions 10000*12000

| Function | Baseline Execution Time | Best Execution Time |
| --- | --- | --- |
| sGEMV | 369.120000 | 14.853000 |
| dGEMV | 381.355000 | 27.520000 |

Speedup

  1. For sGEMV, $$ \text{Speedup} = \frac{369.12}{14.853} = 24.8515 $$

  2. For dGEMV, $$ \text{Speedup} = \frac{381.355}{27.52} = 13.8573 $$

Baseline and Optimized GFlops

For the following function

for(int j = 0; j < M; j++)
{
    Y[j*incY] = beta * Y[j*incY];
    for(int i = 0; i < N; i++)
    {
    	Y[j*incY] += alpha * A[i*lda+j] * X[i*incX];
    }
}

For this function, the total number of operations can be counted as follows:

  • Inside the inner for loop, the line being executed contains 3 floating point operations and is repeated N times.
  • The line just above the inner for loop is one floating point instruction.
  • Both of the above are repeated M times, therefore the total number of floating point operations will be $M*(3*N + 1)$.

$$ \text{GFlops} = \frac{3MN + M}{\text{Time (in ms)} * 10^6} $$

Using the data for matrix dimensions 10000*12000:

| | sGEMV | dGEMV |
| --- | --- | --- |
| Baseline GFlops | 0.9753 | 0.944 |
| Optimized GFlops | 24.2382 | 13.0818 |

Operational Intensity

$$ OI = \frac{\text{number of operations}}{\text{number of bytes}} $$

  1. sGEMV: The number of operations is equal to $M*(3N + 1)$. The total number of bytes is calculated as follows, where N(V) denotes the total number of bytes in V: $$ \text{Total number of bytes} = N(A) + N(X) + N(Y) = 4(MN + M + N) $$ $$ OI = \frac{M(3N+1)}{4*(MN+M+N)} $$

  2. dGEMV: The calculation remains the same as sGEMV, but instead each number is represented using 8 bytes. $$ OI = \frac{M*(3N+1)}{8*(MN+M+N)} $$
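
For the matrix sizes used here the MN terms dominate, so the intensity is nearly constant; for example, with M = 10000 and N = 12000:

$$ OI_{sGEMV} = \frac{10000 \times (3 \times 12000 + 1)}{4 \times (10000 \times 12000 + 10000 + 12000)} = \frac{3.6001 \times 10^8}{4.8009 \times 10^8} \approx 0.75 $$

and correspondingly $OI_{dGEMV} \approx 0.375$.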

Memory Bandwidth

$$ \text{Memory Bandwidth} = \frac{\text{Number of bytes accessed}}{\text{time}} $$

  1. sGEMV $$ \text{Memory Bandwidth} = \frac{4(MN + M + N)}{\text{Time}} $$

  2. dGEMV $$ \text{Memory Bandwidth} = \frac{8(MN + M + N)}{\text{Time}} $$

Using the data for matrix dimensions 10000*12000

| Function | Memory Bandwidth (GB/s) |
| --- | --- |
| sGEMV | 32.322 |
| dGEMV | 34.8901 |

Is the Process CPU Bound or Memory Bound?

The process is CPU bound, as the memory bandwidth that we are getting is much lower than the achievable bandwidth.

BLAS Level 3

BLAS

Command

cd blas-problems/q3/xgemm
make
./q3 1
./q3 2
./q3 3
./q3 4
./q3 5
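
The q3 benchmark follows the standard CBLAS GEMM interface, which computes C = alpha*A*B + beta*C. For orientation, a minimal call through the stock cblas_sgemm looks like this (a sketch, independent of the repo's own implementation):

```c
#include <stdio.h>
#include <cblas.h>

int main(void) {
    // C (2x2) = 1.0 * A (2x3) * B (3x2) + 0.0 * C, all row-major
    float A[6] = {1, 2, 3, 4, 5, 6};
    float B[6] = {1, 0, 0, 1, 1, 1};
    float C[4] = {0};

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 3,      // M, N, K
                1.0f, A, 3,   // alpha, A, lda
                B, 2,         // B, ldb
                0.0f, C, 2);  // beta, C, ldc

    printf("%f %f\n%f %f\n", C[0], C[1], C[2], C[3]);
    return 0;
}
```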

Approach 1: Without Optimizations

Compiler: icc

Execution Time:

| M | K | N | sGEMM | dGEMM |
| --- | --- | --- | --- | --- |
| 200 | 400 | 600 | 38.059000 | 53.497000 |
| 600 | 800 | 1000 | 238.617000 | 357.391000 |
| 1000 | 1200 | 1400 | 992.434000 | 1762.003000 |
| 1400 | 1600 | 1800 | 9095.817000 | 11328.725000 |
| 2000 | 2200 | 2400 | 41747.513000 | 47936.548000 |

GFlops:

| M | K | N | sGEMM | dGEMM |
| --- | --- | --- | --- | --- |
| 200 | 400 | 600 | 3.786752 | 2.693983 |
| 600 | 800 | 1000 | 6.037290 | 4.030879 |
| 1000 | 1200 | 1400 | 5.079834 | 2.861175 |
| 1400 | 1600 | 1800 | 1.330119 | 1.06795 |
| 2000 | 2200 | 2400 | 0.758962 | 0.66097 |

Compiler: gcc

Execution Time:

| M | K | N | sGEMM | dGEMM |
| --- | --- | --- | --- | --- |
| 200 | 400 | 600 | 118.037000 | 120.131000 |
| 600 | 800 | 1000 | 1287.008000 | 1632.157000 |
| 1000 | 1200 | 1400 | 5529.920000 | 6403.790000 |
| 1400 | 1600 | 1800 | 17294.369000 | 19182.455000 |
| 2000 | 2200 | 2400 | 46805.713000 | 55708.977000 |

GFlops:

| M | K | N | sGEMM | dGEMM |
| --- | --- | --- | --- | --- |
| 200 | 400 | 600 | 1.220973 | 1.1997 |
| 600 | 800 | 1000 | 1.119340 | 0.8826 |
| 1000 | 1200 | 1400 | 0.6078 | 0.7873 |
| 1400 | 1600 | 1800 | 0.6995 | 0.6307 |
| 2000 | 2200 | 2400 | 0.6769 | 0.5688 |


Approach 2: Vectorization using -O3

Compiler: icc

Flags: -g -axCORE-AVX2 -O3 -c

Execution Time:

| M | K | N | sGEMM | dGEMM |
| --- | --- | --- | --- | --- |
| 200 | 400 | 600 | 37.538000 | 52.732000 |
| 600 | 800 | 1000 | 240.426000 | 365.831000 |
| 1000 | 1200 | 1400 | 1021.825000 | 1705.998000 |
| 1400 | 1600 | 1800 | 9269.478000 | 11353.136000 |
| 2000 | 2200 | 2400 | 41260.255000 | 49184.683000 |

GFlops:

| M | K | N | sGEMM | dGEMM |
| --- | --- | --- | --- | --- |
| 200 | 400 | 600 | 3.839309 | 2.7331 |
| 600 | 800 | 1000 | 5.991865 | 3.396176 |
| 1000 | 1200 | 1400 | 4.93372 | 2.9551 |
| 1400 | 1600 | 1800 | 1.305199 | 1.0657 |
| 2000 | 2200 | 2400 | 0.76793 | 0.6442 |

Compiler: gcc

Flags: -g -O3 -lm -c

Execution Time:

| M | K | N | sGEMM | dGEMM |
| --- | --- | --- | --- | --- |
| 200 | 400 | 600 | 71.580000 | 75.929000 |
| 600 | 800 | 1000 | 482.551000 | 554.174000 |
| 1000 | 1200 | 1400 | 1916.106000 | 2373.110000 |
| 1400 | 1600 | 1800 | 9020.441000 | 11333.940000 |
| 2000 | 2200 | 2400 | 39199.686000 | 45908.279000 |

GFlops:

| M | K | N | sGEMM | dGEMM |
| --- | --- | --- | --- | --- |
| 200 | 400 | 600 | 2.013412 | 1.8981 |
| 600 | 800 | 1000 | 2.985384 | 2.5995 |
| 1000 | 1200 | 1400 | 2.631065 | 2.1244 |
| 1400 | 1600 | 1800 | 1.341233 | 1.0675 |
| 2000 | 2200 | 2400 | 0.808292 | 0.6902 |


Approach 3: Parallelization using OpenMP

// Dot product parallelized with an OpenMP reduction over the partial sums
float cblas_xdot(const int N, const float  *A, const int incX, const float  *W, const int incY)
{
    float sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for(int i = 0; i < N; i++)
    {
        sum += A[i*incX] * W[i*incY];
    }
    return sum;
}

// phiC, rowA and colB are helper macros defined elsewhere in the source,
// used for indexing C and for extracting row i of A and column j of B
void cblas_xgemm(const enum CBLAS_ORDER Order, const enum CBLAS_TRANSPOSE TransA, const enum CBLAS_TRANSPOSE TransB, const int M, const int N, const int K, const float alpha, const float *A, const int lda, const float *B, const int ldb, const float beta, float *C, const int ldc)
{
    #pragma omp parallel for
    for(int i = 0; i < M; i++)
    {
        for(int j = 0; j < N; j++)
        {
            phiC(i,j) += cblas_sdot(K, rowA(i), 1, colB(j), ldb);
            phiC(i,j) *= alpha;
        }
    }
}

Compiler: icc

Flags: -g -axCORE-AVX2 -O3 -qopenmp

Execution Time:

| M | K | N | sGEMM | dGEMM |
| --- | --- | --- | --- | --- |
| 200 | 400 | 600 | 35.559000 | 54.183000 |
| 600 | 800 | 1000 | 216.657000 | 333.097000 |
| 1000 | 1200 | 1400 | 1595.501000 | 1641.880000 |
| 1400 | 1600 | 1800 | 3926.005000 | 4151.858000 |
| 2000 | 2200 | 2400 | 10702.296000 | 11559.818000 |

GFlops:

| M | K | N | sGEMM | dGEMM |
| --- | --- | --- | --- | --- |
| 200 | 400 | 600 | 4.052982 | 2.6599 |
| 600 | 800 | 1000 | 6.649220 | 4.3249 |
| 1000 | 1200 | 1400 | 3.1598 | 3.0705 |
| 1400 | 1600 | 1800 | 3.0816 | 2.914 |
| 2000 | 2200 | 2400 | 2.9606 | 2.7409 |

Compiler: gcc

Flags: -g -O3 -lm -fopenmp

Execution Time:

| M | K | N | sGEMM | dGEMM |
| --- | --- | --- | --- | --- |
| 200 | 400 | 600 | 57.468000 | 62.228000 |
| 600 | 800 | 1000 | 277.563000 | 401.194000 |
| 1000 | 1200 | 1400 | 1795.891000 | 1864.001000 |
| 1400 | 1600 | 1800 | 4282.664000 | 4551.218000 |
| 2000 | 2200 | 2400 | 11677.727000 | 12886.488000 |

GFlops:

| M | K | N | sGEMM | dGEMM |
| --- | --- | --- | --- | --- |
| 200 | 400 | 600 | 2.5078 | 2.316 |
| 600 | 800 | 1000 | 2.9902 | 2.5908 |
| 1000 | 1200 | 1400 | 2.8072 | 2.7046 |
| 1400 | 1600 | 1800 | 2.825 | 2.6583 |
| 2000 | 2200 | 2400 | 2.7133 | 2.4588 |


BLIS

Command

cd blas-problems/blis/q3
make
./q3.x 1
./q3.x 2
./q3.x 3
./q3.x 4
./q3.x 5

Execution Time:

| M | K | N | sGEMM | dGEMM |
| --- | --- | --- | --- | --- |
| 200 | 400 | 600 | 4.046000 | 9.011000 |
| 600 | 800 | 1000 | 11.519000 | 20.775000 |
| 1000 | 1200 | 1400 | 31.337000 | 60.834000 |
| 1400 | 1600 | 1800 | 75.169000 | 147.386000 |
| 2000 | 2200 | 2400 | 183.760000 | 396.114000 |

GFlops:

| M | K | N | sGEMM | dGEMM |
| --- | --- | --- | --- | --- |
| 200 | 400 | 600 | 35.6204 | 15.9938 |
| 600 | 800 | 1000 | 125.0629 | 69.343 |
| 1000 | 1200 | 1400 | 160.8769 | 82.8714 |
| 1400 | 1600 | 1800 | 160.9509 | 82.0873 |
| 2000 | 2200 | 2400 | 172.4249 | 79.9891 |


sGEMM Performance


dGEMM Performance


Baseline and Best Execution Times

Using the data for M = 2000, K = 2200, N = 2400

| Function | Baseline Execution Time | Best Execution Time |
| --- | --- | --- |
| sGEMM | 46805.713000 | 183.760000 |
| dGEMM | 55708.977000 | 396.114000 |

Speedup

  1. For sGEMM $$ \text{Speedup} = \frac{46805.713000}{183.760000} = 254.7111 $$

  2. For dGEMM $$ \text{Speedup} = \frac{55708.977000}{396.114000} = 140.63874 $$

Baseline and Optimized GFlops

For the following function

for(int i = 0; i < M; i++)
{
    for(int j = 0; j < N; j++)
    {
        phiC(i,j) += cblas_sdot(K, rowA(i), 1, colB(j), ldb);
        phiC(i,j) *= alpha;
    }
}

  • Inside the inner for loop, the line being executed contains 3 floating point operations and is repeated K times.
  • The line just below it is one floating point instruction.
  • Both of the above are repeated $M*N$ times, therefore the total number of floating point operations will be $MN*(3*K + 1)$.

$$ \text{GFlops} = \frac{MN(3K+1)}{\text{Time(in ms)}*10^6} $$

Using the data for M = 2000, K = 2200, N = 2400

sGEMM dGEMM
Baseline GFlops 0.6769 0.5688
Optimized GFlops 172.4249 79.9891
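
As a sanity check, plugging the BLIS sGEMM timing into the formula above reproduces the optimized figure:

$$ \text{GFlops} = \frac{2000 \times 2400 \times (3 \times 2200 + 1)}{183.76 \times 10^6} = \frac{3.1685 \times 10^{10}}{1.8376 \times 10^8} \approx 172.4 $$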

Memory Bandwidth

$$ \text{Memory Bandwidth} = \frac{\text{Number of bytes accessed}}{\text{time}} $$

  1. sGEMM $$ \text{Memory Bandwidth} = \frac{4(MN + MK + NK)}{\text{Time}} $$

  2. dGEMM $$ \text{Memory Bandwidth} = \frac{8(MN + MK + NK)}{\text{Time}} $$

Using the data for M = 2000, K = 2200, N = 2400

| Function | Memory Bandwidth (GB/s) |
| --- | --- |
| sGEMM | 0.315193 |
| dGEMM | 0.292441 |

Operational Intensity

$$ OI = \frac{\text{number of operations}}{\text{number of bytes}} $$

  1. sGEMM $$ OI = \frac{MN(3K+1)}{4(MN + MK + NK)} $$

  2. dGEMM $$ OI = \frac{MN(3K+1)}{8(MN + MK + NK)} $$
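
Evaluating for M = 2000, K = 2200, N = 2400:

$$ OI_{sGEMM} = \frac{2000 \times 2400 \times 6601}{4 \times (2000 \times 2400 + 2000 \times 2200 + 2400 \times 2200)} = \frac{3.1685 \times 10^{10}}{5.792 \times 10^{7}} \approx 547 $$

i.e. roughly 547 flops per byte for sGEMM (and about half of that for dGEMM), which is deep in compute-bound territory.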

Is the Process CPU Bound or Memory Bound?

The process is CPU bound, as the memory bandwidth that we are getting is much lower than the achievable bandwidth.

Stencil Computation

Here, benchmarking is done by varying the size of the stencil from 3 to 83 in steps of 20. Two optimizations are applied, namely

  • Vectorization using -O3
  • Parallelization using OpenMP

The parallelization is done as follows

#pragma omp parallel for
for (int i = 0; i < dimY; i++)
{
    for (int j = 0; j < dimX; j++)
    {
        Y[i * dimX + j] = 0.0;
        for (int kx = 0; kx < k && i + kx < dimY; kx++)
        {
            float temp = 0.0;
            // #pragma omp parallel for reduction(+:temp)
            for (int ky = 0; ky < k && j + ky < dimX; ky++)
            {
                temp += X[(i + kx) * dimX + (j + ky)] * S[kx * k + ky];
            }
            Y[i * dimX + j] += temp;
        }
    }
}

HD

Compiler: gcc

Flags for vectorization: -g -O3 -lm -c

Flags for parallelization: -g -O3 -lm -fopenmp -c

The format in each cell is (Execution Time (ms), GFlops, Memory Bandwidth).

| Stencil Size | Without Optimizations | Vectorization | Parallelization |
| --- | --- | --- | --- |
| 3 | 93.878000, 0.397588, 0.176706 | 23.635000, 1.579217, 0.701874 | 4.608000, 8.100000, 3.600000 |
| 23 | 4378.076000, 0.501103, 0.003789 | 1110.388000, 1.975768, 0.014940 | 122.757000, 17.871639, 0.135135 |
| 43 | 15249.852000, 0.502836, 0.001088 | 3795.378000, 2.020398, 0.004371 | 499.319000, 15.357262, 0.033223 |
| 63 | 33102.726000, 0.497247, 0.000501 | 7936.417000, 2.074014, 0.002090 | 956.817000, 17.203119, 0.017337 |
| 83 | 55492.324000, 0.514847, 0.000299 | 13541.367000, 2.109836, 0.001225 | 1789.617000, 15.964344, 0.009269 |

Compiler: icc

Flags for vectorization: -g -axCORE-AVX2 -O3

Flags for parallelization: -g -axCORE-AVX2 -O3 -qopenmp

The format in each cell is (Execution Time (ms), GFlops, Memory Bandwidth).

| Stencil Size | Without Optimizations | Vectorization | Parallelization |
| --- | --- | --- | --- |
| 3 | 26.411000, 1.413229, 0.628102 | 23.276000, 1.603574, 0.712700 | 8.763000, 4.259363, 1.893050 |
| 23 | 862.628000, 2.543239, 0.019231 | 791.390000, 2.772171, 0.020962 | 225.761000, 9.717661, 0.073479 |
| 43 | 2860.697000, 2.680526, 0.005799 | 2811.835000, 2.727106, 0.005900 | 750.108000, 10.222758, 0.022115 |
| 63 | 6420.337000, 2.563765, 0.002584 | 6143.960000, 2.679092, 0.002700 | 1665.595000, 9.882497, 0.009960 |
| 83 | 12027.179000, 2.375458, 0.001379 | 11989.756000, 2.382873, 0.001384 | 2877.264000, 9.929593, 0.005765 |


UHD

Compiler: gcc

Flags for vectorization: -g -O3 -lm -c

Flags for parallelization: -g -O3 -lm -fopenmp -c

The format in each cell is (Execution Time (ms), GFlops, Memory Bandwidth).

| Stencil Size | Without Optimizations | Vectorization | Parallelization |
| --- | --- | --- | --- |
| 3 | 350.949000, 0.425416, 0.189074 | 64.631000, 2.310025, 1.026678 | 18.118000, 8.240380, 3.662391 |
| 23 | 13243.805000, 0.662610, 0.005010 | 2442.030000, 3.593517, 0.027172 | 499.481000, 17.569187, 0.132848 |
| 43 | 44057.716000, 0.696193, 0.001506 | 10637.017000, 2.883580, 0.006238 | 2032.337000, 15.092325, 0.032650 |
| 63 | 94987.475000, 0.693154, 0.000699 | 21839.558000, 3.014756, 0.003038 | 4025.289000, 16.356825, 0.016485 |
| 83 | 168034.916000, 0.680098, 0.000395 | 42745.312000, 2.673515, 0.001552 | 7533.248000, 15.170116, 0.008808 |

Compiler: icc

Flags for vectorization: -g -axCORE-AVX2 -O3

Flags for parallelization: -g -axCORE-AVX2 -O3 -qopenmp

The format in each cell is (Execution Time (ms), GFlops, Memory Bandwidth).

| Stencil Size | Without Optimizations | Vectorization | Parallelization |
| --- | --- | --- | --- |
| 3 | 121.918000, 1.224587, 0.544261 | 103.103000, 1.448059, 0.643582 | 31.414000, 4.752633, 2.112281 |
| 23 | 3242.232000, 2.706615, 0.020466 | 3473.746000, 2.526228, 0.019102 | 899.212000, 9.759073, 0.073793 |
| 43 | 11546.700000, 2.656403, 0.005747 | 11816.069000, 2.595846, 0.005616 | 3048.473000, 10.061657, 0.021767 |
| 63 | 25142.785000, 2.618682, 0.002639 | 25954.868000, 2.536748, 0.002557 | 6623.252000, 9.940879, 0.010019 |
| 83 | 49031.951000, 2.330730, 0.001353 | 48800.378000, 2.341790, 0.001360 | 11740.869000, 9.733542, 0.00565 |


GCC Performance


ICC Performance


Operational Intensity

For each pixel of the output we have to perform $2k^2$ operations, so $$ \text{Operational Intensity} = \frac{2k^2XY}{8XY + 8k^2} $$
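
Since the $8XY$ term dominates the denominator for any realistic image size, this simplifies to roughly

$$ OI \approx \frac{2k^2XY}{8XY} = \frac{k^2}{4}, \qquad OI_{k=3} \approx 2.25, \quad OI_{k=83} \approx 1722 $$

so the arithmetic intensity grows quadratically with the stencil size.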

Baseline and Best Execution Time (in ms)

The data for k = 83 is being used

HD

Baseline Execution Time: 55492.324000

Best Execution Time: 1789.617000

UHD

Baseline Execution Time: 168034.916000

Best Execution Time: 7533.248000

Speedup

HD

$$ \text{Speedup} = \frac{55492.324000}{1789.617000} = 31.0079 $$

UHD

$$ \text{Speedup} = \frac{168034.916000}{7533.248000} = 22.3058 $$

Is the Process CPU Bound or Memory Bound?

The process is CPU bound, as the memory bandwidth that we are getting is much lower than the achievable bandwidth.
