ccotter/results

Evaluations comparing pthreads and deterministic Linux

General strategy
~~~~~~~~~~~~~~~~
The typical strategy for these compute-bounded applications uses fork-join.
We fork off N tasks, then join on those N tasks. In the pthreads world, we
use pthread_create, then pthreads_join. Deterministic Linux uses a similar
facility: bench_fork and bench_join. bench_fork creates a new task with the
Snap dput parameter, and bench_join does a Merge dget. The dget merges memory
from _etext to _end, the symbols produced by the linker. This region contains
the program data (no heap data though).

Some programs could benefit from using a restricted merge region, since we know
that some applications do not operate on all the memory contained between
_etext and _end. Deterministic benchmarks time how long the Merge operation
takes. This is accomplished by modifying how we use bench_join; we do a
dget(pid, 0, 0, 0, 0) to merely wait until the child terminates, then time the
call to bench_join.

Each benchmark is run on a 2 socket x 4 core Intel Xeon E5410 2.33GHz machine
with 8BG of RAM running Arch Linux. To run a benchmark with N cores, we specify
maxcpus=N to the kernel command line to limit the number of available
processors.

LU decomposition
~~~~~~~~~~~~~~~~
We run on matrices of sizes NxN where N takes on values in {16, 32, 64, 128, 256
512, 1024, 2048}. We partition into 256 blocks (16x16) and fork a thread for
each block.

Both pthreads and deterministic versions generate the input matrix randomly
using identical seed values for each computation (so that both versions
operate on the same matrices).

We use the N^2 partitioning algorithm described in
http://www.cse.uiuc.edu/courses/cs554/notes/06_lu.pdf.
The matrix is divided into some number of blocks, and we execute a task for each
block. Each task is its own thread. We don't use thread pooling, though we could
likely avoid the cost of forking (which is higher for deterministic Linux than
pthread_create).

The pthreads version forks a thread for each partitioned block, then immediately
joining on each forked thread. However, initially most threads will not be
runnable. Each thread waits on a pthread condition; a given task waits until the
tasks "above" and "to the left" finish. This condition ensures the correct data
is available for computing the values of L and U for a given block.

The deterministic version uses no condition variables or mutexes (since our API
doesn't provide a fully pthreads compliant implementation). Instead of forking
a thread for each block at once on initialization, we enter a loop forking
threads only when they can do work. The loop forks all tasks that have their
data dependencies met, then joins on the forked tasks. This is done until all
tasks have been run.

Matrix multiplication
~~~~~~~~~~~~~~~~~~~~~
Matrix multiplication of two 1024x1024 matrices. The matrices are partitioned
into blocks of size 256x256. The main thread forks 16 tasks to compute each
partitioned block. There is no synchronization/locking, so all threads are
started immediately, and the main thread joins on each. This is how both the
pthreads and deterministic versions work. In fact, the C code for each is nearly
identical, the notable exception being replacing pthread_create -> bench_fork
and pthread_join -> bench_join.

The benchmark computes the matrix multiply ten times and averages the runtime.
Each computation uses different input matrices. We generate the matrices
randomly using the same seed for both pthreads and deterministic Linux, thus
ensuring the two versions multiply the same matrices.

md5
~~~
md5 hash string searcher (e.g. to crack passwords).

We search 11 randomly generated strings (they are hardcoded into the program
so the pthreads and deterministic versions have identical input). N threads are
forked to search for the string where N is the number of processors running.

quicksort
~~~~~~~~~
ccotter / results

About