Evaluations comparing pthreads and deterministic Linux

General strategy
~~~~~~~~~~~~~~~~

The typical strategy for these compute-bound applications is fork-join: we fork off N tasks, then join on those N tasks. In the pthreads world, we use pthread_create, then pthread_join. Deterministic Linux uses a similar facility: bench_fork and bench_join. bench_fork creates a new task with the Snap dput parameter, and bench_join does a Merge dget. The dget merges memory from _etext to _end, symbols produced by the linker. This region contains the program data (but no heap data). Some programs could benefit from a restricted merge region, since some applications do not operate on all the memory between _etext and _end.

The deterministic benchmarks time how long the Merge operation takes. To do this, we modify how we use bench_join: we first do a dget(pid, 0, 0, 0, 0) merely to wait until the child terminates, then time the call to bench_join.

Each benchmark is run on a 2 socket x 4 core Intel Xeon E5410 2.33GHz machine with 8 GB of RAM running Arch Linux. To run a benchmark with N cores, we specify maxcpus=N on the kernel command line to limit the number of available processors.

LU decomposition
~~~~~~~~~~~~~~~~

We run on matrices of size NxN, where N takes on values in {16, 32, 64, 128, 256, 512, 1024, 2048}. We partition each matrix into 256 blocks (16x16) and fork a thread for each block. Both the pthreads and deterministic versions generate the input matrix randomly, using identical seed values for each computation, so that both versions operate on the same matrices.

We use the N^2 partitioning algorithm described in http://www.cse.uiuc.edu/courses/cs554/notes/06_lu.pdf. The matrix is divided into some number of blocks, and we execute a task for each block. Each task is its own thread. We don't use thread pooling, though pooling would likely avoid the cost of forking (which is higher for deterministic Linux than for pthread_create).

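As a rough illustration of the fork-join pattern that the per-block tasks follow, here is a minimal pthreads sketch. It is not the benchmark code: struct task, run_task, fork_join_sum and MAX_TASKS are hypothetical names, and summing a slice of an array stands in for the real per-block work. The deterministic version would use bench_fork and bench_join in place of pthread_create and pthread_join; their signatures are not given here, so they are not sketched.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MAX_TASKS 256  /* assumption: one task per block of a 16x16 partitioning */

/* Hypothetical per-task argument: a slice of the input plus a result slot. */
struct task {
    const long *data;
    size_t begin, end;   /* half-open range [begin, end) */
    long result;
};

/* Thread body: sum this task's slice. Stands in for real per-block work. */
static void *run_task(void *arg)
{
    struct task *t = arg;
    long sum = 0;
    for (size_t i = t->begin; i < t->end; i++)
        sum += t->data[i];
    t->result = sum;
    return NULL;
}

/* Fork ntasks threads over data[0..len), join them all, combine the results.
 * Returns 0 on success, -1 if a thread could not be created or joined. */
int fork_join_sum(const long *data, size_t len, size_t ntasks, long *out)
{
    pthread_t tid[MAX_TASKS];
    struct task tasks[MAX_TASKS];
    size_t forked = 0;
    int rc = 0;

    assert(ntasks > 0 && ntasks <= MAX_TASKS);

    /* Fork phase: start every task before waiting on any of them. */
    for (size_t i = 0; i < ntasks; i++) {
        tasks[i].data = data;
        tasks[i].begin = len * i / ntasks;
        tasks[i].end = len * (i + 1) / ntasks;
        tasks[i].result = 0;
        if (pthread_create(&tid[i], NULL, run_task, &tasks[i]) != 0) {
            rc = -1;
            break;
        }
        forked++;
    }

    /* Join phase: wait for each forked task, then fold in its result. */
    *out = 0;
    for (size_t i = 0; i < forked; i++) {
        if (pthread_join(tid[i], NULL) != 0)
            rc = -1;
        else
            *out += tasks[i].result;
    }
    return rc;
}
```

A caller would pass an array, its length, and a task count, for example fork_join_sum(data, 8, 4, &total), and build with the compiler's pthread option (typically -pthread). Each task writes only to its own result slot, so no locking is needed between fork and join.
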
The pthreads version forks a thread for each partitioned block, then immediately joins on each forked thread. Initially, however, most threads are not runnable. Each thread waits on a pthread condition: a given task waits until the tasks "above" and "to the left" finish. This condition ensures that the correct data is available for computing the values of L and U for a given block.

The deterministic version uses no condition variables or mutexes (since our API doesn't provide a fully pthreads-compliant implementation). Instead of forking a thread for every block at initialization, we enter a loop that forks threads only when they can do work. The loop forks all tasks whose data dependencies are met, then joins on the forked tasks. This repeats until all tasks have been run.

Matrix multiplication
~~~~~~~~~~~~~~~~~~~~~

Matrix multiplication of two 1024x1024 matrices. The matrices are partitioned into blocks of size 256x256. The main thread forks 16 tasks, one for each partitioned block. There is no synchronization or locking, so all threads are started immediately, and the main thread joins on each. This is how both the pthreads and deterministic versions work. In fact, the C code for each is nearly identical, the notable exception being the replacement of pthread_create with bench_fork and pthread_join with bench_join.

The benchmark computes the matrix multiply ten times and averages the runtime. Each computation uses different input matrices. We generate the matrices randomly using the same seed for both pthreads and deterministic Linux, thus ensuring the two versions multiply the same matrices.

md5
~~~

md5 hash string searcher (e.g. to crack passwords). We search for 11 randomly generated strings (they are hardcoded into the program so that the pthreads and deterministic versions have identical input). N threads are forked to search for the string, where N is the number of processors running.

quicksort
~~~~~~~~~