Dahlia Evaluation

This repository contains the evaluation materials for the PLDI 2020 paper "Predictable Accelerator Design with Time-Sensitive Affine Types" using the Dahlia programming language.

If you use our data or the Dahlia language, please cite us:

@inproceedings{10.1145/3385412.3385974,
author = {Nigam, Rachit and Atapattu, Sachille and Thomas, Samuel and Li, Zhijing and Bauer, Theodore and Ye, Yuwei and Koti, Apurva and Sampson, Adrian and Zhang, Zhiru},
title = {Predictable Accelerator Design with Time-Sensitive Affine Types},
year = {2020},
isbn = {9781450376136},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3385412.3385974},
doi = {10.1145/3385412.3385974},
booktitle = {Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation},
pages = {393–407},
numpages = {15},
keywords = {Affine Type Systems, High-Level Synthesis},
location = {London, UK},
series = {PLDI 2020}
}

There are three components to the evaluation:

  • Benchmarks (this repository).
  • The Dahlia Compiler: A compiler from Dahlia to Vivado HLS C.
  • Polyphemus Server: A client–server system for orchestrating large-scale FPGA experiments.

Prerequisites

If you're using the virtual machine image (see below), you just need the hypervisor. Otherwise, to set up the evaluation outside of the VM, start by cloning this repository. You will need these prerequisites:

  1. Get Python 3 if you don't already have it
  2. Install GNU parallel
  3. Install Jupyter with pip3 install jupyter
  4. Install other Python dependencies with pip3 install -r requirements.txt (in this repository)
  5. Install the local benchmarking helpers with cd benchmarking-helpers && pip3 install -e .
  6. Run the sanity-checking script ./_scripts/sanity-check.sh to make sure all the tools are configured correctly.

Getting Started Guide

  • Download the VM Appliance. The username and password are dahlia.
  • (Optional, but recommended) Enable multiple physical cores for the virtual machine. In VirtualBox, select the appliance and, under Settings > System > Processor, enable all physical cores.
  • Boot the image in your favorite hypervisor (we tested the image using VirtualBox).
  • Open a terminal and type cd Desktop/dahlia-evaluation.
  • Get the latest version of this repository: git pull.
  • Run ./_scripts/sanity-check.sh. The script should report no errors.
  • Run ESTIMATE=100 ./_scripts/run-dahlia-accepts.sh. The script runs the Dahlia compiler on 100 configurations for each benchmark and reports a time estimate for running on all configurations.
  • Run jupyter nbconvert --execute main-figures.ipynb and then type ls all-figures/ | wc -l. The reported number should be 13.
  • Open http://cerberus.cs.cornell.edu:5000. The web page should display the Polyphemus deployment available to PLDI AEC reviewers.

Step-by-Step Guide

For artifact evaluation, we would like reviewers to go through the following steps (each of which is described in detail in a section below):

  • Configurations accepted by Dahlia: Measure how many points in a large design space are well-typed according to Dahlia's type system.
    • Exhaustive DSE: % of configurations accepted by Dahlia.
    • Qualitative Studies: % of configurations accepted by Dahlia.
  • Experimental data and graph generation: See how the raw data for our experiments turns into the charts you see in the paper.
    • Regenerate all graphs in the paper using the main-figures.ipynb script.
    • (Optional) Open the Jupyter notebook and read the explanation for all the experiments.
    • (Optional) Visually inspect the collected data in the repository.
  • Data collection example: Actually run the HLS experiments in the paper to collect the aforementioned raw data.
    • Try out the scaled down data collection example with Polyphemus, our server for running FPGA compilation jobs.
    • (Optional) Read the documentation on setting up a new experiment with Polyphemus ("Designing New Experiments" in this file).
  • (Optional) Using the Dahlia compiler: Compile our example programs and write your own programs, observing the output HLS C code and the error messages.
    • (Optional) Rebuild the compiler.
    • (Optional) Run the examples and check the error messages generated by the compiler.
    • (Optional) Check out the documentation on the language.

Configurations accepted by Dahlia (Estimated time: 4 hours w/ 4 physical cores)

We recommend that reviewers use as many physical cores as they have available to speed up this section. The script uses GNU parallel to distribute the work, so the actual runtime will depend on the number of cores available.

In this section, we will reproduce the following claims:

Section 5.2

Dahlia accepts 354 configurations, or about 1.1% of the unrestricted design space.

Section 5.3 (stencil2d)

The resulting design space has 2,916 points. Dahlia accepts 18 of these points (0.6%).

Section 5.3 (md-knn)

The full space has 16,384 points, of which Dahlia accepts 525 (3%).

Section 5.3 (md-grid)

The full space has 21,952 points, of which Dahlia accepts 81 (0.4%).

Each claim has two parts: (1) the number of configurations in the design space, and (2) the number of configurations accepted by Dahlia (i.e., they are well-typed according to Dahlia's type checker).

Run the following command:

./_scripts/run-dahlia-accepts.sh

For each benchmark, the script generates k directories, where k is the number of configurations, and runs the Dahlia compiler on each configuration. It reports the number of configurations accepted for each benchmark and writes files named *-accepted to the repository root, each containing the paths of the configurations that Dahlia accepts. Do not delete these files; they are used in the data collection experiment.
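For reference, the per-benchmark loop is conceptually similar to the Python sketch below. The dahlia command name, the configuration-directory pattern, and the .fuse file layout are illustrative assumptions; the real script drives the actual configuration directories through GNU parallel.

# Conceptual sketch only: type-check every generated configuration and
# record the ones the compiler accepts. The "dahlia" command name and the
# directory/file layout below are assumptions, not the real script.
import glob
import subprocess

accepted = []
for conf_dir in sorted(glob.glob("exhaustive-dse/gemm-*")):      # hypothetical pattern
    src = glob.glob(conf_dir + "/*.fuse")[0]                     # the Dahlia source file
    result = subprocess.run(["dahlia", src], capture_output=True)
    if result.returncode == 0:                                   # exit 0 = well-typed
        accepted.append(conf_dir)

print(len(accepted), "configurations accepted")
with open("gemm-accepted", "w") as f:                            # mirrors the *-accepted files
    f.write("\n".join(accepted) + "\n")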


Figures and Pareto points (Estimated time: 10-15 minutes)

In this section, we reproduce all the graphs in the paper from data already committed to the repository. Because actually running the experiments and collecting the data requires access to proprietary compilers and/or hardware, we address data collection in the next section.

  • In the dahlia-evaluation/ directory, run jupyter notebook. Your browser should open.
  • Click on main-figures.ipynb.
  • Click on the "Restart the kernel and re-run the whole notebook" button (⏩️).
  • All the graphs will be generated within the notebook under the corresponding section.

Note: The colors and background of the graphs might look different, but the points and the labels are correct.
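To get a feel for how Pareto points come out of a summary CSV, here is a minimal Python sketch of a Pareto-frontier filter. The file path and the lut/avg_latency column names are hypothetical stand-ins; the notebook defines the actual columns and plotting code.

# Minimal sketch, not the notebook's code: keep the points that no other
# point dominates in (area, latency). Path and column names are hypothetical.
import pandas as pd

df = pd.read_csv("exhaustive-dse/data/summary.csv")

def pareto(points):
    keep = []
    for _, p in points.iterrows():
        dominated = ((points["lut"] <= p["lut"]) &
                     (points["avg_latency"] <= p["avg_latency"]) &
                     ((points["lut"] < p["lut"]) |
                      (points["avg_latency"] < p["avg_latency"]))).any()
        if not dominated:
            keep.append(p)
    return pd.DataFrame(keep)

print(pareto(df)[["lut", "avg_latency"]])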

Information on saved data: We invite reviewers to optionally look at our collected data. This section describes where all the saved data is stored.

Sensitivity analysis (sensitivity-analysis/)

The sensitivity analysis consists of three experiments:

  1. Fig. 4a: Unrolling the innermost loop without any partitioning (sensitivity-analysis/no-partition-unoll/summary.csv).
  2. Fig. 4b: Unrolling with a constant partitioning (sensitivity-analysis/const-partition-unroll/summary.csv).
  3. Fig. 4c: Unrolling and partitioning in lockstep (sensitivity-analysis/lockstep-partition-and-unroll/summary.csv).

Exhaustive DSE (exhaustive-dse/data/)

The exhaustive design space exploration study uses a single experiment with 32,000 distinct configurations to generate the three subgraphs in Figure 7.

Qualitative study (qualitative-study/data/)

The qualitative study consists of three benchmarks:

  1. stencil2d (qualitative-study/stencil2d).
  2. md-knn (qualitative-study/md-knn).
  3. md-grid (qualitative-study/md-grid).

Spatial (spatial-sweep/data/)

The Spatial study consists of one experiment with several configurations to generate Figure 9 (main paper) and Figure 2 (supplementary text).


Data Collection (Estimated Time: 2-3 hours)

This section describes how to actually run the experiments that generate the raw data behind the plots shown above. This step is the trickiest because it requires access to proprietary Xilinx toolchains and, in some cases, actual FPGA hardware. We have attempted to make this as painless as possible by using AWS EC2, which provides a license-free way to use the Xilinx toolchain as well as "F1" instances that come equipped with Xilinx FPGAs, and Polyphemus, a server we developed to manage large numbers of FPGA compilation and execution jobs.

Each figure in the paper requires data from different sources:

  1. Exhaustive DSE (fig. 7) & Qualitative Studies (fig. 8): Requires Vivado HLS estimation tools.
  2. Sensitivity analysis (fig. 4): Requires full hardware synthesis toolchain and an FPGA to run the designs.
  3. Spatial Comparison (fig. 9): Requires a functional Spatial toolchain that can target the ZedBoard.

Instructions for Artifact Evaluation: These directions will not reproduce the full set of data reported in the paper, which is generally not practical within the evaluation time (fig. 7, for example, took us 2,666 CPU-hours to produce). We instead provide smaller versions of each experiment that are practical to run within a reasonable amount of time. The idea is to demonstrate that our distributed FPGA experimentation framework is functional and thereby give evidence that our reported data is correct. We also provide instructions for reproducing our original results.

The experiments require access to a deployment of our AWS-based experimentation server. For the PLDI AEC, we've asked the PC chairs for permission to provide the reviewers with access to our deployment. Since it is expensive to keep the servers up, we ask the reviewers to coordinate with us to set up two-day windows for evaluating our data collection scripts.

For ease of evaluation, we've automated the experiments that generate the data for the qualitative studies. The Makefile at the root of the repository provides rules to automatically submit the jobs, monitor them, and download the results to generate graphs.

All three qualitative studies are available to be run. We recommend that reviewers start with the md-grid study, since it has only 81 configurations and takes ~2 hours to run on the cluster.

  1. Make sure machsuite-md-grid-accepted is present in the repository root. This file is generated in the "Configurations accepted by Dahlia" step.
  2. Run the following command.
make start-job BENCH=qualitative-study/machsuite-md-grid
  • The command will generate all the configurations and upload them to cerberus.cs.cornell.edu:5000.
  • The script will also start the following command to monitor the status of the jobs:
    watch -n5 ./_scripts/status.py machsuite-md-grid-data/
    
  3. After uploading, most jobs should be in the make stage and some of them in the makeing stage. If there are no jobs in the makeing stage, please message us.
  4. Wait for all jobs to enter the done state. Once this happens, exit the watch script. If a job is in the failed state, see the instructions below.
  5. Run the following command to generate the resource summary file machsuite-md-grid-data/summary.csv:
    make summarize-data BENCH=qualitative-study/machsuite-md-grid
    
  6. Run the following command to generate the graph PDF data-collect-machsuite-md-grid-middle-unroll.pdf:
    ./qualitative-study/server-scripts/plot.py machsuite-md-grid
    
    Compare this PDF to the one generated under all-figures/ for the same benchmark.

To run other benchmarks, replace qualitative-study/machsuite-md-grid with qualitative-study/machsuite-md-knn (525 configurations, ~10 hours) or qualitative-study/machsuite-stencil-stencil2d-inner (18 configurations, ~20 minutes).

Note on intermittent failures: During the monitoring phase, some jobs might be reported as failing. The most likely cause is a data race between nodes in the cluster: several execution nodes attempted to execute the same configuration and ended up in an erroneous state.

To re-run a failed job:

  1. Copy the reported job ID and open the Polyphemus deployment.
  2. Ctrl-F search the job ID and click on the link.
  3. On the job page, click on the "state" dropdown, select "Start Make", and click on "set".
  4. The job will then be restarted.
  5. If this doesn't solve the problem, please message us about the jobs reported as failed.

Note on hung jobs: Depending on the job, the backend compiler (Vivado HLS) might consume a large amount of memory and cause the underlying process to never terminate. Unfortunately, there is no way to distinguish such runaway processes from long-running estimation jobs. If a job is stuck in the makeing stage for more than two hours, please message us.

(Optional) Using the Dahlia Compiler (Estimated time: 10-15 minutes)

We provide two ways of interacting with and evaluating the Dahlia compiler.

  1. Follow the examples on the Dahlia demo webpage. The compiler is packaged and served using Scala.js and does not require connection to a server.
  2. Follow the instructions and rebuild the Dahlia compiler from source. The compiler supports a software backend and has extensive testing to ensure correctness of the various program analyses.
  3. We additionally provide language documentation for the various parts of the compiler.

Reproducing other studies

Polyphemus experiments go through the following flow:

Set up a Polyphemus deployment with multiple estimation machines and at least one FPGA machine. Note that the Polyphemus deployment for PLDI AEC reviewers does not support FPGA machines.

Sensitivity Analysis (Estimated time: 80-120 compute hours/parallelizable)

There are three experiment folders under sensitivity-analysis. For each of the folders, run the following commands. Set the BUILDBOT environment variable to point to your Polyphemus deployment.

  • Set variable for specific experiment (we show one example):
    export EXPERIMENT=sensitivity-analysis/const-partition-unroll
    cd $EXPERIMENT
    
  • Generate all configurations:
    ../../_scripts/gen-dse.py $EXPERIMENT/gemm
    
  • Upload the configurations:
    ../../_scripts/batch.py -p $(basename $EXPERIMENT) -m hw $EXPERIMENT/gemm-*
    
  • Wait for all jobs to complete. Monitor them by running:
    watch -n5 ../../_scripts/status.py ./
    
  • Download the data, summarize it, and generate the graphs:
    make graphs
    

Exhaustive DSE (Estimated time: 2,666 compute hours/parallelizable)

We ran our evaluation on 20 AWS machines, each with 4 workers, over the course of a week. This experiment requires babysitting the server fleet and manually restarting some jobs and machines.

Because of the amount of direct interaction required, we assume that the reader has read the documentation for Polyphemus and understands the basics of instances and the jobs folder.

Due to the sheer size of the experiment, we recommend monitoring job status and extracting the data on one of the servers instead of downloading everything locally. We provide scripts to monitor and collect the data on the server.

  • Set a unique prefix to track jobs associated with this experimentation run.
    export PREFIX=exhaustive
    
  • Generate all the configurations and upload them:
    cd exhaustive-dse/ &&
    ../_scripts/gen-dse.py gemm &&
    ../_scripts/batch.py -p $PREFIX -m estimate gemm-*
    
    Depending on the number of threads for the upload server, this step can take up to two days. However, Polyphemus starts executing jobs as soon as they are uploaded. Keep this script running in a different shell and move on to the next step.
  • Log on to a Polyphemus server and enter the instance directory that contains the jobs/ folder.
  • Copy the scripts under exhaustive-dse/scripts/ into this folder.
  • To monitor the jobs, first run ./get-prefix-jobs.sh $PREFIX which generates a file named $PREFIX-jobs.
  • Run ./status.sh $PREFIX-jobs to get the state of all the jobs.
  • When all jobs are in a done state, run:
    cat $PREFIX-jobs | parallel --progress --bar './extract.py jobs/{}'
    
    This generates all the resource summaries under raw/.
  • Finally, run the following to generate a summary CSV.
    ls raw/*.json | parallel --progress --bar './to-csv.py raw/{}'
    
  • The downloaded CSV can be analyzed using the main-figures.ipynb notebook in the repository root.

Spatial Comparison (Estimated time: 2 hours)

At the time the paper was submitted, Spatial was still being actively developed and changed. To reproduce our experimental results, please follow the instructions in our fork of spatial-quickstart.


Designing New Experiments

To design a new large-scale experiment with Polyphemus, parameterize it for use with gen-dse.py, a search-and-replace script that generates a folder for each possible configuration.

When invoked on a folder, it looks for a template.json file that maps parameters in files to their possible values. For example, consider the following two files in a folder named bench:

bench.cpp:

int x = ::CONST1::;
int y = ::CONST2::;
x + y;

template.json:

{
  "bench.cpp": {
    "CONST1": [1, 2, 3],
    "CONST2": [1, 2, 3]
  }
}

gen-dse.py will generate 9 configurations in total by iterating over all combinations of the possible values of CONST1 and CONST2.
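Conceptually, the expansion works like the following Python sketch. This is a simplified illustration under the template.json format shown above, not the actual gen-dse.py implementation (in particular, the bench-<i> output naming is made up here).

# Simplified illustration of the search-and-replace expansion, not the real
# gen-dse.py: substitute every ::PARAM:: marker with each combination of
# values from template.json and write one folder per configuration.
import itertools
import json
import os

bench = "bench"
with open(os.path.join(bench, "template.json")) as f:
    template = json.load(f)

# One (file, parameter, values) axis per entry, e.g. ("bench.cpp", "CONST1", [1, 2, 3]).
axes = [(fname, param, values)
        for fname, params in template.items()
        for param, values in params.items()]

for i, combo in enumerate(itertools.product(*(values for _, _, values in axes))):
    out = f"{bench}-{i}"                      # made-up naming scheme for this sketch
    os.makedirs(out, exist_ok=True)
    for fname in template:
        with open(os.path.join(bench, fname)) as f:
            text = f.read()
        for (f2, param, _), value in zip(axes, combo):
            if f2 == fname:
                text = text.replace(f"::{param}::", str(value))
        with open(os.path.join(out, fname), "w") as f:
            f.write(text)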

Follow the workflow from the "Sensitivity Analysis" study above to upload and run the jobs.


Benchmarking Scripts

The infrastructure for running benchmarks is under the _scripts directory.

For these scripts, you can set a BUILDBOT environment variable to point to the URL of the running Buildbot instance.

batch.py: Submit a batch of benchmark jobs to the Buildbot.

Each argument to the script should be the path to a specific benchmark version in this repository, like baseline/machsuite-gemm-ncubed. Use it like this:

./_scripts/batch.py <benchpath1> <benchpath2> ...

The script creates a new directory for the batch under _results/ named with a timestamp. It puts a list of job IDs in a file called jobs.txt there. It prints the name of the batch directory (i.e., the timestamp) to stdout.

This script has command-line options:

  • -E: Submit jobs for full synthesis. (The default is to just do estimation.)
  • -p: Pretend to submit jobs, but don't actually submit anything. (For debugging.)
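For orientation, the bookkeeping described above corresponds roughly to the following Python sketch. The submit_job helper is a hypothetical placeholder for the actual upload to the Buildbot; only the _results/<timestamp>/jobs.txt layout comes from the description above.

# Rough sketch of batch.py's bookkeeping, not the real script. submit_job()
# is a hypothetical placeholder for the actual upload to $BUILDBOT.
import datetime
import os
import sys

def submit_job(bench_path):
    """Hypothetical: upload one benchmark version and return its job ID."""
    raise NotImplementedError

batch = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
batch_dir = os.path.join("_results", batch)
os.makedirs(batch_dir)

job_ids = [submit_job(path) for path in sys.argv[1:]]
with open(os.path.join(batch_dir, "jobs.txt"), "w") as f:
    f.write("\n".join(job_ids) + "\n")

print(batch)  # the batch name (timestamp) goes to stdout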
extract.py: Download results for a previously submitted batch of benchmark jobs.

On the command line, give the path to the batch directory. Like this:

./_scripts/extract.py _results/2019-07-13-17-13-09

The script downloads information about the jobs listed in jobs.txt in that directory. It saves lots of extracted result values for the batch in a file called results.json there.

summarize.py: Given some extracted data for a batch, summarize the results in a human-friendly CSV.

Give the script the path to a results.json, like this:

./_scripts/summarize.py _results/2019-07-13-17-13-09/results.json

The script produces a file in the same directory called summary.csv with particularly relevant information pulled out.
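As a rough illustration of this step's shape, the sketch below flattens a results.json-style mapping into a CSV. The assumed structure ({job_id: {field: value, ...}}) and the selected field names are hypothetical; the real summarize.py picks out the relevant values itself.

# Rough illustration only: flatten a results.json-like structure into a CSV.
# The structure and the selected field names ("lut", "latency") are
# hypothetical; the real summarize.py decides which values are relevant.
import csv
import json
import os
import sys

results_path = sys.argv[1]                      # e.g. _results/<batch>/results.json
with open(results_path) as f:
    results = json.load(f)

fields = ["job", "lut", "latency"]              # hypothetical columns
out_path = os.path.join(os.path.dirname(results_path), "summary.csv")
with open(out_path, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for job_id, values in results.items():      # assumed {job_id: {...}} structure
        writer.writerow({"job": job_id, **values})

print("wrote", out_path)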

status.py: Get the current status of a batch while you impatiently wait for jobs to complete. It prints the number of jobs in each state.

Give the script the path to a batch directory:

./_scripts/status.py _results/2019-07-13-17-13-09

Use the watch command to re-run the status check every 5 seconds:

watch -n5 ./_scripts/status.py _results/2019-07-13-17-13-09
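Conceptually, the status check reads the job IDs from jobs.txt and tallies the state of each one. The sketch below assumes a JSON endpoint of the form $BUILDBOT/jobs/<id> with a "state" field; the real server API may differ, so treat it purely as an illustration.

# Illustration only: tally job states for a batch. The /jobs/<id> JSON
# endpoint and the "state" field are assumptions about the server API.
import collections
import os
import sys

import requests

BUILDBOT = os.environ.get("BUILDBOT", "http://localhost:8000")
batch_dir = sys.argv[1]                               # e.g. _results/2019-07-13-17-13-09

with open(os.path.join(batch_dir, "jobs.txt")) as f:
    job_ids = [line.strip() for line in f if line.strip()]

states = collections.Counter()
for job_id in job_ids:
    info = requests.get(f"{BUILDBOT}/jobs/{job_id}",  # assumed endpoint
                        headers={"Accept": "application/json"}).json()
    states[info.get("state", "unknown")] += 1

for state, count in states.most_common():
    print(f"{state}: {count}")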

Contact

Please open an issue or email Rachit Nigam.
