The aim of this artifact is to demonstrate the claims made in the paper "The Semantics of Shared Memory in Intel CPU/FPGA Systems".
This manual is divided into two parts: a Getting Started Guide, which should take about 30 minutes to complete, and Step-by-Step Instructions, with which the claims made in the paper can be reproduced.
Ensure that docker is installed on your operating system. Furthermore, since the docker container is quite big, ensure that you have at least 10GB of available disk space. Instructions on how to install docker on a typical Ubuntu system can be found here.
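Before building, you may want to confirm that the 10GB of free space is actually available. A quick check with GNU coreutils (checking the current directory's filesystem is an assumption; adjust the path to wherever docker stores its data):

```shell
# Report the free space, in whole gigabytes, on the filesystem
# backing the current directory.
avail_gb=$(df -BG --output=avail . | tail -1 | tr -dc '0-9')
echo "available: ${avail_gb}G"
```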
Clone the main git repo using the following command
git clone https://github.com/diorga/harpy.git
From within the git repo, build the docker image.
cd harpy
docker build -t diorga/harpy -f artifact/Dockerfile .
The artifact contains the necessary data and scripts to build a docker container; however, this can be a lengthy process.
- The first alternative is to use the pre-built `harpy.tar` docker image we have provided in the archive. To load the docker image, please run:
docker load --input harpy.tar # Loading from the tar
- Another alternative to building the docker image is to pull it directly from the docker repo using the following command:
docker pull diorga/harpy # Pulling it from the docker repo
Next, start the docker container
docker run -it diorga/harpy bash
This should start a docker container where all the experiments can be run.
- Depending on how you installed docker, you might need to prefix every docker command with sudo.
- The time it takes to build the docker container depends on the internet speed. Generally, it should take less than 10 minutes.
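As a convenience (not part of the artifact), membership in the `docker` group is a reasonable heuristic for whether the sudo prefix is needed:

```shell
# Heuristic check: users in the "docker" group can usually run docker
# without sudo. This is a sketch, not an authoritative test.
if id -nG | grep -qw docker; then
    echo "in docker group: sudo likely not needed"
else
    echo "not in docker group: prefix docker commands with sudo"
fi
```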
To verify that everything has been configured properly, run the following commands.
cd backend
./generate.sh 4
python3 backend.py --cbmc --run_cbmc --max_traces 2 --unwind 7
This will use Alloy to generate all disallowed traces with 4 events. Afterwards, because of the value of the `--max_traces` parameter, CBMC will verify just two of them. Because the `--unwind` parameter has been set to a low value, this verification step should take just a few minutes. After this, a message stating that the traces could not be reproduced by the model will be displayed. The message will be similar to the following one:
If the number of parallel threads is too big, the system can run out of memory and report the following error:
SAT checker ran out of memory
In this case, the system will not report the correct result.
Our models are also validated on actual hardware. To gain access to the CPU/FPGA systems, an account is needed for the IL Academic Compute Environment; this hardware setup will be required further on to verify the final 2 claims. Please ensure that you create an account and are able to connect to it in the Kick-the-Tires phase. An account can be created by registering here: https://registration.intel-research.net/register
This will require an Intel Labs sponsor. Brent Forgeron has agreed to help us for the review process. You should fill in his name in the Intel Labs Sponsor for the Research Engagement section.
We cannot control how long it will take Intel to approve the account, but if they do not reply within 72 hours please ask the AEC chairs to contact the paper authors as we may be able to ping our Intel contacts directly.
When the account is created, you should be able to connect to it using the following command:
ssh <username>@ssh-iam.intel-research.net
source /export/fpga/bin/setup-fpga-env fpga-bdx-opae
- Claim 1: Our operational model, implemented in CBMC, is consistent with the traces provided in the Intel manual.
- Claim 2: Our axiomatic model, implemented in Alloy, is consistent with the traces provided in the Intel manual.
- Claim 3: Basic queue operations are validated against the operational model.
- Claim 4: We were able to generate 583 disallowed traces and 153 allowed traces from the axiomatic model.
- Claim 5: The Alloy-generated traces are consistent with the operational model.
- Claim 6: The Alloy-generated traces are consistent with the hardware.
- Claim 7: The correctly synchronised queue and the incorrectly synchronised queues behave as expected.
The first 5 claims do not require any hardware, while the last 2 claims require access to the IL Academic Compute Environment.
We verify that the traces described in the manual are consistent with our CBMC implementation of the operational model. We have written the traces as C code that our CBMC implementation can interpret. The manual traces can be found in the following files:
Each file presents the trace and points to the paragraph in the Intel manual where it is described. To verify all the traces against the operational model, run the following commands from the docker container:
cd harpy/backend
python3 backend.py --manual_traces
Two parameters can be changed: the number of traces verified concurrently and the maximum unwind depth. To change the number of concurrent verifications, specify the `--cores` command-line argument; to change the unwind depth, specify the `--unwind` command-line argument.
For example, the following command runs the same experiment but uses 8 cores (instead of the default 4) and uses an unwind depth of 30 (instead of the default 8).
python3 backend.py --manual_traces --cores 8 --unwind 30
This command will run all traces described in the manual and print a message indicating whether or not each trace was reproducible. This verification will take about two to three hours. All allowed traces should be reproducible, and all disallowed traces should not be.
We also verify that our Alloy implementation of the axiomatic model is consistent with the traces described in the manual and other manually generated traces. We have written the traces as code that our Alloy implementation can interpret. The manual traces can be found in the following file. Because we do not explicitly claim this in the paper and because proving this requires software to be installed outside the docker container, this claim can be skipped.
Unfortunately, this can only be done via the Alloy GUI. For this reason, Alloy needs to be compiled outside the docker. To do this, ANT and JDK need to be installed. On a typical Ubuntu system, this can be done via the following commands:
apt-get install openjdk-8-jdk
apt-get install ant
Afterwards, Alloy can be compiled by executing the following commands:
cd harpy
git submodule update --init --recursive
cd memalloy
make -C alloystar
To start the Alloy GUI:
cd harpy/memalloy/alloystar/dist
chmod +x alloy4.2.jar
./alloy4.2.jar
Afterwards, open the file containing all the Alloy traces and from the Execute menu, click on Execute All. This will validate the axiomatic model with all the manually written Alloy traces.
After running the traces, we can expect that Alloy found an instance of the allowed behaviours but did not find any disallowed behaviours. This will be indicated by the Alloy solver reporting that the solution was "as expected".
We verify that disallowed queue operations cannot occur according to our operational model. Therefore we verify the enqueue and dequeue steps against our CBMC implementation. The file that presents the traces for these operations can be found here
cd backend
python3 backend.py --queue_traces
CBMC should verify that the disallowed behaviour cannot be reproduced.
We can use the axiomatic model to generate a large number of disallowed executions. The challenge here is that Alloy generates a significant amount of duplicate traces. To address this, we can run our script to eliminate these duplicates. Table 2 shows the total number of traces generated for each event count. To reproduce this table, the following steps are required:
cd backend
./generate.sh <#events> # Use Alloy to generate all traces with #events
python3 backend.py # Remove duplicate traces
After running this tool for 6 events, a message similar to this one will be displayed:
93 traces read
38 unique disallowed traces
2 unique allowed traces
This message indicates that Alloy generated 93 disallowed traces but after removing duplicates, only 38 were left. From these 38 disallowed traces, by removing fences, only 2 allowed traces were generated.
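As a toy illustration of the deduplication idea (the real backend.py implements its own equivalence check; the event names below are invented), traces made up of the same events in a different emission order can be canonicalised and counted with standard tools:

```shell
# Sketch: one trace per line; two traces count as duplicates if they
# contain the same events, whatever order Alloy happened to emit them in.
printf 'W(x,1) R(x,1)\nR(x,1) W(x,1)\nW(x,1) R(x,0)\n' |
while read -r line; do
    # canonicalise: sort the events within each trace
    echo "$line" | tr ' ' '\n' | sort | tr '\n' ' '
    echo
done | sort -u | wc -l    # prints 2: the first two traces collapse into one
```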
The previously generated traces can be verified in CBMC. However, doing so will take several days. Therefore, we recommend only verifying a subset of the allowed traces and a subset of the disallowed traces.
Verifying the disallowed traces can be done using the following commands:
cd backend
./generate.sh <#events>
python3 backend.py --cbmc --run_cbmc --max_traces 4
The `--cbmc` argument will make the script convert the Alloy traces into CBMC code. The `--run_cbmc` parameter will make the script execute the CBMC code. The `--max_traces` parameter will limit the number of traces that will be verified.
Verifying the allowed traces can be done using the following commands:
cd backend
./generate.sh <#events>
python3 backend.py --cbmc --run_cbmc --fence -r --max_traces 4
The extra `--fence` argument will make the script remove fences, making the executions allowed. The `-r` argument will tell the script to expect the traces to be reproducible.
If you possess a machine with a large number of cores and a significant amount of RAM, you can run more experiments concurrently by setting the `--cores` parameter to a higher number. However, the RAM requirements are quite high.
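Before raising `--cores`, it can help to check how many cores and how much memory the machine actually has. A small sketch using standard Linux tools:

```shell
# Report available CPU cores and total memory (in GB) before
# choosing a value for --cores. Assumes a Linux system.
cores=$(nproc)
mem_gb=$(awk '/MemTotal/ {printf "%d", $2/1048576}' /proc/meminfo)
echo "cores=$cores mem_gb=$mem_gb"
```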
After all traces have been verified, the following message should appear
All traces could NOT be reproduced
or
All traces could be reproduced
depending on whether we were checking allowed executions or disallowed executions.
After an account has been created for the IL Academic Compute Environment, the user should be able to connect with the following commands:
ssh <username>@ssh-iam.intel-research.net # Connect to the server
source /export/fpga/bin/setup-fpga-env fpga-bdx-opae # Configure the environment variables so that a Broadwell Xeon CPU (E5-2600v4) with an integrated in-package Arria 10 is used
Afterwards, a local copy of the repo should be downloaded and the soft-core processor synthesized. This process will take about two hours but only needs to be done once.
cd ~
git clone https://github.com/OPAE/intel-fpga-bbb.git # Clone the Intel FPGA Basic Building Blocks
git clone https://github.com/diorga/harpy.git # Clone this repo
sed -i "s/diorga/<username>/g" harpy/backend/litmus_tests/base/hw/par/cci_mpf_test_base_PAR_files.qsf # Change the <username> to your username. This should update the location of the BBB_CCI_SRC variable
cd harpy/backend/litmus_tests/test/hw
afu_synth_setup -s rtl/sources.txt build_fpga # Configure a synthesis project
cd build_fpga
qsub-synth # This will start the synthesis process
tail -f build.log # The file will report the build status
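The `sed` step above simply rewrites every occurrence of `diorga` in the `.qsf` file. A throwaway demo of the same substitution (the file contents and username below are invented for illustration):

```shell
# Demonstrate the substitution on a scratch file; the variable and path
# shown are illustrative, not the real contents of the .qsf file.
printf 'set BBB_CCI_SRC /homes/diorga/intel-fpga-bbb\n' > /tmp/demo.qsf
sed -i 's/diorga/alice/g' /tmp/demo.qsf
cat /tmp/demo.qsf   # prints: set BBB_CCI_SRC /homes/alice/intel-fpga-bbb
```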
When the processor is synthesized, the following message should appear:
=======================================================
BDW 503 PR AFU compilation complete
AFU gbs file located at litmus_processor.gbs
Design meets timing
=======================================================
Unfortunately, it is a bit tricky to install Alloy on the Intel server without root access. It will probably be easier to generate the traces locally and transfer them to the server using scp. This can be done with the following commands within the docker container.
cd backend/
./generate.sh <#events>
scp -r traces <username>@ssh-iam.intel-research.net:/homes/<username>/harpy/backend
Depending on what ssh-key you used when registering to the IL Academic Compute Environment, you might need to first transfer the files from the docker to your local machine and then upload them to the server.
To do so, first get the docker container ID using
docker ps
Then copy the traces locally and upload them to the server
docker cp <containerId>:/usr/harpy/backend/traces .
scp -r traces <username>@ssh-iam.intel-research.net:/homes/<username>/harpy/backend
First connect to the Intel server and request an FPGA
ssh <username>@ssh-iam.intel-research.net # connect to the server
source /export/fpga/bin/setup-fpga-env fpga-bdx-opae # set up ENV variables
qsub-fpga # request an fpga
The python script we used in previous claims requires a few extra packages. Because we do not have root access, we used a python virtual environment. Run the actual hardware tests using the following commands:
cd harpy/backend
source myenv/bin/activate # Set up the virtual environment
python3 backend.py --hardware --run_hardware
deactivate # Disable the virtual environment
The `--hardware` command-line argument will tell the python script to generate litmus tests that will run on the CPU and FPGA. The `--run_hardware` argument will tell the script to actually run the tests.
If no litmus tests exhibited weak behaviour, the following message should appear.
All hardware litmus test behaved as expected.
We analyse the behaviour of the queue in Figure 13 and Figure 14. Before the queues can be tested, they first need to be synthesised. This process is similar to the one for synthesizing the soft-core processor.
After an account has been created for the IL Academic Compute Environment, the user should be able to connect with the following commands:
ssh <username>@ssh-iam.intel-research.net # Connect to the server
source /export/fpga/bin/setup-fpga-env fpga-bdx-opae # Configure the environment variables so that a Broadwell Xeon CPU (E5-2600v4) with an integrated in-package Arria 10 is used
Afterwards, the queue can be synthesized. This process will take about two hours but only needs to be done once.
sed -i "s/diorga/<username>/g" harpy/case_study/queue/base/hw/par/cci_mpf_test_base_PAR_files.qsf # Change the <username> to your username. This should update the location of the BBB_CCI_SRC variable
cd harpy/case_study/queue/FPGA_enqueue/hw
afu_synth_setup -s rtl/sources.txt build_fpga
cd build_fpga
qsub-synth # This will start the synthesis process
tail -f build.log # The file will report the build status
When the enqueue image is synthesized, the following message should appear:
=======================================================
BDW 503 PR AFU compilation complete
AFU gbs file located at FPGA_enqueue.gbs
Design meets timing
=======================================================
The exact same process needs to be done for the dequeue. Only the first command needs to be changed to the following:
cd harpy/case_study/queue/FPGA_dequeue/hw/
First, connect to the Intel server and request a FPGA.
ssh <username>@ssh-iam.intel-research.net
source /export/fpga/bin/setup-fpga-env fpga-bdx-opae
qsub-fpga # Request an FPGA
Load the enqueue image onto the FPGA:
cd harpy/case_study/queue/FPGA_enqueue/hw/build_fpga
fpgaconf FPGA_enqueue.gbs # Load the FPGA image
Next, compile the CPU-side software. This can be done quite straightforwardly using:
cd harpy/case_study/queue/FPGA_enqueue/sw
make
Run the CPU/FPGA code
./FPGA_enqueue -e 1000000
This will start running the CPU code, which will connect to the FPGA and set up communication with it. Afterwards, the FPGA will attempt to send 1000000 elements to the CPU using the producer-consumer queue described in section 7 of the paper. The number of elements sent can be changed by modifying the `-e` parameter.
This will run the experiment described in Figure 13a for 1000000 elements. The final execution time should be then displayed.
Since the queue is correctly synchronised, there should not be any wrong entries.
Run the CPU/FPGA code
./FPGA_enqueue --wr_rsp_enqueue 0 --wr_rsp_write_tail 0 -e 1000000
This will run the experiment and omit both response synchronisation elements. This will most likely run faster than the correctly synchronised queue. However, the number of wrong entries will be quite low. To get more wrong entries, similar to Figure 14a, we need to add more noise to the experiment. This can be done with the following command:
./FPGA_enqueue --wr_rsp_enqueue 0 --wr_rsp_write_tail 0 -e 1000000 --VL0_enemy 10
The `--VL0_enemy 10` argument will tell the FPGA to add extra traffic on the VL0 channel every 10 clock cycles. Alternatively, noise can also be added to other channels, such as VH0 and VH1.
This will run the experiment described in Figure 14a for 1000000 elements. The final execution time and incorrectly received elements should be then displayed.
In our paper, we provided an upper bound on the number of incorrectly received elements. It can therefore be expected that the actual experiments will yield a low number.
- Adding more noise to the experiments will significantly increase the execution time of the experiment
- Adding too much noise to the experiments makes the device prone to becoming unresponsive. We have reached out to Intel to make them aware of the problem.