cdl-saarland / CGO17_ArtifactEvaluation

Artifact evaluation repository for the CGO17 paper: Optimistic Loop Optimization


running your experimental workflow and trying to check results

gfursin opened this issue

Thanks a lot for sharing your interesting artifacts (it's an impressive amount of work and the automated workflow is handy). I am interested in LLVM optimizations, so I decided to compile and run your code on my server with Ubuntu 16.04. The Docker image worked fine, but I am also interested in development, so I also tried the "native installation".

First suggestion: it would be useful to list all required dependencies as an apt-get command, e.g.
$ sudo apt-get install virtualenv byacc flex
Otherwise, I have to figure out by myself whether they are installed and how to install them ...

I didn't use SPEC2000 (I don't have it), but I do have SPEC2006, so I added it to the workflow.

After re-running scripts/artifact_eval.py several times and fixing dependencies, I can configure experiments (using default values), and it seems that the workflow is executed, i.e. I get lots of statistics at some point. Also, I didn't realize that I needed to press 'Space' or Ctrl-D to continue execution at some point (I guess you use "more"), so I waited for half an hour ;) ... Some remarks about that would be useful ...

I attached the log and a tarred directory with all results from my machine (I had to put it on the website since it is too big to attach here; it is 1.5 GB when untarred - careful):

The major problem now is that there is really too much output and it does not match the README ("Data collection and interpretation"). I kind of figured out what some of the values mean (by matching them with your paper and parts of the README), but it is really not straightforward. Hence, I would like to ask you for help with the interpretation - would you mind checking that the workflow is correct and helping me explain the attached results, i.e. confirming that they are as expected?

In fact, what would really help reviewers is to simply include a reference output from your machine so that we could make a direct comparison. Even better, you could add result parsing to your workflow script and an on-the-fly comparison with the pre-recorded results from your machine. At the end, the script could simply report that the evaluation passed successfully or list any differences. Do you think that is possible? For example, see how it is done in this CGO'17 artifact:

Once again, thanks a lot for preparing and sharing this artifact - looks really interesting!

Thanks a lot for sharing your interesting artifacts (it's an impressive amount of work and the automated workflow is handy). I am interested in LLVM optimizations, so I decided to compile and run your code on my server with Ubuntu 16.04. The Docker image worked fine, but I am also interested in development, so I also tried the "native installation".

Thank you for trying out the artifact but even more for reporting back to us about your experience!

First suggestion: it would be useful to list all required dependencies as an apt-get command, e.g.
$ sudo apt-get install virtualenv byacc flex
Otherwise, I have to figure out by myself whether they are installed and how to install them ...

We added a paragraph to the "Software Requirements" section which lists all the packages we use to initialize the otherwise empty Docker container (Ubuntu-based). Thank you for the idea!

I didn't use SPEC2000 (I don't have it), but I do have SPEC2006, so I added it to the workflow.

That should be fine; SPEC2000 and SPEC2006 are separate, optional dependencies.

I can configure experiments (using default values), and it seems that the workflow is executed, i.e. I get lots of statistics at some point.

That sounds like everything works out just fine. Statistics are the major part of the evaluation.

Also, I didn't realize that I needed to press 'Space' or Ctrl-D to continue execution at some point (I guess you use "more"), so I waited for half an hour ;) ... Some remarks about that would be useful

In the native setup, a text editor is spawned (in the background) to display each summary file to the user; it should never block the output, though. The following programs are tried (in the given order) to open the summary.txt files:
xdg-open, gedit, pluma, kate, mousepad, leafpad, gvim, emacs.
It is possible that xdg-open falls back to less (or more) to "open" the text files; in that case we might have to remove xdg-open from the list.
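
For reference, the selection logic is roughly the following (a simplified Python sketch, not the exact code from scripts/artifact_eval.py; the helper name is made up):

#!/usr/bin/env python3
# Simplified sketch: open a summary file with the first available viewer,
# in the background, so the evaluation itself is never blocked.
import shutil
import subprocess

VIEWERS = ["xdg-open", "gedit", "pluma", "kate",
           "mousepad", "leafpad", "gvim", "emacs"]

def show_summary(path):
    for viewer in VIEWERS:
        if shutil.which(viewer):
            # Popen (instead of run/call) keeps the script from waiting for
            # the viewer process to exit.
            subprocess.Popen([viewer, path],
                             stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL)
            return True
    return False  # no viewer found; the caller may simply print the file

# e.g. show_summary("results/<eval-date>/NPB.polly_stats_summary.txt")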

I attached the log and a tarred directory with all results from my machine (I had to put it on the website since it is too big to attach here; it is 1.5 GB when untarred - careful):

First, you should update your scripts (pull this git repository), as it seems you used an older version. Second, the workflow-log.txt looks good, though there seem to be some artifacts that disturb the actual output (possibly related to the viewer issue above). For comparison, we added the expected output to the repository (resources/sample/evaluation_run_log.txt) as well as the expected summaries for the default evaluation pipeline (resources/sample/{NPB,test_suite,SPEC2000,SPEC2006}.polly_stats_summary.txt).

The major problem now is that there is really too much output and it does not match the README ("Data collection and interpretation"). I kind of figured out what some of the values mean (by matching them with your paper and parts of the README), but it is really not straightforward.

The scripts are very verbose, and the data we report does not always correspond directly to something Polly can report as-is. As a result, the evaluation process not only reports all the raw data (and the way it was collected) but also how it is processed and which intermediate results were obtained. We believe this is the only way to make the statistics we present reproducible, even though it may be a bit confusing at first.

As a starting point it is best to look at the final result of the evaluation first, i.e., the first part of each summary file ({NPB,test_suite,SPEC2000,SPEC2006}.polly_stats_summary.txt in the results/<eval-date> folder). The tables in Figures 15 and 16 are "rebuilt" in each of these summary files for the particular benchmark suite. While one evaluation run cannot generate all numbers in the tables, the default configuration will produce almost all of them. Matching these tables to the ones in the paper should be the first step. Afterwards one can check the rest of the summary file, which describes how the collected statistics lead to these numbers. Finally, one can follow the "grep" calls and the options used in the evaluation to trace the data even further back to the log files of the test suite.
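
If you only want to look at those first tables without opening each file manually, a small helper along these lines should do (a hedged Python sketch, not part of the artifact scripts; the 40-line cutoff is a guess and may need adjusting):

#!/usr/bin/env python3
# Print the beginning (first table) of every summary file of all evaluation
# runs so it can be compared against Figures 15 and 16 of the paper.
import glob

for summary in sorted(glob.glob("results/*/*.polly_stats_summary.txt")):
    print("===", summary, "===")
    with open(summary) as f:
        for number, line in enumerate(f):
            if number >= 40:  # assumed cutoff for the first table
                break
            print(line.rstrip())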

In fact, what would really help reviewers is to simply include a reference output from your machine so that we could make a direct comparison.

As mentioned above, the reference output is now included in the resources/sample/ folder.

Even better, you could add result parsing to your workflow script and an on-the-fly comparison with the pre-recorded results from your machine. At the end, the script could simply report that the evaluation passed successfully or list any differences.

It is not clear how to do this for anything that does not exactly match our setup. There can be differences in the paths, in the actually tested LLVM/Clang/Polly/TestSuite/SPEC versions, and finally in the evaluation parameters, all of which would cause different results without the evaluation "being unsuccessful". With the sample outputs we added, one can now at least compare the default configuration against the expected one.
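
For the default configuration, such a comparison can at least be sketched manually, e.g., along these lines (a hedged Python sketch, not part of the artifact scripts; the paths and the plain line-diff "match" criterion are assumptions):

#!/usr/bin/env python3
# Compare the summaries of one evaluation run against the shipped reference
# summaries. Different LLVM/Clang/Polly/TestSuite/SPEC versions or evaluation
# parameters will legitimately change the numbers, so treat a diff as a hint,
# not as pass/fail.
import difflib
import os

REFERENCE_DIR = "resources/sample"
RESULTS_DIR = "results/<eval-date>"  # adjust to the concrete run directory

def read_lines(path):
    with open(path) as f:
        return f.readlines()

for suite in ("NPB", "test_suite", "SPEC2000", "SPEC2006"):
    name = suite + ".polly_stats_summary.txt"
    ref = os.path.join(REFERENCE_DIR, name)
    new = os.path.join(RESULTS_DIR, name)
    if not (os.path.exists(ref) and os.path.exists(new)):
        print("%-10s skipped (file missing)" % suite)
        continue
    diff = list(difflib.unified_diff(read_lines(ref), read_lines(new),
                                     fromfile=ref, tofile=new))
    if diff:
        print("%-10s differs from the reference (%d diff lines)" % (suite, len(diff)))
    else:
        print("%-10s matches the reference" % suite)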

Thank you again for all the input. It has already led to several improvements of the artifact!

Thanks a lot for improving this artifact and providing reference outputs. Also, thank you for providing the list of required Ubuntu packages - I installed all of them in one go and did not run into any trouble later! I also updated my copy of the repository and ran the whole workflow again. In the following archive you can find the results and the whole workflow log:

The results now match most of your reference outputs (from resources/sample). I only had some minor discrepancies on SPEC2006, but that is likely because my SPEC2006 version is not exactly the same as yours. Thanks again for your response and have a good week!

By the way, one more general thought about dealing with verbose output - it may be useful to provide some sort of post-processing parser for your output (maybe in Python) that detects the important values (SCoP info) and records them in JSON format. Then anyone could easily process them with other automation tools, archive them, or compare results. For example, I could then easily connect your scripts with the Collective Knowledge workflow automation framework and do further processing/visualization of the results using modules shared by the community ... Have a good week!
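
To illustrate what I mean, here is a minimal sketch of such a parser (the "<label>: <number>" line format is just an assumption and would have to be adapted to the real layout of the *.polly_stats_summary.txt files):

#!/usr/bin/env python3
# Illustration only: pull "<label>: <number>" lines out of a summary file and
# dump them as JSON. The assumed line format is hypothetical.
import json
import re
import sys

PATTERN = re.compile(r"^\s*(?P<key>[A-Za-z][^:]*?)\s*:\s*(?P<value>-?\d+)\s*$")

def parse_summary(path):
    stats = {}
    with open(path) as f:
        for line in f:
            match = PATTERN.match(line)
            if match:
                stats[match.group("key")] = int(match.group("value"))
    return stats

if __name__ == "__main__":
    # Usage: python3 parse_summary.py results/<eval-date>/NPB.polly_stats_summary.txt
    print(json.dumps(parse_summary(sys.argv[1]), indent=2))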