Spot

Spot identifies the processes in a pipeline that produce different results in different execution conditions.

Installation
Pre-requisites
First example
HCP example
How to Contribute
License

Installation

Install the package with pip:

$ pip install spottool

Pre-requisites

Install and start Docker
Build Docker images for the pipelines in different conditions (see Dockerfile as an example of PreFreeSurfer pipeline in CentOS7)
Create Boutiques descriptors for the pipeline, in each condition (see descriptor.json and invocation.json as an example of PreFreeSurfer pipeline)
Collect provenance information in one condition with the ReproZip tool by running: reprozip trace <CMD>
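
As a quick sanity check of the prerequisites above, the following minimal sketch reports which command-line tools are missing from PATH. The tool list (docker, reprozip, auto_spot) and the helper name missing_tools are assumptions made for illustration; they are not part of spot.

```python
import shutil

# Executables implied by the prerequisites; this list is an assumption.
REQUIRED_TOOLS = ("docker", "reprozip", "auto_spot")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]
```

Calling missing_tools() returns an empty list when all three executables are found. It only checks that the executables exist; it does not verify that the Docker daemon is running.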

The auto_spot command finds processes that create differences in results obtained in different conditions and reports them in a JSON file.

First example

In this example, we run a bash script that calls the grep command multiple times, creating different output files when run on different OSes. We use spot to compare the outputs obtained in CentOS 7 and Debian 10.

The example can be run in this Git repository as follows:

git clone https://github.com/big-data-lab-team/spot.git
cd spot
pip install .

docker build . -f spot/example/centos7/Dockerfile -t spot_centos_latest
docker build . -f spot/example/debian/Dockerfile -t spot_debian_latest

cd spot/example 

auto_spot -d descriptor_centos7.json -i invocation_centos7.json -d2 descriptor_debian10.json -i2 invocation_debian10.json -s trace_test.sqlite3 -c conditions.txt -e exclude_items.txt -o commands.json .

In this command:

descriptor_<distro>.json is the Boutiques descriptor of the application executed in OS <distro>.
invocation_<distro>.json is the Boutiques invocation of the application executed in OS <distro>, containing the input files.
trace_test.sqlite3 is a ReproZip trace of the application, acquired in CentOS 7.
conditions.txt contains the result folder for each condition.
exclude_items.txt contains the list of items to be ignored while parsing the files and directories.
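
When the same comparison has to be scripted for several pairs of conditions, it can help to assemble the argument list in Python. The sketch below uses only the flags that appear in the command above. The helper name build_auto_spot_cmd and its parameter names are hypothetical, and the meaning of the trailing "." argument is not documented here, so it is passed through unchanged.

```python
import shlex


def build_auto_spot_cmd(descriptor, invocation, descriptor2, invocation2,
                        trace, conditions, exclude, output, path="."):
    """Assemble an auto_spot argument list from the flags shown in this README."""
    return [
        "auto_spot",
        "-d", descriptor, "-i", invocation,      # first condition
        "-d2", descriptor2, "-i2", invocation2,  # second condition
        "-s", trace,                             # ReproZip trace
        "-c", conditions,                        # result folders per condition
        "-e", exclude,                           # items to ignore
        "-o", output,                            # JSON report
        path,                                    # trailing positional argument, kept as in the README
    ]


cmd = build_auto_spot_cmd(
    "descriptor_centos7.json", "invocation_centos7.json",
    "descriptor_debian10.json", "invocation_debian10.json",
    "trace_test.sqlite3", "conditions.txt", "exclude_items.txt", "commands.json",
)
print(shlex.join(cmd))  # printable form of the command (Python 3.8+)
```

The resulting list can be passed to subprocess.run(cmd, check=True) from the spot/example directory.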

The command produces the following outputs:

commands_captured_c.json contains the list of processes with temporary files and files written by multiple processes.
commands.json contains the list of processes that create differences between the two conditions. The attribute total_commands_multi lists the processes that write files also written by other processes, and total_commands lists the remaining processes.
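
To get a quick overview of commands.json, the two attributes can be counted with a few lines of Python. This is a minimal sketch: it assumes the file is a JSON object with total_commands and total_commands_multi at the top level, and it makes no assumption about whether their values are lists or objects. The function names are hypothetical.

```python
import json


def summarize_commands(report):
    """Return the number of entries under each of the two attributes."""
    summary = {}
    for key in ("total_commands", "total_commands_multi"):
        value = report.get(key, [])
        # The value type is not documented here; count lists and dicts alike.
        summary[key] = len(value) if isinstance(value, (list, dict)) else 1
    return summary


def summarize_file(path):
    """Load a JSON report from `path` and summarize it."""
    with open(path) as f:
        return summarize_commands(json.load(f))
```

For example, summarize_file("commands.json") returns a dictionary with one count per attribute.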

HCP example

In this example, we run a short PreFreeSurfer pipeline that includes only the ACPC-Alignment step to process only the T1w-image of one subject. The results will show the FLIRT tool as the non-reproducible process in the pipeline when running on different versions of CentOS. We use spot to compare the outputs obtained in CentOS 7 and CentOS 6.

This example takes about 12 minutes to run and needs about 500 MB of disk space in total. Before running it, make sure git-lfs is installed on your operating system (see the git-lfs installation instructions).

The example can be run in this Git repository as follows:

git lfs install
git clone https://github.com/big-data-lab-team/spot.git
cd spot
pip install .

docker build . -f spot/pfs-example/centos7/Dockerfile -t short-pfs-spot-centos7
docker build . -f spot/pfs-example/centos6/Dockerfile -t short-pfs-spot-centos6

cd spot/pfs-example 

auto_spot -d descriptor_centos7.json -i invocation_centos7.json -d2 descriptor_centos6.json -i2 invocation_centos6.json -s trace.sqlite3 -c conditions.txt -e exclude_items.txt -o commands.json .

We can also swap the order of the two conditions and then merge the processes identified in both orders by running:

auto_spot -d2 descriptor_centos7.json -i2 invocation_centos7.json -d descriptor_centos6.json -i invocation_centos6.json -s trace.sqlite3 -c conditions2.txt -e exclude_items.txt -o commands2.json .

python ../merge_jsons.py commands.json commands2.json merged.json

The command produces the following output:

merged.json contains the list of processes that create differences in each order of executions.

Expected output

The merged.json file should be similar to merged_reference.json. In this file, the flirt process is listed under the attribute total_commands as a process that creates different result files, roi2std.mat and acpc_final.nii.gz.
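
One way to check this without reading the file by hand is to search the JSON for the process name. The sketch below walks any JSON structure and yields the key path to every key or string value that contains a given substring. It does not assume a particular layout for merged.json beyond it being valid JSON, and find_paths is a hypothetical helper, not part of spot.

```python
def find_paths(node, needle, path=()):
    """Yield key paths to every dict key or string value containing `needle`."""
    if isinstance(node, dict):
        for key, value in node.items():
            if needle in str(key):
                yield path + (key,)
            yield from find_paths(value, needle, path + (key,))
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_paths(value, needle, path + (index,))
    elif isinstance(node, str) and needle in node:
        yield path
```

After loading merged.json with json.load, the expression any(p[0] == "total_commands" for p in find_paths(merged, "flirt")) is true when flirt appears somewhere under total_commands.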

How to Contribute

Clone the repo and create a new branch: $ git clone https://github.com/big-data-lab-team/spot && cd spot && git checkout -b name_for_new_branch
Make changes and test them
Submit a pull request with a comprehensive description of the changes
