This tool calculates the pairwise distances between a set of phylogenetic trees based on the metrics found in the TreeDist R package as well as the standard Robinson-Foulds (RF) metric. The implemented TreeDist metrics are Mutual Clustering Information (MCI), Shared Phylogenetic Information (SPI) and Matching Split Information (MSI).
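For orientation, the standard Robinson-Foulds distance counts the splits that occur in exactly one of two trees. The following is a minimal sketch of that definition, not the code used in this tool. The `Split` bitmask encoding and the function name `robinson_foulds` are assumptions made for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical encoding: one bit per taxon, at most 64 taxa.
// Splits are assumed to be canonicalized (the same side of the
// bipartition is always the one that is encoded), so that equal
// splits compare equal.
using Split = std::uint64_t;

// Size of the symmetric difference of the two split sets.
std::size_t robinson_foulds(const std::set<Split>& a, const std::set<Split>& b) {
    std::size_t shared = 0;
    for (Split s : a) {
        shared += b.count(s);  // count() is 0 or 1 for std::set
    }
    return a.size() + b.size() - 2 * shared;
}
```

Two trees with identical split sets get a distance of 0. The TreeDist metrics differ from this in that they score pairs of non-identical splits as well, which is why a matching step is needed.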
The software calculates the pairwise distances in 3 steps:
- Making splits unique: Many distinct trees contain the same split. To reduce calculation time, the trees are scanned and mapped to an ordered list of unique splits.
- Precalculating split scores: Each pair of unique splits is evaluated on the underlying metric (MSI, SPI, MCI) and stored in a global table, which is calculated once. This reduces the number of necessary calculations by up to 40%, depending on the instance. Instances with many very similar trees benefit most, as their number of unique splits is relatively small.
- Matching: The matchings between trees are calculated in parallel using OR-Tools. Since the split scores of the metric are immutable and already precalculated, the parallelization is straightforward. The matching step accounts for more than 70% of the runtime even when run in parallel and provides ample opportunities for future optimization.
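The first two steps can be sketched as follows. This is an illustration under stated assumptions, not the tool's actual implementation: the bitmask `Split` type, the names `make_unique` and `score_table`, and in particular `placeholder_score` are hypothetical. The placeholder only stands in for the real MSI, SPI or MCI evaluation.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <map>
#include <utility>
#include <vector>

// Hypothetical encoding: one bit per taxon, at most 64 taxa.
using Split = std::uint64_t;
using Tree = std::vector<Split>;

struct UniqueSplits {
    std::vector<Split> splits;                       // ordered list of unique splits
    std::vector<std::vector<std::size_t>> tree_ids;  // each tree as indices into `splits`
};

// Step 1: scan all trees and map every split to its index in an ordered list.
UniqueSplits make_unique(const std::vector<Tree>& trees) {
    std::map<Split, std::size_t> index;  // std::map keeps the keys sorted
    for (const Tree& t : trees) {
        for (Split s : t) index.emplace(s, 0);
    }
    UniqueSplits out;
    out.splits.reserve(index.size());
    for (auto& [split, id] : index) {
        id = out.splits.size();
        out.splits.push_back(split);
    }
    out.tree_ids.reserve(trees.size());
    for (const Tree& t : trees) {
        std::vector<std::size_t> ids;
        ids.reserve(t.size());
        for (Split s : t) ids.push_back(index.at(s));
        out.tree_ids.push_back(std::move(ids));
    }
    return out;
}

// Placeholder only: counts the taxa two encoded sides have in common.
// The real tool would evaluate MSI, SPI or MCI here.
double placeholder_score(Split a, Split b) {
    return static_cast<double>(std::bitset<64>(a & b).count());
}

// Step 2: evaluate every pair of unique splits once and store the result.
std::vector<std::vector<double>> score_table(const std::vector<Split>& splits) {
    const std::size_t n = splits.size();
    std::vector<std::vector<double>> table(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i; j < n; ++j) {
            table[i][j] = table[j][i] = placeholder_score(splits[i], splits[j]);
        }
    }
    return table;
}
```

Because the table is only read after it has been built, each tree pair in the matching step can look up `table[i][j]` for its split indices from any thread without synchronization.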
The following software is required to run the tool:
- A C++17-ready compiler such as g++ > 6.0 or clang > 5.0
- Google OR-Tools
- cmake > 3.10
To install, run: make full && cd build && make
To build without tests, run: make && cd build && make
The binary file will be located in the folder bin/.
- (mandatory) -i path_to_file specifies a path to a file with phylogenetic trees in the Newick format
- (optional) -o path_to_file specifies an output path. Two files will be written: an output file and an info file.
- (mandatory) -m (MSI/SPI/MCI/RF) specifies the metric for evaluation
- (optional) -n (A/R/S) specifies the normalization method: absolute (A, default), relative (R) or similarity (S).
To run an example, copy and paste one of the following commands in the bin/ folder.
Without output files:
./rfdist -i ../test/res/data/heads/24 -m MSI
With output files:
./rfdist -i ../test/res/data/heads/24 -m MCI -o ../foo/
We used Softwipe for code quality assessment.
Criteria | Score |
---|---|
Compiler + Sanitizer Score | 10.0/10 |
Assertion Score | 10.0/10 |
Clang-tidy Score | 10.0/10 |
Cppcheck Score | 9.7/10 |
Cyclomatic Complexity Score | 9.1/10 |
Unique Score | 0.0/10 |
KWStyle Score | 10.0/10 |
TestCount Score | 10.0/10 |
Overall score | 8.8/10 |
The Unique Score of 0.0 appears to be caused by a bug in the Unique Code calculation of the Softwipe version used.
The experiments were performed on Ubuntu 20.04 with an AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx @ 2.0 GHz (L1 128 KiB, L2 2 MiB, L3 4 MiB) and 8 GB RAM. The software was compiled following the installation guide using g++ 10.3. The TreeDist R package was installed via the R installer. The dataset can be found here.
The experiments were run on the first 10 and the first 100 trees of the dataset for each of the three new metrics.