computations / bioinformatics

Pairwise RF-Distances

This tool calculates the pairwise distances between a set of phylogenetic trees using the metrics from the TreeDist R package as well as the standard Robinson-Foulds (RF) metric. The implemented TreeDist metrics are Mutual Clustering Information (MCI), Shared Phylogenetic Information (SPI) and Matching Split Information (MSI).
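For orientation, the standard Robinson-Foulds distance counts the splits that occur in exactly one of the two trees. The following is a minimal C++ sketch of that definition, not the tool's actual code. The bitmask encoding of splits and the names Split, SplitSet, canonical and rf_distance are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical encoding: a split (bipartition of the taxa) is a bitmask in
// which bit i marks the side that taxon i belongs to. Supports up to 64 taxa.
using Split = std::uint64_t;
using SplitSet = std::set<Split>;

// Both sides of a bipartition describe the same split, so pick one canonical
// representative: the side that does NOT contain taxon 0.
Split canonical(Split s, unsigned n_taxa) {
    const Split all = (n_taxa >= 64) ? ~Split{0} : ((Split{1} << n_taxa) - 1);
    return (s & 1u) ? (~s & all) : s;
}

// Robinson-Foulds distance: size of the symmetric difference of the two
// split sets (splits present in exactly one of the trees).
std::size_t rf_distance(const SplitSet& a, const SplitSet& b) {
    std::size_t shared = 0;
    for (Split s : a) {
        shared += b.count(s);
    }
    return (a.size() - shared) + (b.size() - shared);
}
```

As a usage example under this encoding, take five taxa A to E mapped to bits 0 to 4. The tree ((A,B),C,(D,E)) has the nontrivial splits AB|CDE and DE|ABC, and ((A,C),B,(D,E)) has AC|BDE and DE|ABC. The two trees share one split and each has one split of its own, so rf_distance returns 2.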

Implementation Gist

The software calculates the pairwise distances in 3 steps:

  1. Making splits unique: Many distinct trees contain the same split. To reduce calculation time, the trees are scanned and mapped to an ordered list of unique splits.

  2. Each pair of unique splits is evaluated with the chosen metric (MSI, SPI or MCI) and the result is stored in a global table that is calculated only once. This reduces the number of necessary calculations by up to 40%, depending on the instance. Instances with a large number of very similar trees benefit most, because their number of unique splits is relatively small.

  3. The matchings between trees are calculated in parallel using OR-Tools. Since the split scores of the metric are immutable and already precalculated, the matching step can be parallelized without synchronization issues. The matching step accounts for more than 70% of the runtime even in parallel and provides ample opportunities for future optimization.
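The three steps above can be sketched roughly as follows. This is an illustrative C++ outline under stated assumptions, not the project's implementation: splits are encoded as hypothetical 64-bit masks, toy_score is a placeholder for the real MSI/SPI/MCI split scores, and the matching is found by brute force over permutations instead of OR-Tools (which is only feasible for very small trees). All names are invented for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Split = std::uint64_t;  // hypothetical bitmask encoding of a split
using Table = std::vector<std::vector<double>>;

struct UniqueSplits {
    std::vector<Split> splits;                    // sorted, duplicates removed
    std::vector<std::vector<std::size_t>> trees;  // per tree: indices into splits
};

// Step 1: map every tree to indices into one ordered list of unique splits.
UniqueSplits make_unique(const std::vector<std::vector<Split>>& trees) {
    UniqueSplits out;
    for (const auto& t : trees) {
        out.splits.insert(out.splits.end(), t.begin(), t.end());
    }
    std::sort(out.splits.begin(), out.splits.end());
    out.splits.erase(std::unique(out.splits.begin(), out.splits.end()),
                     out.splits.end());
    for (const auto& t : trees) {
        std::vector<std::size_t> idx;
        for (Split s : t) {
            auto it = std::lower_bound(out.splits.begin(), out.splits.end(), s);
            idx.push_back(static_cast<std::size_t>(it - out.splits.begin()));
        }
        out.trees.push_back(std::move(idx));
    }
    return out;
}

// Placeholder score (NOT a TreeDist metric): number of taxa the two masks share.
double toy_score(Split a, Split b) {
    int n = 0;
    for (Split x = a & b; x != 0; x &= x - 1) {
        ++n;
    }
    return n;
}

// Step 2: score each pair of unique splits once. The placeholder score is
// symmetric, so only the upper triangle is computed and then mirrored.
Table score_table(const std::vector<Split>& splits) {
    const std::size_t n = splits.size();
    Table table(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = i; j < n; ++j) {
            table[i][j] = table[j][i] = toy_score(splits[i], splits[j]);
        }
    }
    return table;
}

// Step 3 stand-in: best one-to-one matching between the splits of two trees,
// found by trying every permutation. Assumes a.size() == b.size(). It only
// reads the table, so different tree pairs could be processed in parallel.
double best_matching_score(const std::vector<std::size_t>& a,
                           std::vector<std::size_t> b, const Table& table) {
    std::sort(b.begin(), b.end());
    double best = 0.0;
    do {
        double total = 0.0;
        for (std::size_t k = 0; k < a.size(); ++k) {
            total += table[a[k]][b[k]];
        }
        best = std::max(best, total);
    } while (std::next_permutation(b.begin(), b.end()));
    return best;
}
```

Because each tree is reduced to a list of indices, the matching step never recomputes a split score. It only looks values up in the shared, read-only table, which is what makes the per-pair work independent.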

Installation

Requirements

The following software is required to build and run the tool:

  • A C++17-ready compiler such as g++ > 6.0 or clang > 5.0
  • Google OR-Tools
  • CMake > 3.10

To build with tests, run make full && cd build && make

To build without tests, run make && cd build && make

The binary will be located in the bin/ folder.

Command Line Parameters

  • (mandatory) -i path_to_file specifies a path to a file with phylogenetic trees in the Newick format
  • (optional) -o path_to_file specifies an output path. Two files will be written: an output file and an info file.
  • (mandatory) -m (MSI/SPI/MCI/RF) specifies the metric for evaluation
  • (optional) -n (A/R/S) specifies the normalization method: absolute (A, the default), relative (R) or similarity (S).

Example Calls

To try an example call, run one of the following commands from the bin/ folder.

Without output files:

./rfdist -i ../test/res/data/heads/24 -m MSI

With output files:

./rfdist -i ../test/res/data/heads/24 -m MCI -o ../foo/

Code Quality

We used Softwipe for code quality assessment.

| Criterion | Score |
| --- | --- |
| Compiler + Sanitizer | 10.0/10 |
| Assertion | 10.0/10 |
| Clang-tidy | 10.0/10 |
| Cppcheck | 9.7/10 |
| Cyclomatic Complexity | 9.1/10 |
| Unique | 0.0/10 |
| KWStyle | 10.0/10 |
| TestCount | 10.0/10 |
| Overall | 8.8/10 |

The version of Softwipe we used appears to have a bug in its Unique score calculation, hence the score of 0.0 for that criterion.

Experimental Results

The experiments were performed on Ubuntu 20.04 with an AMD Ryzen 5 2500U with Radeon Vega Mobile Gfx @ 2.0 GHz (L1 128 KiB, L2 2 MiB, L3 4 MiB) and 8 GB RAM. The software was compiled with g++ 10.3 following the installation instructions above. The TreeDist R package was installed via the R installer. The dataset can be found here.

The experiments were run on the first 10 and the first 100 trees of the dataset for each of the three new metrics.
