meringlab / og_consistency_pipeline

Consistency pipeline for hierarchies of orthologous groups (OGs) based on subsampling and tree reconciliation

Consistency pipeline for hierarchies of orthologous groups

This repository contains the Python implementation of the methodology described in:

Heller, D., Szklarczyk, D. and von Mering, C.: Tree reconciliation combined with subsampling improves large scale inference of orthologous group hierarchies (2018) manuscript in preparation

A preprint of the article can be found on bioRxiv at https://doi.org/10.1101/417840


Content

The current version of the pipeline (v0.4) is a Snakemake workflow written in Python 3. It relies on the Python tree library ETE (etetoolkit) and the progress-bar library tqdm. By default, the following software is used to compute gene trees and reconcile them with species trees:

  • MAFFT for multiple sequence alignment
  • FastTree for tree prediction
  • NOTUNG for tree reconciliation

The binaries of the three tools are downloaded automatically using the snakemake rules specified in rules/tools.smk.

Input files are specified through the configuration file config.yaml, with parameters explained therein. As a small example we provide a dataset from the eggNOG database in the release section under data.tar.gz.

The software has been developed and tested on Linux (Ubuntu 12.04, 16.04 and 18.04). Other Unix systems may work as well, but the tool binaries will have to be adapted accordingly.

NOTE: If you cloned the repository before 13.11.2018, please make a fresh copy. We applied BFG Repo-Cleaner to remove the example data from the repository history (the data is now found in the release section).

Installation

The easiest way to use the pipeline is to create a Python 3 environment with the Anaconda/Miniconda distribution (see the Anaconda/Miniconda installation instructions). Assuming that the distribution has been installed, the following commands create a new environment and install all required dependencies:

# create a new environment named "smk"
conda create -n smk python=3.6
# activate the environment
source activate smk
# install the dependencies (snakemake, ete3, tqdm)
conda install -c bioconda -c conda-forge snakemake
conda install -c etetoolkit ete3 ete_toolchain 
conda install -c conda-forge tqdm

Alternatively, the dependencies can be installed natively with pip or compiled from source by following the guides in their respective documentation.
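Whichever installation route is used, it can help to confirm that the three Python dependencies are visible in the active environment before starting the workflow. The following is a minimal sketch and not part of the pipeline; the helper name missing_modules is made up for this illustration, and it only checks that the modules can be located, not that their versions are suitable.

```python
import importlib.util

# Import names of the Python dependencies listed in the install commands above.
REQUIRED = ("snakemake", "ete3", "tqdm")


def missing_modules(names=REQUIRED):
    """Return the module names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

Run it with the python interpreter of the "smk" environment so that the check reflects the environment Snakemake will use.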

Example execution

The configuration file config.yaml is predefined with the input parameters for the small example included in data.tar.gz. The archive contains information regarding the Primates level of eggNOG and its two sublevels, Hominidae and Cercopithecoidea:

                             /-314294[prNOG-1][superfamily:Cercopithecoidea]
-9443[prNOG][order:Primates]--
                             \-9604[homNOG][family:Hominidae]

For the 15 member species of the Primates level (see data/9443.primates.species.tsv), the data directory includes FASTA sequences (in data/fastafiles) and orthologous group mappings (in data/orthologous_groups) as well as the clades (in data/clades).
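After extracting the archive, a quick sanity check is to count the sequences in each file under data/fastafiles. The sketch below is an illustration only: it assumes the files are uncompressed plain-text FASTA and simply counts header lines; the function names are hypothetical and nothing here is taken from the pipeline code.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count header lines (lines starting with '>') in one FASTA file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def summarize_fasta_dir(directory):
    """Map each regular file name in `directory` to its number of FASTA records."""
    return {
        p.name: count_fasta_records(p)
        for p in sorted(Path(directory).iterdir())
        if p.is_file()
    }


if __name__ == "__main__":
    fasta_dir = Path("data/fastafiles")  # directory named in the text above
    if fasta_dir.is_dir():
        for name, n_records in summarize_fasta_dir(fasta_dir).items():
            print(f"{name}\t{n_records}")
    else:
        print(f"{fasta_dir} not found; extract data.tar.gz first")
```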

To run the Snakemake workflow:

  1. download the example dataset data.tar.gz from the release section
  2. extract the example dataset with tar -xzf data.tar.gz
  3. (optional) list the outstanding tasks with snakemake -n, or visualize them as an SVG graph with snakemake --dag | dot -Tsvg > dag.svg (dot is part of Graphviz)
  4. execute the tasks with snakemake
  5. (optional) create a Snakemake report with snakemake --report report.html

The software reads the test dataset of 100 OGs from data/orthologous_groups and resolves the hierarchical inconsistencies. After the workflow completes (~2 min on a single core), the consistent OG definitions can be found in test_output/consistent_ogs. To run a larger example with the complete clustering of the 15 species, change the input parameter in config.yaml to point to data/orthologous_groups_full. Be aware that this takes considerably longer, and multi-core execution is strongly recommended (~1 h using 10 cores, i.e. snakemake --cores 10).
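The pipeline resolves inconsistencies through subsampling and tree reconciliation, which the sketch below does not attempt to reproduce. It only illustrates, on toy data, one reading of a hierarchical inconsistency that is assumed here: a sublevel OG whose members are spread over more than one OG at the parent level. The OG names, protein identifiers and the function inconsistent_child_ogs are all made up for the illustration.

```python
def inconsistent_child_ogs(parent_ogs, child_ogs):
    """Return {child OG id: sorted parent OG ids} for child OGs spanning several parent OGs."""
    # Map each protein to the parent-level OG that contains it.
    protein_to_parent = {
        protein: og_id
        for og_id, members in parent_ogs.items()
        for protein in members
    }
    flagged = {}
    for og_id, members in child_ogs.items():
        # Collect the parent OGs touched by this child OG; unmapped proteins are ignored.
        parents = {protein_to_parent[p] for p in members if p in protein_to_parent}
        if len(parents) > 1:
            flagged[og_id] = sorted(parents)
    return flagged


# Toy example (hypothetical identifiers): a parent level and one sublevel.
primates = {"pr1": {"a", "b", "c"}, "pr2": {"d", "e"}}
hominidae = {"hom1": {"a", "b"}, "hom2": {"c", "d"}}

print(inconsistent_child_ogs(primates, hominidae))
# prints {'hom2': ['pr1', 'pr2']}
```

In this toy case hom1 is nested inside pr1, while hom2 groups proteins that the parent level separates into pr1 and pr2, so only hom2 is reported.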

Contact

Feedback is always welcome. Feel free to write to davide.heller@imls.uzh.ch

License:GNU General Public License v3.0