ASR Curation workflow

Snakemake pipeline for annotating sequences to be used in ancestral sequence reconstruction.

Documentation

Basic concept

A common task in phylogenetics and ancestral sequence reconstruction is to start with a large set of data and curate it so that it includes only the relevant sequences.

The pipeline takes a large set of sequences, submitted by the user in a single FASTA file, and retrieves metadata for them from the UniProt and BRENDA databases. The user then creates a set of 'rules', based on that metadata, which determine how the data is split up and which sequences are included in downstream phylogenetic analyses.

As phylogenetic analysis is an iterative process that benefits from a deep understanding of the underlying sequences, these annotations can be viewed in interactive notebooks generated by Jupyter Book, and further sets of rules can be written in .subset files to produce alternative subsets of the data.

Because this is executed within a Snakemake pipeline, every iteration of the subsets remains available and the rules that exclude sequences are clearly defined, so the curation is entirely reproducible.
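To make the idea of rule-based subsetting concrete, here is a minimal Python sketch. It is not the pipeline's own code and does not use the .subset file syntax: the record fields (length, reviewed), the rule names, and the apply_rules helper are all hypothetical, chosen only to show how named rules can split annotated sequences while recording why each sequence was excluded.

```python
# Hypothetical annotation records; the field names are illustrative only
# and are not the columns produced by the pipeline.
annotations = [
    {"id": "seq1", "length": 350, "reviewed": True},
    {"id": "seq2", "length": 120, "reviewed": True},
    {"id": "seq3", "length": 360, "reviewed": False},
]

# Each rule is a (name, predicate) pair. The predicate returns True
# when a record passes the rule and should be kept.
rules = [
    ("min_length_300", lambda rec: rec["length"] >= 300),
    ("reviewed_only", lambda rec: rec["reviewed"]),
]


def apply_rules(records, rules):
    """Split records into kept IDs and a map of excluded ID -> failed rule names."""
    kept, excluded = [], {}
    for rec in records:
        failed = [name for name, keep in rules if not keep(rec)]
        if failed:
            excluded[rec["id"]] = failed
        else:
            kept.append(rec["id"])
    return kept, excluded


kept, excluded = apply_rules(annotations, rules)
print(kept)      # ['seq1']
print(excluded)  # {'seq2': ['min_length_300'], 'seq3': ['reviewed_only']}
```

Recording the failed rule names alongside each excluded sequence is one way to keep the reason for every exclusion explicit, which is the property the workflow relies on for reproducibility.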

Install instructions

Clone this repository to your desktop

git clone https://github.com/gabefoley/asr_curation.git

Create a conda environment

conda create -n asr_curation python=3.9

Activate the conda environment

conda activate asr_curation

Install the required Python packages

pip install -r requirements.txt

Install the following so that they are callable from the command line

mafft - callable as mafft
FastTree - callable as FastTree
GRASP - callable as grasp
IQ-TREE 2 - callable as iqtree2
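Before running the pipeline, it may help to confirm that the four tools above can be found on the PATH. The following is a small sketch using Python's standard library. The executable names come from the list above; the find_missing_tools helper is hypothetical and not part of this repository.

```python
import shutil

# Executable names taken from the list of required tools above.
REQUIRED_TOOLS = ["mafft", "FastTree", "grasp", "iqtree2"]


def find_missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be located on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools are callable.")
```

Run it inside the activated asr_curation environment so that the PATH being checked is the one the pipeline will use.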

Optional (for viewing trees with generated annotation files)

FigTree
