gabefoley / asr_curation



ASR Curation

Snakemake pipeline for annotating sequences to be used in ancestral sequence reconstruction.

Documentation

Read the full documentation here

Basic concept

A common task in phylogenetics and ancestral sequence reconstruction is to start with a large set of sequences of interest and curate it so that only relevant sequences remain.

This pipeline lets the user submit a large set of sequences (in a single FASTA file) and then define a set of 'rules' by which the data should be split up, that is, which sequences to include in downstream phylogenetic analyses. The rules operate on metadata that the pipeline retrieves from the UniProt and BRENDA databases.

Because phylogenetic analysis is an iterative process that benefits from a deep understanding of the underlying sequences, these annotations can be viewed in interactive notebooks generated by Jupyter Book, and further sets of rules can be written in .subset files to create alternative subsets of the data.

Because the curation runs within a Snakemake pipeline, every iteration of subsets is kept available and the rules that exclude sequences are clearly defined, which makes the curation entirely reproducible.
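To make the idea of rule-based subsetting concrete, here is a minimal Python sketch. It is not the pipeline's own code and does not use the .subset file syntax; the record fields (`length`, `reviewed`, `ec_number`), the subset names, and the `apply_subset` helper are all hypothetical, chosen only to illustrate how named sets of rules can carve one annotated dataset into several alternative subsets.

```python
# Hypothetical annotation records. The field names are illustrative and are
# NOT the column names produced by asr_curation.
annotations = [
    {"id": "seq1", "length": 350, "reviewed": True, "ec_number": "1.1.1.1"},
    {"id": "seq2", "length": 120, "reviewed": False, "ec_number": "1.1.1.1"},
    {"id": "seq3", "length": 410, "reviewed": True, "ec_number": None},
]

# Each named subset is a list of rules (predicates). A sequence is kept only
# if it passes every rule in the list. An empty list keeps everything.
subsets = {
    "all": [],
    "reviewed_only": [lambda rec: rec["reviewed"]],
    "reviewed_with_ec": [
        lambda rec: rec["reviewed"],
        lambda rec: rec["ec_number"] is not None,
    ],
}


def apply_subset(records, rules):
    """Return the IDs of the records that satisfy all of the given rules."""
    return [rec["id"] for rec in records if all(rule(rec) for rule in rules)]


# Evaluate every subset against the same annotated data.
results = {name: apply_subset(annotations, rules) for name, rules in subsets.items()}

for name, ids in results.items():
    print(name, ids)
```

Keeping each subset as a named, explicit list of rules mirrors the point made above: the reason a sequence was excluded can always be traced back to a specific rule.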

Install instructions

  1. Clone this repository to your desktop
     git clone https://github.com/gabefoley/asr_curation.git
  2. Create a conda environment
     conda create -n asr_curation python=3.9
  3. Activate the conda environment
     conda activate asr_curation
  4. Install the required Python packages
     pip install -r requirements.txt
  5. Install the following so that they are callable from the command line

Optional (for viewing trees with generated annotation files)

About

License: GNU General Public License v3.0

