smRecSearch

A Snakemake implementation of RecSearch

What follows is a use case of smRecSearch as documented in the GitHub Repo for Vazquez and Lynch (2021).

System Requirements

Local Computer: The RecSearch and de novo transcriptome assembly steps are the most resource-intensive parts of this pipeline. RecSearch was run on a workstation equipped with two 20-core processors @ 2.4 GHz and 128 GB RAM; however, RecSearch has a memory saver option that lowers the memory requirement to 8-16 GB, depending on the genome. The snakemake rule for RecSearch can be edited to set memory_saver_level = 2 in order to run this script on common equipment.

Computing Clusters & Cloud Computing: Individual parts of this pipeline were tested on the Midway2 Computing Cluster at the University of Chicago, which uses a SLURM job scheduler. Courtesy of snakemake, the entire pipeline should be compatible with various remote computing workflows, with caveats discussed in "Usage."

Operating System: This pipeline was developed on Linux, with additional testing on Mac computers. I would strongly urge Windows users to set up Windows Subsystem for Linux before running this, or really any, computational analysis.

Setup

Install conda, bioconda, and snakemake on your computer.
Install gfServer and gfClient from UCSC: http://hgdownload.soe.ucsc.edu/admin/exe/
Clone this repo:
git clone https://github.com/docmanny/smRecSearch.git
Create and install the conda environment needed for this pipeline:
conda env create --name RecSearch --file envs/conda_env.yaml
Activate the environment using conda activate RecSearch.

Data needed prior to use

Genomes (FASTA)

The core program underlying this pipeline, RecSearch, is flexible in its use of genome types and search algorithms. To reproduce this paper, you will need the genomes in the following table. Be sure to save them to the data/genomes folder.

Species	Common Name	Highest Quality Genome	Citation/Link
Choloepus hoffmanni	Hoffmann's two-toed sloth	choHof-C_hoffmanni-2.0.1_HiC	DNAZoo
Chrysochloris asiatica	Cape golden mole	chrAsi1	NCBI: chrAsi1
Dasypus novemcinctus	Nine-banded armadillo	dasNov3	NCBI: dasNov3
Echinops telfairi	Lesser hedgehog tenrec	echTel2	NCBI: echTel2
Elephantulus edwardii	Cape elephant shrew	eleEdw1	NCBI: eleEdw1
Elephas maximus	Asian elephant	eleMaxD	Palkopoulou et al. 2018
Loxodonta africana	African savanna elephant	loxAfr4	ftp://ftp.broadinstitute.org/pub/assemblies/mammals/elephant/loxAfr4
Loxodonta cyclotis	African forest elephant	loxCycF	Palkopoulou et al. 2018
Mammut americanum	American mastodon	mamAmeI	Palkopoulou et al. 2018
Mammuthus columbi	Columbian mammoth	mamColU	Palkopoulou et al. 2018
Mammuthus primigenius	Woolly mammoth	mamPriV	Palkopoulou et al. 2015
Orycteropus afer	Aardvark	oryAfe2	DNAZoo
Palaeoloxodon antiquus	Straight-tusked elephant	palAntN	Palkopoulou et al. 2018
Procavia capensis	Rock hyrax	proCap-Pcap_2.0_HiC	DNAZoo
Trichechus manatus latirostris	Manatee	triManLat2	DNAZoo
Homo sapiens	Human	hg38	NCBI: hg38

Port Table Configuration for BLAT/gfServer/gfClient

To use BLAT and gfServer/gfClient, you must specify a port for each gfServer. Additionally, various parts of this pipeline require interconverting between species names and genome assemblies. To facilitate this, a file named portTable.csv has been included in the data folder. Various genomes and species have already been included in this file along with suggested ports. If these ports are already used by other processes on your system, they can be changed without issue.
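
Before starting any gfServer, it can help to check whether a suggested port is actually free. The following is a minimal sketch, not part of the pipeline. The function name port_is_free and the port numbers in the loop are hypothetical; the column layout of portTable.csv is not described here, so the sketch does not try to parse it.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            # Binding fails with OSError when another process holds the port.
            sock.bind((host, port))
        except OSError:
            return False
    return True


if __name__ == "__main__":
    # Hypothetical port numbers; substitute the ones listed in portTable.csv.
    for port in (20000, 20001, 20002):
        status = "free" if port_is_free(port) else "in use"
        print(f"{port}: {status}")
```

A port reported as free can still be taken by another process before gfServer starts, so treat the result as a quick sanity check only.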

Query Files

Save all query sequences in data/input. The pipeline currently assumes FASTA input with the extension ".fa"; this can be changed in rules/RecSearch.smk. For convenience, the AvA.fa file containing the master list of sequences is included in data/input.
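
As a quick check that your query files are where the pipeline expects them, something like the sketch below lists the ".fa" files in data/input and counts the records in each. The helper names are hypothetical and this is not code from the repository; it only assumes the folder and extension described above.

```python
from pathlib import Path


def count_fasta_records(path):
    """Count header lines (those starting with '>') in a FASTA file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def list_query_files(input_dir="data/input", extension=".fa"):
    """Return the files in input_dir that carry the expected extension."""
    return sorted(Path(input_dir).glob(f"*{extension}"))


if __name__ == "__main__":
    for fasta in list_query_files():
        print(fasta.name, count_fasta_records(fasta))
```

A file that shows zero records, or a query file missing from the listing, usually points to a wrong extension or a non-FASTA file.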

SraRunTable

Our pipeline will generate de novo transcriptomes for target genomes using SRA identifiers and the HISAT2-StringTie pipeline. A table of SRAs for Loxodonta africana, Trichechus manatus, and Dasypus novemcinctus is included in data/SraRunTable/SraRunTable.csv.
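
To see which runs a table would pull in for each species, a short sketch along these lines may be useful. It assumes the table has "Run" and "Organism" columns, as exports from the NCBI SRA Run Selector normally do; the included SraRunTable.csv may differ, so adjust the column names if needed. The rows in the example are made up and the accessions are placeholders.

```python
import csv
import io
from collections import defaultdict


def runs_by_organism(handle, run_col="Run", organism_col="Organism"):
    """Map each organism to the list of SRA run accessions in the table."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        grouped[row[organism_col]].append(row[run_col])
    return dict(grouped)


# Made-up rows for illustration only; the accessions are placeholders.
example = io.StringIO(
    "Run,Organism\n"
    "SRR0000001,Loxodonta africana\n"
    "SRR0000002,Loxodonta africana\n"
    "SRR0000003,Dasypus novemcinctus\n"
)
for organism, runs in runs_by_organism(example).items():
    print(organism, runs)
```

To inspect the real table, pass an open file handle instead, for example open("data/SraRunTable/SraRunTable.csv", newline="").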

Other lines of evidence for Reciprocal Best-Hits (optional)

In addition to performing RBH Searches, this pipeline can intersect the hits with other lines of evidence to validate the results, and return a list of "evidenced" hits. To do so, simply download the other evidence as either a BED or a GFF file into either data/BED or data/GFF, respectively.
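
To illustrate what intersecting hits with evidence means, here is a minimal sketch that keeps every hit overlapping at least one evidence interval on the same chromosome. This is not the pipeline's own implementation: how the pipeline handles strand or minimum overlap is not described here, and the function names are hypothetical. It uses a naive pairwise scan, which is fine for small files; dedicated interval tools are the usual choice for whole genomes.

```python
def parse_bed(lines):
    """Yield (chrom, start, end, name) from BED lines; name may be empty."""
    for line in lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        fields = line.rstrip("\n").split("\t")
        name = fields[3] if len(fields) > 3 else ""
        yield fields[0], int(fields[1]), int(fields[2]), name


def evidenced_hits(hits, evidence):
    """Return the hits that overlap any evidence interval on the same chromosome.

    BED intervals are 0-based and half-open, so [s1, e1) and [s2, e2)
    overlap when s1 < e2 and s2 < e1.
    """
    by_chrom = {}
    for chrom, start, end, _ in evidence:
        by_chrom.setdefault(chrom, []).append((start, end))
    kept = []
    for chrom, start, end, name in hits:
        if any(start < e_end and e_start < end
               for e_start, e_end in by_chrom.get(chrom, [])):
            kept.append((chrom, start, end, name))
    return kept


# Toy intervals for illustration only.
hits = list(parse_bed(["chr1\t100\t200\tgeneA\n", "chr1\t500\t600\tgeneB\n"]))
evidence = list(parse_bed(["chr1\t150\t300\ttranscript1\n"]))
print(evidenced_hits(hits, evidence))  # [('chr1', 100, 200, 'geneA')]
```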

Usage

I highly suggest both reading the snakemake tutorial and looking through the snakefiles in the rules/ folder to familiarize yourself with the rules.

While it is possible to run the entire pipeline from start to end using snakemake --use-conda publication/manuscript.pdf, I would strongly recommend the following execution order:

Generate the modified hg38 for reciprocal best hit searches using your query file of interest:
snakemake --use-conda output/recBlastDBPrep/hg38_maskRep_noVarChr_fragWithGenes.2bit
Next, confirm that RecSearch runs correctly for one genome:
snakemake --use-conda -npr output/loxAfr4/AvA-pcScore0.1_pcIdent0.8_pcQuerySpan0.5_reverse-hg38_maskRep_noVarChr_fragWithGenes/RBB/loxAfr4_RecBlastOutput.bed.rbb
Generate the transcriptomic evidence for functional duplicates:
snakemake --use-conda -npr data/BED/loxAfr4-finalGuide.bed

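Once these steps run without problems, you can either repeat the RecSearch and evidence steps individually for each genome, or continue to the next step and let snakemake generate them automatically. Note that the commands containing -npr are dry runs: -n only lists the jobs snakemake would execute, while -p and -r print the shell commands and the reason for each job. Remove the -n to actually run them.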
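
If you choose to repeat the steps per genome, the target paths follow the pattern visible in the commands above, so they can be generated rather than typed. The sketch below is a hypothetical helper, not part of the repository. It assumes the path segments encode the query name, the three thresholds, and the reverse-search database exactly as in the loxAfr4 example, and that the rules accept other genome names in the same positions. The three genomes in the loop are the ones with SRA data in the included SraRunTable.

```python
def rbb_target(genome, query="AvA", pc_score=0.1, pc_ident=0.8,
               pc_query_span=0.5,
               reverse_db="hg38_maskRep_noVarChr_fragWithGenes"):
    """Build the RBB output path for one genome, following the loxAfr4 example."""
    params = (f"{query}-pcScore{pc_score}_pcIdent{pc_ident}"
              f"_pcQuerySpan{pc_query_span}_reverse-{reverse_db}")
    return f"output/{genome}/{params}/RBB/{genome}_RecBlastOutput.bed.rbb"


def evidence_target(genome):
    """Build the transcriptomic evidence path for one genome."""
    return f"data/BED/{genome}-finalGuide.bed"


if __name__ == "__main__":
    # Prints dry-run commands; drop the -n to execute them.
    for genome in ("loxAfr4", "triManLat2", "dasNov3"):
        print("snakemake --use-conda -npr",
              rbb_target(genome), evidence_target(genome))
```

Printing the commands instead of running them keeps the sketch side-effect free; paste the output into your shell once the dry runs look right.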

Generate the table of genes per genome with copy number and ECNC scores (plus required files). The {RBB,evidenced} part relies on bash brace expansion to request both tables in one command:
snakemake --use-conda -npr output/geneCopyTable/AvA-pcScore0.1_pcIdent0.8_pcQuerySpan0.5_reverse-hg38_maskRep_noVarChr_fragWithGenes/atlantogenata-GeneCopyTable_{RBB,evidenced}_filtered_long.csv
Generate the maximum-likelihood tree for ancestral gene copy numbers:
snakemake --use-conda -npr output/iqtree/AvA-pcScore0.1_pcIdent0.8_pcQuerySpan0.5_reverse-hg38_maskRep_noVarChr_fragWithGenes/maxLikelihood_model_MK+FQ+I+G4-dataType_MORPH-asrMin_0.8/atlantogenata_RBB_filtered_dyn.iqtree
Generate the final manuscript:
snakemake --use-conda -npr publication/manuscript.pdf

Issues

If you run into any issues with this pipeline, or see that there are incomplete rules, please submit an Issue with a reproducible example. Including all error codes, log files, and any leads or investigative work that you did will go a long way towards making this manuscript and workflow better, faster. While I can't guarantee that this will work on every computer, I can at least attest that it works well on Linux and Mac systems, in addition to SLURM clusters.
