Benchmarking of tools for viral haplotype reconstruction.
WARNING: under active construction! Things may not work as described; if that happens, feel free to file an issue.
This repo can be used to test a variety of tools and settings to build pipelines for viral haplotype/quasispecies reconstruction on simulated, in vitro and in vivo data.
Haplotyper | Implemented | Paper | Code |
---|---|---|---|
QuasiRecomb | Yes | Link | Link |
aBayesQR | Yes | Link | Link |
SAVAGE | Yes | Link | Link |
RegressHaplo | Yes | Link | Link |
Haploclique | Yes | Link | Link |
SHORAH | No | Link | Link |
PredictHaplo | No | Link | Link |
QSdpR | No | Link | Link |
TenSQR | No | Link | Link |
A standard invocation will be of the form:
snakemake output/$DATASET/$QUALITY_CONTROL/$READ_MAPPER/$GENE/$HAPLOTYPER/haplotypes.fasta
where the variables can take one of the following values:
- `$DATASET`: one of the following (for more information, see the Data section below):
  - LANL simulations: see the keys of `simulations.json` for potential simulation names
  - Reconstruction: see the filenames in the `reconstruction` directory of input data
  - Evolution: see the filenames in the `evolution` directory of input data
  - Compartmentalization: see the entries in `compartmentalization.json` in the root directory
- `$QUALITY_CONTROL`: `qfilt`, `fastp`, `trimmomatic`
- `$READ_MAPPER`: `bealign`, `bowtie2`, `bwa`
- `$GENE`: `env`, `gag`, `int`, `nef`, `pol`, `pr`, `prrt`, `rev`, `rt`, `tat`, `vif`, `vpr`
- `$HAPLOTYPER`: `quasirecomb`, `abayesqr`, `savage`, `regresshaplo`
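Since every target follows the same path pattern, it can help to assemble targets programmatically, for instance when benchmarking many combinations at once. The following is a minimal Python sketch, not part of this repo: the helper names `target_path` and `all_targets` are hypothetical, the allowed values are copied from the values listed above, and `my_dataset` is a placeholder rather than a real dataset name. `$DATASET` is not validated because its valid names live in the JSON files and input directories.

```python
from itertools import product

# Allowed values, copied from the values listed above.
QUALITY_CONTROL = ("qfilt", "fastp", "trimmomatic")
READ_MAPPERS = ("bealign", "bowtie2", "bwa")
GENES = ("env", "gag", "int", "nef", "pol", "pr",
         "prrt", "rev", "rt", "tat", "vif", "vpr")
HAPLOTYPERS = ("quasirecomb", "abayesqr", "savage", "regresshaplo")


def target_path(dataset, quality_control, read_mapper, gene, haplotyper):
    """Return the snakemake target for one combination of settings.

    Hypothetical helper: raises ValueError for values not listed above.
    """
    checks = (
        ("quality control tool", quality_control, QUALITY_CONTROL),
        ("read mapper", read_mapper, READ_MAPPERS),
        ("gene", gene, GENES),
        ("haplotyper", haplotyper, HAPLOTYPERS),
    )
    for label, value, allowed in checks:
        if value not in allowed:
            raise ValueError(
                f"unknown {label} {value!r}; expected one of {allowed}"
            )
    return (
        f"output/{dataset}/{quality_control}/{read_mapper}/"
        f"{gene}/{haplotyper}/haplotypes.fasta"
    )


def all_targets(dataset):
    """Enumerate every settings combination for one dataset (hypothetical)."""
    return [
        target_path(dataset, qc, mapper, gene, haplotyper)
        for qc, mapper, gene, haplotyper in product(
            QUALITY_CONTROL, READ_MAPPERS, GENES, HAPLOTYPERS
        )
    ]


if __name__ == "__main__":
    # "my_dataset" is a placeholder, not a real dataset name.
    print(target_path("my_dataset", "fastp", "bwa", "env", "savage"))
```

The printed path can be passed directly to `snakemake` as a target. Validating up front catches typos before snakemake reports a missing rule for an unknown path.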
Use of snakemake permits running on a TORQUE cluster, e.g.
snakemake --cluster 'qsub -o ./logs -e ./logs -V -d `pwd` -l nodes=1:ppn=$PPN' -j $JOBS -k target
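When the submission is scripted, the same command can be assembled as an argument list, which avoids nested shell quoting. This is a rough sketch under stated assumptions: `cluster_command` is a hypothetical helper, the `ppn=4` and `jobs=10` values are arbitrary examples, and only the flags already shown in the command above are used. The working directory is taken from `os.getcwd()` in place of the shell's `pwd`.

```python
import os
import shlex


def cluster_command(target, ppn, jobs, log_dir="./logs"):
    """Build the snakemake argument list for TORQUE submission.

    Hypothetical helper mirroring the command line shown above.
    """
    workdir = os.getcwd()  # stands in for the shell's `pwd`
    qsub = (
        f"qsub -o {shlex.quote(log_dir)} -e {shlex.quote(log_dir)} -V "
        f"-d {shlex.quote(workdir)} -l nodes=1:ppn={int(ppn)}"
    )
    return ["snakemake", "--cluster", qsub, "-j", str(int(jobs)), "-k", target]


if __name__ == "__main__":
    # ppn and jobs are example values; "target" is the placeholder used above.
    cmd = cluster_command("target", ppn=4, jobs=10)
    # Print a copy-pastable command line instead of running it.
    print(shlex.join(cmd))
```

To actually submit, the list could be handed to `subprocess.run(cmd, check=True)` from the repository root.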
The input data directory has the following layout:

    ├── compartmentalization
    │   ├── $PATIENT_ID/$DATE/$COMPARTMENT/$REPLICATE/reads.fasta
    │   ├── $PATIENT_ID/$DATE/$COMPARTMENT/$REPLICATE/scores.qual
    ├── evolution
    │   ├── ERS661087.fastq
    │   ├── ERS661088.fastq
    │   ├── ERS661089.fastq
    │   ├── ERS661090.fastq
    │   ├── ERS661091.fastq
    │   ├── ERS661092.fastq
    │   └── ERS661093.fastq
    ├── LANL-HIV-aligned.fasta
    ├── LANL-HIV.fasta
    ├── LANL-HIV.new
    ├── README.md
    ├── reconstruction
    │   ├── 3.GAC.454Reads.fna
    │   ├── 3.GAC.454Reads.qual
    │   ├── 93US141_100k_14-159320-1GN-0_S16_L001_R1_001.fastq
    │   ├── 93US141_100k_14-159320-1GN-0_S16_L001_R2_001.fastq
    │   ├── BP_050100753.fasta
    │   ├── BP_050100753.qual
    │   ├── FiveVirusMixIllumina_1.fastq
    │   ├── FiveVirusMixIllumina_2.fastq
    │   ├── PP1L_S45_L001_R1_001.fastq
    │   ├── PP1L_S45_L001_R2_001.fastq
    │   ├── regress_haplo.bam
    │   ├── regress_haplo.bam.bai
    │   ├── sergei1.fastq
    │   ├── sergei2.fastq
    │   ├── SRR961514-Illumina.sra
    │   ├── SRR961596-454.fastq
    │   ├── SRR961596-454.sra
    │   ├── SRR961669-PacBio.fastq
    │   └── SRR961669-PacBio.sra
    └── references
        ├── env.fasta
        ├── gag.fasta
        ├── int.fasta
        ├── nef.fasta
        ├── pol.fasta
        ├── pr.fasta
        ├── prrt.fasta
        ├── rev.fasta
        ├── rt.fasta
        ├── tat.fasta
        ├── vif.fasta
        ├── vpr.fasta
        └── vpu.fasta

LANL-HIV.fasta
LANL-HIV-aligned.fasta
LANL-HIV.new
HIV genomes from the LANL database, as well as an alignment built with mafft and a tree built with FastTree. Used for simulation.
evolution/ERS6610*.fastq
NGS read data from a study on HIV intra-host evolution.
reconstruction/BP_050100753.fasta
reconstruction/BP_050100753.qual
ACME lab 454 data which shows a clear signal of segregating haplotypes.
reconstruction/SRR961514-Illumina.fastq
reconstruction/SRR961596-454.fastq
reconstruction/SRR961669-PacBio.fastq
A gold standard dataset, consisting of mixed, known strains at known proportions.
reconstruction/sergei1.fastq
reconstruction/sergei2.fastq
A set of paired-end reads provided by Sergei.
reconstruction/regress_haplo.bam
reconstruction/regress_haplo.bam.bai
Dataset that comes with the RegressHaplo code.
references/*.fasta
HXB2 genes to be used as references when aligning reads.
- Linux (tested on Ubuntu 18.04.1, CentOS 7)
- conda (tested on 4.6.10) with standard BioConda channels
Further requirements are listed in environment.yml.
conda env create -f environment.yml
conda activate haplotype-reconstruction