cvn001 / quanttb

QuantTB

QuantTB is a SNP-based method to identify and quantify individual strains present in tuberculosis whole genome sequencing datasets.

Getting Started

These instructions guide you through installing and using QuantTB on your own local system. QuantTB has been tested on Mac OS X and Ubuntu.

Prerequisites

QuantTB requires Python 2.x, together with its development and setuptools packages (https://www.python.org/downloads/). Alternatively, Python can be installed using a package manager such as Miniconda (https://conda.io/miniconda.html). On Ubuntu:

sudo apt install python
sudo apt install python-setuptools
sudo apt install python-dev

A recent version of Java is also needed:

sudo add-apt-repository ppa:linuxuprising/java
sudo apt update
sudo apt install oracle-java11-installer

Some QuantTB functionality requires additional software (samtools and bwa) to be installed on your system:

sudo apt install samtools
sudo apt install bwa

Installing

Download the latest release of QuantTB from https://github.com/AbeelLab/quanttb/releases and install it:

tar -zxvf quanttb-1.01.tar.gz
cd quanttb-1.01
sudo python setup.py install

QuantTB should now be installed. If something does not work as expected, a log file in the temp directory of the output folder may help you diagnose the problem.

Running QuantTB

QuantTB can be used to classify strains, make a SNP database, and obtain SNP profiles from FASTQ readsets.

Quantifying individual strains in a sample

To classify the strains in a sample against a reference SNP database, use the 'quant' command. QuantTB accepts as input either a list of FASTQ files (-f argument), or a list of VCF files (as produced by Pilon) or .samp files (-v argument). A reference SNP database (.db) is supplied with the -db flag. QuantTB comes prepackaged with a database of 2166 TB genomes that differ by at least 100 SNPs; this database is used by default if no reference SNP database is supplied.

QuantTB classifies strains using an iterative approach. The maximum number of iterations is 8 by default and can be changed with the '-i' flag.

# Classify a sample from the example data with the default database and save results to output1/myresults.txt
quanttb quant -f exdata/readset1.fq exdata/readset2.fq -o output1/myresults.txt

A result file listing the references observed in the sample is written to the specified location (the default is output/results.txt). For a sample containing two strains, the output looks like the table below. Each row denotes the presence of a specific reference SNP profile in the corresponding sample. Relative abundances are given in the 'relabundance' column.

sample    refname     totscore  relabundance  depth
readset1  UT0106      0.577     0.699         5.472
readset1  I0003367-5  0.437     0.301         2.358
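
For downstream processing, the result table can be read with a few lines of Python. The following is a minimal sketch, not part of QuantTB: it assumes the file is whitespace-delimited with the header row shown above, and the function name `read_quanttb_results` is hypothetical. It is written as a standalone Python 3 helper, separate from the QuantTB installation itself.

```python
from collections import defaultdict


def read_quanttb_results(lines):
    """Group result rows by sample.

    Assumption: columns are separated by whitespace and the first
    non-empty line is the header shown in the example table.
    """
    rows = [line.split() for line in lines if line.strip()]
    header, records = rows[0], rows[1:]
    by_sample = defaultdict(list)
    for fields in records:
        row = dict(zip(header, fields))
        # Convert the numeric columns from the example table to floats
        for key in ("totscore", "relabundance", "depth"):
            row[key] = float(row[key])
        by_sample[row["sample"]].append(row)
    return dict(by_sample)


# The example table from above, used here instead of a real file
example = """sample refname totscore relabundance depth
readset1 UT0106 0.577 0.699 5.472
readset1 I0003367-5 0.437 0.301 2.358
"""

results = read_quanttb_results(example.splitlines())
for sample, strains in results.items():
    for strain in strains:
        print(sample, strain["refname"], strain["relabundance"])
```

To read an actual result file, pass an open file handle instead of `example.splitlines()`, for example `read_quanttb_results(open("output1/myresults.txt"))`.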

QuantTB can optionally report antibiotic resistance variants in the sample using a list of predetermined resistance mutations. Mutations at positions associated with antibiotic resistance are output when the '-abres' flag is given.

# Classify a sample with the default database, and output antibiotic resistance results
quanttb quant -f exdata/readset1.fq exdata/readset2.fq -o output2/myresults.txt -abres

Antibiotic resistance results for all samples are output in a separate file, 'antibioticresistances.txt'.

QuantTB can also work directly from pre-computed VCF files. One example is included:

gunzip exdata/sample1snps.vcf
quanttb quant -v exdata/sample1snps.vcf -o output3/myresults.txt

Pre-computed VCF files can also be combined with a user-defined database (see below).

quanttb quant -v sample1snps.vcf sample2snps.vcf sample3snps.vcf -db newdb.db -o someoutput/myresults.txt
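
When many samples have to be processed, it can be convenient to assemble the 'quant' command line from a script. The sketch below only uses the flags documented in this README (-f, -v, -db, -o, -i, -abres). The helper name `build_quant_command` is hypothetical, and it assumes that '-i' takes the iteration count as its value.

```python
def build_quant_command(fastqs=None, vcfs=None, db=None, output=None,
                        iterations=None, abres=False):
    """Return the argument list for a 'quanttb quant' call.

    Only flags described in the README are used. Assumption: '-i' is
    followed by the number of iterations.
    """
    cmd = ["quanttb", "quant"]
    if fastqs:
        cmd += ["-f"] + list(fastqs)
    if vcfs:
        cmd += ["-v"] + list(vcfs)
    if db:
        cmd += ["-db", db]
    if output:
        cmd += ["-o", output]
    if iterations is not None:
        cmd += ["-i", str(iterations)]
    if abres:
        cmd.append("-abres")
    return cmd


cmd = build_quant_command(
    vcfs=["sample1snps.vcf", "sample2snps.vcf"],
    db="newdb.db",
    output="someoutput/myresults.txt",
)
print(" ".join(cmd))

# To actually run it (requires QuantTB to be installed):
# import subprocess
# subprocess.check_call(cmd)
```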

Making a SNP database

A reference SNP database is required to classify samples. QuantTB comes prepackaged with its own default database, but you can also build your own. A SNP database can be made from several kinds of input: a list of VCF files (.vcf), a list of genome assemblies (.fna/.fa/.fasta), a list of SNP files output by MUMmer (.snps), or a list of SNP profiles (.samp) generated by QuantTB. The command to make a SNP database is 'makesnpdb'. The paths to the reference files are supplied with the '-g' argument.

# From VCF files
quanttb makesnpdb -g genome1.vcf genome2.vcf genome3.vcf ....

# From fna files. MUMmer must be installed on system
quanttb makesnpdb -g genome1.fna genome2.fna genome3.fna ....

Optionally, during construction of the SNP database, QuantTB can filter the SNPs and the genomes to increase classification accuracy. Within each genome, SNPs that lie within a certain distance of each other can be removed using the '-reddist' argument. The default is 25.

# Make a database, removing SNPs in each genome that lie within a distance of 50 of each other
quanttb makesnpdb -reddist 50 -g genome1.fna genome2.fna genome3.fna ....
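
To illustrate the idea of distance-based SNP filtering, here is a small sketch of one plausible interpretation: every SNP that has a neighbour closer than the minimum distance is dropped. This is an assumption made for illustration only; the README does not specify how QuantTB resolves clusters of nearby SNPs, and the function name `filter_close_snps` is hypothetical.

```python
def filter_close_snps(positions, min_dist=25):
    """Drop SNP positions that have a neighbour closer than min_dist.

    Illustrative assumption: both members of a too-close pair are
    removed. QuantTB's own filtering may behave differently.
    """
    ordered = sorted(set(positions))
    kept = []
    for i, pos in enumerate(ordered):
        # Check the gap to the previous and next SNP along the genome
        left_ok = i == 0 or pos - ordered[i - 1] >= min_dist
        right_ok = i == len(ordered) - 1 or ordered[i + 1] - pos >= min_dist
        if left_ok and right_ok:
            kept.append(pos)
    return kept


snps = [100, 110, 200, 300, 320, 400]
print(filter_close_snps(snps, min_dist=25))  # [200, 400]
```

In this example, positions 100 and 110 are 10 apart and 300 and 320 are 20 apart, so all four are removed, while 200 and 400 are kept.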

The resulting SNP database is written with the '.db' suffix and can be passed to the 'quant' command to classify strains within a sample.

Getting variants

To use this functionality, compatible versions of bwa and samtools need to be installed and available on your PATH. For example, to build samtools 1.7 and bwa 0.7.17 from source:

wget https://github.com/samtools/samtools/releases/download/1.7/samtools-1.7.tar.bz2 -O - | tar xj ; ( cd samtools-1.7 ; make )
export PATH=/full/path/to/samtools-1.7:${PATH}

wget https://github.com/lh3/bwa/releases/download/v0.7.17/bwa-0.7.17.tar.bz2 -O - | tar xj ; ( cd bwa-0.7.17 ; make )
export PATH=/full/path/to/bwa-0.7.17:${PATH}
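
Before running the 'variants' command, it may be worth checking that the external tools are actually visible on the PATH. The following sketch uses `shutil.which` from the Python 3 standard library; the helper name `missing_tools` is hypothetical and the script is independent of QuantTB.

```python
import shutil


def missing_tools(tools=("samtools", "bwa", "java")):
    """Return the names of tools that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("samtools, bwa and java were all found on PATH")
```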

FASTQ readsets need to be converted to a VCF file before they can be classified or used as a reference genome in the database. This can optionally be done with the QuantTB 'variants' command, which accepts paired-end or single-end FASTQ files as input. Variants are called against the H37Rv genome (GenBank: CP003248.2). For multiple samples, the '-f' argument can be used repeatedly.

# For one paired end readset
quanttb variants -f sample_1.fq sample_2.fq

# For two paired end readsets
quanttb variants -f sample_1.fq sample_2.fq -f sample2_1.fq sample2_2.fq

This outputs a VCF file which can be used as input to the 'makesnpdb' or 'quant' command.
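
With many readsets, the repeated '-f' groups can be generated from the file names. The sketch below assumes that the two files of a pair end in `_1` and `_2` before the `.fq` or `.fastq` extension, as in the examples above; files without such a suffix are treated as single-end. The helper name `build_variants_command` is hypothetical.

```python
import re
from collections import defaultdict


def build_variants_command(fastq_paths):
    """Build a 'quanttb variants' call with one -f group per sample.

    Assumption: mates of a pair are named <sample>_1.fq and
    <sample>_2.fq (or .fastq). Other files form single-end groups.
    """
    groups = defaultdict(list)
    for path in sorted(fastq_paths):
        # Remove a trailing _1/_2 mate suffix to obtain a sample key
        sample = re.sub(r"_[12](\.f(ast)?q)$", r"\1", path)
        groups[sample].append(path)
    cmd = ["quanttb", "variants"]
    for sample in sorted(groups):
        cmd += ["-f"] + groups[sample]
    return cmd


files = ["sample2_2.fq", "sample_1.fq", "sample2_1.fq", "sample_2.fq"]
print(" ".join(build_variants_command(files)))
```

For the four files in the example, this prints the same command as the two-readset example above.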

Full list of command line arguments

Usage: quantTB <command> [options]

Command: makesnpdb       Make a reference SNP database
         quant           Quantify sample with a ref SNP db
         variants        Generate a vcf from sequencing readsets


Usage: quanttb quant [options] <-db> <-s>

Optional arguments:
  -v [VCFSAMPLES ...]
                        VCF(s) or snp profiles that you want tested against
                        the refdb, can either be .vcf(.gz) or .samp
  -f [FASTQ ...]  Fastq Readset(s) that you want tested against refdb,
                        can specify multiple times for multiple pairs, (.fq,
                        .fastq)
  -db DB                Location of reference SNP DB file. (.db) If not
                        supplied a default TB db will be used
  -o OUTPUT             Directory/File where you want results written to
  -resout               Should stats from each run be output?
  -i                    Number of iterations for classification
  -abres                Should resistances from each sample be output?
  -k                    Keep temp files?

Usage: quanttb makesnpdb [options] <-g>

Required arguments:
  -g [DBFILES ...]
                        Files you want to use to make the reference database,
                        Can be either .fa, .fna, .fasta, .snps .vcf(.gz), or
                        .samp

Optional arguments:
  -reducedist     When making database, what is the minimum distance
                        between SNPs in a genome
  -o OUTPUT             Directory/File where you want the snpdb to be written
                        to
  -k                    Keep temp files?

Usage: quanttb variants [options] <-f>

Required arguments:
  -f FASTQ [FASTQ ...]  Fastq Readset(s) that you want converted to a vcf
                        (using samtools,bwa and pilon), can specify multiple
                        times for multiple pairs, (.fq, .fastq)

Optional arguments:
  -o OUTPUT             Directory you want vcf written to
  -k                    Keep temp files?

License

This project is licensed under the GNU General Public License v3.0; see the LICENSE file for details.

Contact
