VarGeno-Lite

VarGeno-Lite is the low-memory version of VarGeno.

Prerequisites

A modern, C++11 ready compiler, such as g++ version 4.9 or higher.
The cmake build system (only needed to install the SDSL library; if SDSL is already installed, cmake is not needed).
A 64-bit operating system. Mac OS X and Linux are currently supported.

Quick Install

Please install VarGeno before installing VarGeno-Lite.

To install VarGeno-Lite:

cd vargeno/vargeno_lite
make all

You should then see the vargeno_lite and gbf_lite executables in the vargeno/vargeno_lite directory.

Quick Usage

VarGeno takes as input:

A reference genome sequence in FASTA file format.
A list of SNPs to be genotyped, in UCSC text file format. VCF format support is coming soon.
Sequencing reads from the donor genome in FASTQ file format. If you have multiple FASTQ files, please cat them into one file.
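
As a minimal sketch of the concatenation step, the snippet below merges two FASTQ files into one. The file names sample_1.fq and sample_2.fq and their contents are made up for illustration; substitute your own read files.

```shell
# Create two tiny FASTQ files for demonstration (hypothetical names and reads).
printf '@r1\nACGT\n+\nIIII\n' > sample_1.fq
printf '@r2\nTGCA\n+\nIIII\n' > sample_2.fq

# Concatenate them into the single reads file passed to the genotyping step.
cat sample_1.fq sample_2.fq > reads.fq

# Each FASTQ record spans 4 lines, so the line count divided by 4 is the read count.
echo $(( $(wc -l < reads.fq) / 4 ))
```

The final line is only a sanity check that no reads were lost during concatenation.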

Before genotyping an individual, you must construct indices for the reference using the following commands:

vargeno_lite ucscd ref.fa snp.txt ref.dict snp.dict
vargeno_lite filt ref.dict snp_pos ref.filt.dict
gbf_lite snp ref.fa snp.txt snp.bf

This constructs the dictionaries ref.dict, ref.filt.dict and snp.dict, the Bloom filters ref.filt.bf and snp.bf, and a file with the chromosome lengths, ref.fa.chrlens. The filt command generates ref.filt.bf automatically alongside ref.filt.dict.

To perform the genotyping:

vargeno_lite geno ref.filt.dict snp.dict reads.fq ref.fa.chrlens ref.filt.bf snp.bf result.out
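
The indexing and genotyping commands above can be chained in a small wrapper script. The following is only a sketch: the REF, SNPS, READS and DRY_RUN variables and the run helper are hypothetical names introduced here, not part of VarGeno-Lite. With DRY_RUN left at its default of 1 the script only prints the commands, so it can be reviewed before anything is executed.

```shell
#!/bin/sh
# Sketch of a wrapper around the commands shown above.
# REF, SNPS, READS and DRY_RUN are hypothetical variables, not VarGeno-Lite options.
set -e

REF=${REF:-ref.fa}
SNPS=${SNPS:-snp.txt}
READS=${READS:-reads.fq}

# Print the command when DRY_RUN=1 (the default); execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$@"
    else
        "$@"
    fi
}

# Build the indices, with arguments exactly as in the commands above.
run vargeno_lite ucscd "$REF" "$SNPS" ref.dict snp.dict
run vargeno_lite filt ref.dict snp_pos ref.filt.dict
run gbf_lite snp "$REF" "$SNPS" snp.bf

# Genotype the reads against the lite indices.
run vargeno_lite geno ref.filt.dict snp.dict "$READS" "${REF}.chrlens" ref.filt.bf snp.bf result.out
```

Setting DRY_RUN=0 runs the commands for real, assuming vargeno_lite and gbf_lite are on your PATH; set -e stops the script at the first failing step.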

Output format

VarGeno variant genotyping output files contain 4 tab-separated fields for each SNP:

chromosome id
genome position (1-based): The first two fields together uniquely identify a SNP in the input SNP list.
genotypes: 0/0, 0/1 or 1/1
quality score in [0,1]: higher quality score means more confident genotyping result
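
Because the output is plain tab-separated text, it can be post-processed with standard tools. The sketch below keeps only non-reference calls above a quality cutoff. The rows are made up to match the four fields described above (they are not real VarGeno output), and the 0.8 threshold and the demo file names are arbitrary choices for illustration.

```shell
# Made-up rows in the 4-field layout: chromosome, position, genotype, quality.
printf 'chr22\t100\t0/1\t0.99\nchr22\t200\t0/0\t0.42\nchr22\t300\t1/1\t0.87\n' > demo_result.out

# Keep SNPs genotyped as 0/1 or 1/1 with quality >= 0.8 (arbitrary cutoff).
awk -F'\t' '$3 != "0/0" && $4 >= 0.8' demo_result.out > demo_confident.out

cat demo_confident.out
```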

Example

In this example, we genotype 100 SNPs on human chromosome 22 with a small subset of 1000 Genomes Project Illumina sequencing reads. The whole process should finish in around a minute and requires XX GB RAM. Suppose VarGeno-Lite is installed in the directory $VARGENO_LITE.

Go to the test data directory:

cd $VARGENO_LITE/../test

Pre-process the reference and SNP list to generate the indices:

$VARGENO_LITE/vargeno_lite ucscd chr22.fa snp.txt ref.dict snp.dict

Generate the lite version dictionary and Bloom filter:

$VARGENO_LITE/vargeno_lite filt ref.dict snp_pos ref.filt.dict

Note that this command also generates the lite version Bloom filter, named ref.filt.bf.

Generate the SNP Bloom filter:

$VARGENO_LITE/gbf_lite snp chr22.fa snp.txt snp.bf

Genotype the variants:

$VARGENO_LITE/vargeno_lite geno ref.filt.dict snp.dict reads.fq chr22.fa.chrlens ref.filt.bf snp.bf result.out

[Warning] The dictionaries and Bloom filters generated by VarGeno are not compatible with VarGeno-Lite.

Citation

If you use VarGeno in your research, please cite:

Chen Sun and Paul Medvedev, Accelerating SNP genotyping from whole genome sequencing data for bedside diagnostics

VarGeno's algorithm and code are built on top of LAVA's, and it reuses much of LAVA's code. It also uses some code from the AllSome project.

Shajii A, Yorukoglu D, William Yu Y, Berger B, Fast genotyping of known SNPs through approximate k-mer matching, Bioinformatics. 2016 32(17):i538-i544. Code is available here.
