hyphaltip / CNVnator

a tool for CNV discovery and genotyping from depth-of-coverage by mapped reads

README file for CNVnator software distribution



1. Compilation
==============

You must install the ROOT package (http://root.cern.ch) and set the $ROOTSYS
environment variable (see the ROOT documentation).

$ cd src/samtools
$ make

If compilation does not complete but the file libbam.a has been created, you
can continue.

$ cd ../
$ make 

If make fails, try "make OMP=no", which disables parallel support.
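
The build steps above can be wrapped in a small shell function. This is only a
sketch: the function name build_cnvnator and its argument (the top of the
CNVnator source tree) are illustrative and not part of the distribution, and
$ROOTSYS is assumed to be set already.

```shell
# Sketch only: build the bundled samtools first, then CNVnator.
# Assumes $ROOTSYS is already set. The argument (hypothetical) is the top of
# the CNVnator source tree and defaults to the current directory.
build_cnvnator() {
    top=${1:-.}
    (
        cd "$top/src/samtools" || exit 1
        # samtools may stop with an error; that is fine if libbam.a exists.
        make || echo "samtools build reported an error, checking libbam.a" >&2
        [ -f libbam.a ] || { echo "libbam.a was not created" >&2; exit 1; }
        cd ../ || exit 1
        # Fall back to a build without parallel support if plain make fails.
        make || make OMP=no
    )
}
```

Because the work happens in a subshell, the caller's working directory is left
unchanged. Typical use would be "build_cnvnator /path/to/CNVnator".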

>>>Installing with Yeppp support
Yeppp (http://www.yeppp.info/) is a library that provides high-performance
implementations of math functions. To install with Yeppp support, download Yeppp
from http://bitbucket.org/MDukhan/yeppp/downloads/yeppp-1.0.0.tar.bz2 and extract
it to a location of your choice. Set YEPPPLIBDIR and YEPPPINCLUDEDIR to the
appropriate directories. Typically, for Linux-based systems on x86-64,
YEPPPLIBDIR will be yeppp-1.0.0/binaries/linux/x86_64/ and YEPPPINCLUDEDIR will
be yeppp-1.0.0/library/headers. To build, type

$ make YEPPPLIBDIR=... YEPPPINCLUDEDIR=...

To also disable OpenMP, add OMP=no to the make command.
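
As an illustration, the two Yeppp variables can be derived from the directory
where the archive was extracted. The function name build_with_yeppp is
hypothetical, the paths are the typical Linux x86-64 ones quoted above, and the
function is assumed to be run from the directory where make is normally invoked.

```shell
# Sketch only: pass the Yeppp locations to make.
# $1 (hypothetical) is the directory where yeppp-1.0.0 was extracted;
# any further arguments (for example OMP=no) are handed to make unchanged.
build_with_yeppp() {
    yeppp=$1
    shift
    make YEPPPLIBDIR="$yeppp/binaries/linux/x86_64/" \
         YEPPPINCLUDEDIR="$yeppp/library/headers" "$@"
}
```

For example, "build_with_yeppp /opt/yeppp-1.0.0 OMP=no" would build with Yeppp
and without OpenMP, assuming the archive was extracted under /opt.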

2. Predicting CNV regions
=========================

Running CNVnator involves the steps outlined below. Chromosome names and lengths
are parsed from the SAM/BAM file header. One can override this default behavior
by using the -genome option.

>>>EXTRACTING READ MAPPING FROM BAM/SAM FILES

$ ./cnvnator [-genome name] -root out.root [-chrom name1 ...] -tree [file1.bam ...]

out.root  -- output ROOT file. See the ROOT package documentation.
name1     -- chromosome name.
file1.bam -- BAM file(s).

Chromosome names must be specified exactly as they appear in the SAM/BAM
header, e.g., chrX or X. Multiple chromosomes can be specified, separated by
spaces. If no chromosome is specified, read mapping is extracted for all
chromosomes in the SAM/BAM file. Note that this requires a machine with a large
amount of physical memory (7 GB). Extracting read mapping for subsets of
chromosomes is a way around this issue. Also note that an existing ROOT file is
not overwritten. To obtain a correct q0 field for CNV calls (see below), use the
-unique option when extracting read mapping from BAM/SAM files.

Example:

./cnvnator -root NA12878.root -chrom 1 2 3  -tree NA12878_ali.bam

for BAM files with a header like this:
@HD VN:1.4    GO:none  SO:coordinate
@SQ SN:1      LN:249250621
@SQ SN:2      LN:243199373
@SQ SN:3      LN:198022430
...

or

./cnvnator -root NA12878.root -chrom chr1 chr2 chr3 -tree NA12878_ali.bam

for BAM files with a header like this:
@HD VN:1.4    GO:none  SO:coordinate
@SQ SN:chr1   LN:249250621
@SQ SN:chr2   LN:243199373
@SQ SN:chr3   LN:198022430
...

Example:

./cnvnator -root NA12878.root -chrom 4 5 6 -tree NA12878_ali.bam
./cnvnator -root NA12878.root -chrom 7 8 9 -tree NA12878_ali.bam

is equivalent to

./cnvnator -root NA12878.root -chrom 4 5 6 7 8 9 -tree NA12878_ali.bam
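
Since extraction into the same ROOT file can be split across runs, the batching
can be scripted. The following is a sketch, not part of the distribution: the
function name extract_in_batches, its argument order, and the CNVNATOR variable
(path to the executable) are assumptions made for illustration.

```shell
# Path to the executable; override CNVNATOR if it lives elsewhere.
CNVNATOR=${CNVNATOR:-./cnvnator}

# Sketch only: extract read mapping a few chromosomes at a time.
# Usage (hypothetical): extract_in_batches out.root file.bam batch_size chrom...
extract_in_batches() {
    root=$1; bam=$2; size=$3
    shift 3
    while [ "$#" -gt 0 ]; do
        batch=""
        n=0
        # Collect up to $size chromosome names for this run.
        while [ "$#" -gt 0 ] && [ "$n" -lt "$size" ]; do
            batch="$batch $1"
            shift
            n=$((n + 1))
        done
        # $batch is deliberately unquoted so each name is a separate argument.
        "$CNVNATOR" -root "$root" -chrom $batch -tree "$bam" || return 1
    done
}
```

With these assumptions, "extract_in_batches NA12878.root NA12878_ali.bam 3 4 5 6 7 8 9"
issues the same two commands as the example above. Add the -unique option to
the command inside the function if the q0 field is needed.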



>>>GENERATING A HISTOGRAM

$ ./cnvnator [-genome name] -root file.root [-chrom name1 ...] -his bin_size [-d dir]

This step is not memory-intensive and so can be done for all chromosomes
at once. It can, of course, also be carried out for a subset of chromosomes.
Files with chromosome sequences are required and should reside in the working
directory or in the directory specified by the -d option. Files should be named
chr1.fa, chr2.fa, etc.
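
If the reference is available only as a single multi-sequence FASTA file, one
way to obtain per-chromosome files is sketched below. This is an assumption
about your setup rather than a CNVnator feature: the function name split_fasta
is hypothetical, and the sequence names in the FASTA headers (e.g. ">chr1") are
assumed to already match the file names required above.

```shell
# Sketch only: write each sequence of a multi-FASTA file to <name>.fa,
# where <name> is the first word of the header line without the leading ">".
# Usage (hypothetical): split_fasta reference.fa [output_dir]
split_fasta() {
    ref=$1
    outdir=${2:-.}
    mkdir -p "$outdir" || return 1
    awk -v dir="$outdir" '
        # New header: close the previous file and switch to the next one.
        /^>/ { if (out != "") close(out); out = dir "/" substr($1, 2) ".fa" }
        # Header and sequence lines both go to the current file.
        out != "" { print > out }
    ' "$ref"
}
```

The output directory can then be passed to the histogram step with the -d
option.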



>>>CALCULATING STATISTICS

$ ./cnvnator -root file.root [-chrom name1 ...] -stat bin_size

This step must be completed before proceeding to partitioning and CNV calling.



>>>RD SIGNAL PARTITIONING

$ ./cnvnator -root file.root [-chrom name1 ...] -partition bin_size [-ngc]

The -ngc option specifies that the GC-corrected RD (read depth) signal should
not be used. Partitioning is the most time-consuming step.



>>>CNV CALLING

$ ./cnvnator -root file.root [-chrom name1 ...] -call bin_size [-ngc]

Calls are printed to STDOUT.
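
The histogram, statistics, partitioning, and calling steps can be chained so
that each runs only if the previous one succeeded. This is a sketch under
stated assumptions: the function name call_cnvs, its argument order, and the
CNVNATOR variable are illustrative, and cnvnator is assumed to return a
non-zero exit status on failure.

```shell
# Path to the executable; override CNVNATOR if it lives elsewhere.
CNVNATOR=${CNVNATOR:-./cnvnator}

# Sketch only: run the post-extraction steps in the required order.
# Usage (hypothetical): call_cnvs file.root bin_size [fasta_dir]
call_cnvs() {
    root=$1
    bin=$2
    fasta_dir=${3:-.}
    "$CNVNATOR" -root "$root" -his "$bin" -d "$fasta_dir" &&
    "$CNVNATOR" -root "$root" -stat "$bin" &&
    "$CNVNATOR" -root "$root" -partition "$bin" &&
    "$CNVNATOR" -root "$root" -call "$bin"
}
```

Since calls go to STDOUT, they can be saved with a redirect, e.g.
"call_cnvs NA12878.root 100 > calls.cnvnator".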

The output is as follows:

CNV_type coordinates CNV_size normalized_RD e-val1 e-val2 e-val3 e-val4 q0

normalized_RD -- read depth normalized to 1.
e-val1        -- calculated using t-test statistics.
e-val2        -- from the probability of RD values within the region being in
                 the tails of a Gaussian distribution describing frequencies
                 of RD values in bins.
e-val3        -- same as e-val1 but for the middle of the CNV.
e-val4        -- same as e-val2 but for the middle of the CNV.
q0            -- fraction of reads mapped with q0 (zero mapping) quality.

To obtain correct output in the q0 field, use the -unique option when extracting
read mapping from BAM/SAM files.
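
Because each call is one line with the nine fields listed above, the output can
be post-processed with standard tools. The sketch below keeps calls whose
e-val1 and q0 fall at or below given cutoffs. The function name filter_calls
and the default cutoffs (0.05 and 0.5) are arbitrary illustrations, not
recommended values, and fields are assumed to be separated by whitespace.

```shell
# Sketch only: filter CNV calls read from standard input.
# Usage (hypothetical): filter_calls [max_e_val1] [max_q0] < calls
filter_calls() {
    max_eval=${1:-0.05}   # illustrative cutoff for e-val1 (field 5)
    max_q0=${2:-0.5}      # illustrative cutoff for q0 (field 9)
    awk -v e="$max_eval" -v q="$max_q0" '
        # "+ 0" forces numeric comparison, including values such as 1e-10.
        NF >= 9 && ($5 + 0) <= e && ($9 + 0) <= q
    '
}
```

For example: ./cnvnator -root file.root -call 100 | filter_calls 0.01 0.5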



>>>MERGING ROOT FILES

$ ./cnvnator [-genome name] -root out.root [-chrom name ...] -merge file1.root ...

Merging can be used to combine read mappings extracted from multiple files.
Note that histogram generation, statistics calculation, signal partitioning, and
CNV calling must be completed (or redone) after merging.



>>>VISUALIZING SPECIFIED REGIONS

$ ./cnvnator -root file.root [-chrom name1 ...] -view bin_size [-ngc]

Once prompted, enter a genomic region, e.g.,
>12:11396601-11436500
 or
>chr12:11396601-11436500
 or 
>12 11396601 11436500
 or
>chr12 11396601 11436500

Additionally, one can specify the length of the flanking regions to display as
well (default is 10 kb), e.g.,
>12:11396601-11436500 100000
 or
>chr12:11396601-11436500 100000
 or
>12 11396601 11436500 100000
 or
>chr12 11396601 11436500 100000

One can also perform instant genotyping by adding the word 'genotype', e.g.,
>12:11396601-11436500 genotype
 or
>chr12:11396601-11436500 genotype
 or
>12 11396601 11436500 genotype
 or
>chr12 11396601 11436500 genotype



3. Genotyping genomic regions
=============================

For efficient genotype calculations, we recommend that you sort the list of
regions by chromosome.

$ ./cnvnator -root file.root -genotype bin_size [-ngc]

Once prompted, enter a genomic region, e.g.,
>12:11396601-11436500
 or
>chr12:11396601-11436500
 or 
>12 11396601 11436500
 or
>chr12 11396601 11436500

One can also perform instant visualization by adding the word 'view', e.g.,
>12:11396601-11436500 view
 or
>chr12:11396601-11436500 view
 or
>12 11396601 11436500 view
 or
>chr12 11396601 11436500 view

To genotype multiple regions, one can supply them on standard input, e.g.,
./cnvnator -root NA12878.root -genotype 100 << EOF
12:11396601-11436500
22:20999401-21300400
exit
EOF

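Another example, reading regions from the second column of calls.cnvnator
(the coordinates column in the output of the -call step):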

awk '{ print $2 } END { print "exit" }' calls.cnvnator | ./cnvnator -root NA12878.root -genotype 100
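
The recommendation to sort regions by chromosome can be combined with input
piping. The following is a sketch: the function name genotype_regions and the
CNVNATOR variable are illustrative, and regions are assumed to be given one per
line in the chrom:start-end form.

```shell
# Path to the executable; override CNVNATOR if it lives elsewhere.
CNVNATOR=${CNVNATOR:-./cnvnator}

# Sketch only: sort regions by chromosome (then by start), append "exit",
# and feed the list to the genotyping prompt.
# Usage (hypothetical): genotype_regions file.root bin_size < regions.txt
genotype_regions() {
    root=$1
    bin=$2
    {
        # LC_ALL=C keeps the chromosome ordering independent of the locale.
        LC_ALL=C sort -t: -k1,1 -k2,2n
        echo exit
    } | "$CNVNATOR" -root "$root" -genotype "$bin"
}
```

For example:
awk '{ print $2 }' calls.cnvnator | genotype_regions NA12878.root 100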







Please send your comments and suggestions to abyzov.alexej@mayo.edu.
