yhwu / rsicnv

Detect copy number variations based on read depth of whole genome sequencing

Synopsis

RSICNV detects copy number variations (CNVs) from the read depths of high-coverage whole genome sequencing data, using the robust segment identification (RSI) algorithm with negative binomial transformations.

Code Example

rsicnv                                   #help
rsicnv rsi  -f $REF -b $BAM -o $CNVFILE  #detect CNV from a BAM file
rsicnv plot -f $REF -b $BAM -v $CNVFILE  #plot CNVs in $CNVFILE

Installation

If you have Git installed, you can download and compile the code with

git clone https://github.com/yhwu/rsicnv.git
cd rsicnv 
make

or you can download the files directly with

wget https://github.com/yhwu/rsicnv/archive/master.zip -O rsicnv.zip
unzip rsicnv.zip 
cd rsicnv-master
make

Note: compiling requires the gcc and g++ compilers and the math and zlib libraries (-lm -lz). SAMtools and ALGLIB are included in the download because rsicnv links against their libraries; they are not needed once rsicnv has been built.

Tests

This program requires a BAM file and the corresponding reference fasta file. The BAM file must be sorted and both must be indexed by samtools.
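
The requirement above can be checked before a run. The following is a minimal sketch, not part of rsicnv: `check_inputs` is a hypothetical helper name, and it assumes the index files sit next to the data files under the names samtools normally writes (`FILE.bam.bai` or `FILE.bai` for the BAM, `REF.fai` for the fasta). It does not verify that the BAM file is sorted.

```shell
#!/bin/sh
# check_inputs BAM REF (hypothetical helper, not shipped with rsicnv)
# Returns 0 only if both files and their samtools indexes exist.
check_inputs() {
    bam=$1
    ref=$2
    [ -f "$bam" ] || { echo "missing BAM: $bam" >&2; return 1; }
    [ -f "$ref" ] || { echo "missing reference: $ref" >&2; return 1; }
    # samtools index writes FILE.bam.bai; some pipelines use FILE.bai instead
    [ -f "$bam.bai" ] || [ -f "${bam%.bam}.bai" ] || {
        echo "missing BAM index; run: samtools index $bam" >&2
        return 1
    }
    # samtools faidx writes REF.fai next to the fasta file
    [ -f "$ref.fai" ] || {
        echo "missing fasta index; run: samtools faidx $ref" >&2
        return 1
    }
    return 0
}
```

A possible use is `check_inputs $BAM $REF && rsicnv rsi -f $REF -b $BAM -o $CNVFILE`, so that the detection step only starts when both indexes are in place.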

#1. download a test bam file
wget ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/pilot2_high_cov_GRCh37_bams/data/NA12878/alignment/NA12878.chrom19.ILLUMINA.bwa.CEU.high_coverage.20100311.bam*

#2. download hg19(b37) reference
wget ftp://ftp.sanger.ac.uk/pub/1000genomes/tk2/main_project_reference/human_g1k_v37.fasta.gz
gunzip -c human_g1k_v37.fasta.gz > b37.fasta
samtools faidx b37.fasta

#3. detect CNV
rsicnv rsi  -f b37.fasta -b NA12878.chrom19.ILLUMINA.bwa.CEU.high_coverage.20100311.bam -o test.19.txt

#4. plot CNV
rsicnv plot -f b37.fasta -b NA12878.chrom19.ILLUMINA.bwa.CEU.high_coverage.20100311.bam -v test.19.txt

#5. check overlap with a reference CNV set
checkbp.pl        NA12878hg19_cleaned.txt test.19.txt
checkbp.pl -p 200 NA12878hg19_cleaned.txt test.19.txt

Output fields

1 CHROM chromosome name
2 START start (lower) position, 1-based
3 END end (higher) position, 1-based
4 TYPE DEL or DUP
5 SCORE Phred quality score
6 LENGTH length of variation
7 CNV_MED median of read depths in CNV region
8 CNV_SD standard deviation of read depths in CNV region
9 NEIGHBOR_MED median of read depths in the neighboring regions before and after the CNV region, each twice as long as the CNV
10 NEIGHBOR_RUNMEANSD standard deviation of read depths in neighboring regions
11 CHR_MED median of read depths of whole chromosome
12 CHR_SD standard deviation of read depths of whole chromosome
13 RP number of read pairs that overlap the CNV region, with the overlap covering at least 50% of both; read pairs must also be outliers (3 SD away) in insert length
14 Q0 fraction of mapq=0 reads in CNV region
15 METHOD RSI
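
The field list above can be used to filter calls after a run. The sketch below is an illustration only: `filter_cnv` is a hypothetical helper, and it assumes the output file is whitespace-delimited with the fields in the order listed, so that TYPE, SCORE and LENGTH are columns 4, 5 and 6. Any header line is skipped implicitly because its fourth column would not equal DEL or DUP.

```shell
#!/bin/sh
# filter_cnv FILE TYPE MIN_SCORE MIN_LENGTH (hypothetical helper)
# Prints the rows of FILE whose TYPE matches and whose SCORE and LENGTH
# are at least the given thresholds.
# Assumes whitespace-delimited columns in the documented order.
filter_cnv() {
    awk -v type="$2" -v minscore="$3" -v minlen="$4" '
        $4 == type && $5 + 0 >= minscore && $6 + 0 >= minlen
    ' "$1"
}
```

For example, `filter_cnv test.19.txt DEL 30 1000` would keep deletions of at least 1000 bases with SCORE of 30 or more. On the standard Phred scale a score of 30 corresponds to an error probability of 0.001; whether rsicnv scores are calibrated that way is not stated here, so the thresholds are illustrative.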

System requirement

Linux. Peak virtual memory usage is less than 1 GB. Running time should be 1 or 2 minutes longer than samtools mpileup without the -f option.

Help

[yhwu@debian rsicnv]$ ./rsicnv 
Usage:

1. detect CNV

   rsicnv rsi <options> [-b BAMFILE | -d RDFILE -c RNAME ] -f REFFILE 

Options:
   -m   INT  bin size, default=101
   -q   INT  minimum mapping quality, default=0
   -Q   INT  minimum base quality, default=10
   -cap INT  cap read depth at INT*median, default=4
             if INT<0, do not cap read depth
   -NOGC     do not adjust GC content, default=adjust
   -MED      only use median transformation 
   -NB       only use negative binomial transformation (default)
   -s        save read depths before capping and GC adjustment 
   -o   STR  output file, default=rsiout.txt 
   -p   STR  output plot folder, default=cnv_plots
   -np       do not plot CNV

Note:
   The reference file that was used for mapping is needed.
   Input can be given either as a BAM file or a read depth file. 
In the latter case, the chromosome name must be given with -c RNAME. 
A read depth file can be generated with samtools mpileup BAM | cut -f2,4  
keeping only the position and read depth fields. If -c RNAME is given with 
-b BAMFILE, only chromosome RNAME will be processed. By default, samtools 
mpileup does not count duplicated reads, bases with base quality less than 
13, or paired reads far apart. Rsicnv adopts the same rules except the last 
one.

Example:
    rsicnv rsi -f b37.fasta -b $BAM -o all.rsi
    rsicnv rsi -f b37.fasta -b $BAM -c $CHR -o $CHR.rsi
    rsicnv rsi -f b37.fasta -d $RD  -c $CHR -o $CHR.rsi
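
The read depth file described in the note can be prepared once and reused with -d. The sketch below is an illustration under stated assumptions: `pileup_to_rd` and `make_rd` are hypothetical helper names, and samtools is assumed to be on the PATH. It relies on the samtools mpileup text layout (chromosome, 1-based position, reference base, depth, read bases, base qualities) and on the mpileup -r option to restrict output to one chromosome.

```shell
#!/bin/sh
# pileup_to_rd (hypothetical helper): read samtools mpileup text on stdin
# and keep only column 2 (position) and column 4 (read depth).
pileup_to_rd() {
    cut -f2,4
}

# make_rd BAM CHR OUT (hypothetical helper): write the read depth file
# for chromosome CHR of BAM to OUT. Requires samtools and an indexed BAM.
make_rd() {
    samtools mpileup -r "$2" "$1" | pileup_to_rd > "$3"
}
```

A possible use is `make_rd $BAM 19 19.rd` followed by `rsicnv rsi -f b37.fasta -d 19.rd -c 19 -o 19.rsi`, which keeps the pileup step separate from detection so that different options can be tried without rereading the BAM file.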

2. plot CNV

   rsicnv plot <options> [-b BAMFILE | -d RDFILE -c RNAME ] -f REFFILE -v CNVFILE 

Options:
   -p   STR  output folder, default=cnv_plots

Note:
Rsicnv requires gnuplot to plot the figures. By default the figures are 
saved in the ./cnv_plots directory in postscript format. If ImageMagick is 
available, the ps files will be converted to the png format. GC content is 
not adjusted.

Example:
    rsicnv plot -f b37.fasta -b $BAM -v $CNV -p $PLOTFOLDER

AUTHOR

Yinghua Wu, Department of Biostatistics and Epidemiology, University of Pennsylvania, Philadelphia, PA 19104

License

GNU General Public License, version 3.0 (GPLv3)
