Mesh89 / SurVIndel

SurVIndel

Compiling

Compiling SurVIndel requires g++ 4.7.2.

htslib 1.7 is included in the source as a zipped archive. Build it first using the provided script:

./build_htslib.sh

If htslib does not build correctly, please refer to https://github.com/samtools/htslib

Then, run

cmake -DCMAKE_BUILD_TYPE=Release . && make

Preparing the reference

The reference fasta file should be indexed by both bwa and samtools. For example, assuming the file is hg19.fa, you should run

bwa index hg19.fa
samtools faidx hg19.fa
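
Before running SurVIndel it can be useful to confirm that both indexing steps completed. The following is a minimal sketch, not part of SurVIndel itself: the function name is hypothetical, and the list of suffixes assumes a BWA 0.7.x release, where `bwa index` writes `.amb`, `.ann`, `.bwt`, `.pac` and `.sa` files next to the FASTA, while `samtools faidx` writes `.fai`.

```python
import os

# Assumption: suffixes written next to the FASTA by `bwa index` (BWA 0.7.x)
# and by `samtools faidx`. Other BWA versions may use different file names.
BWA_SUFFIXES = (".amb", ".ann", ".bwt", ".pac", ".sa")
FAIDX_SUFFIX = ".fai"


def missing_index_files(fasta_path):
    """Return the expected index files that do not exist for fasta_path."""
    expected = [fasta_path + suffix for suffix in BWA_SUFFIXES + (FAIDX_SUFFIX,)]
    return [path for path in expected if not os.path.exists(path)]
```

Calling `missing_index_files("hg19.fa")` returns an empty list when every expected index file is present, and otherwise lists the files that still need to be generated.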

Although not mandatory, SurVIndel generally gives higher-quality results if a simple repeats file is provided. This can normally be downloaded from the simpleRepeats table in the UCSC Genome Browser. The header must be removed and only the chromosome, start, end and period columns retained, i.e.:

cat downloaded-file | grep -v "#" | cut -f2,3,4,6 > file-for-survindel.bed
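
For readers who prefer to do this step in Python, the same filtering can be sketched as below. This is an illustrative equivalent of the shell pipeline, not code shipped with SurVIndel; the function name is hypothetical, and it assumes the downloaded table is tab-separated with the chromosome, start, end and period in columns 2, 3, 4 and 6, as the `cut` command above implies. Unlike `cut`, it skips lines with fewer than six columns.

```python
def simple_repeats_to_bed(lines):
    """Yield 'chrom<TAB>start<TAB>end<TAB>period' strings from table rows."""
    for line in lines:
        # Mirrors `grep -v "#"`: drop any line containing a '#', e.g. the header.
        if "#" in line:
            continue
        fields = line.rstrip("\n").split("\t")
        # Skip blank or truncated lines (a deviation from `cut`, which would
        # print whatever fields exist).
        if len(fields) < 6:
            continue
        # `cut -f2,3,4,6` is 1-based, so take 0-based indices 1, 2, 3 and 5.
        yield "\t".join(fields[i] for i in (1, 2, 3, 5))
```

To write the result to disk, iterate over an open input file and write each yielded string followed by a newline to `file-for-survindel.bed`.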

Alternatively, you can run TRF (https://tandem.bu.edu/trf/trf.html) and use the provided trf-to-bed.sh, i.e.:

cat trf-output.dat | ./trf-to-bed.sh > file-for-survindel.bed

Preparing the BAM file

SurVIndel has so far only been tested on BAM files generated by BWA MEM, so we recommend using it for alignment. The BAM file should also be processed with Picard FixMateInformation (http://broadinstitute.github.io/picard/command-line-overview.html#FixMateInformation); in particular, its reads should carry the MQ and MC tags. Finally, the file should be sorted and indexed as usual using samtools.

Supposing file.bam is the file resulting from the alignment:

java -jar picard.jar FixMateInformation I=file.bam
samtools sort file.bam > sorted.bam
samtools index sorted.bam
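
A quick way to confirm that the MQ and MC tags are present is to inspect a few records in text form, for example lines printed by `samtools view sorted.bam`. The helper below is a minimal sketch with a hypothetical name; it only parses tab-separated SAM text and is not part of SurVIndel. Note that unpaired reads, or reads whose mate is unmapped, may legitimately lack these tags.

```python
def missing_mate_tags(sam_line):
    """Return which of the MC and MQ tags are absent from one SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    # The first 11 SAM columns are mandatory; optional fields follow
    # in TAG:TYPE:VALUE form, so the tag name is the part before the first ':'.
    present = set(field.split(":", 1)[0] for field in fields[11:])
    return set(["MC", "MQ"]) - present
```

An empty set means the record carries both tags; a non-empty set names the tags that are missing.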

Running

Once the C++ code is compiled, SurVIndel can be run. Python and the libraries NumPy (http://www.numpy.org/), PyFaidx (https://github.com/mdshw5/pyfaidx) and PySam (https://github.com/pysam-developers/pysam) are required. Python 2.7, NumPy 1.10, PyFaidx 0.4 and PySam 0.12 are the recommended versions.

The bare minimum command for running SurVIndel is

python surveyor.py /path/to/bamfile /an/empty/working/directory /path/to/reference/fasta

Other parameters that may be important are the number of threads, the locations of the bwa and samtools executables, and a simple repeats catalogue (which can be downloaded from the UCSC Genome Browser, or generated by TRF if one is not available).

python surveyor.py /path/to/bamfile /an/empty/working/directory /path/to/reference/fasta --threads 40 --samtools /path/to/samtools --bwa /path/to/bwa --simple-rep /path/to/simple/repeats/file

After SurVIndel has been successfully run, the calls can be retrieved with the command

./filter /path/to/working/directory alpha-value score-cutoff min-size simple-repeats

Here, alpha-value is the maximum p-value for an indel to be accepted, score-cutoff is the positive-to-negative ratio cutoff, min-size is the minimum size for an indel to be reported and simple-repeats is the simple repeats file. The recommended values are 0.01 for alpha-value and 0.33 for score-cutoff. Larger alpha-values and lower score-cutoffs will yield more predictions, but at the expense of precision.
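
When scripting a full run, the two commands above can be assembled programmatically. The sketch below only builds argument lists from the options documented in this README; the function names are hypothetical, and the defaults for `alpha` and `score_cutoff` are simply the recommended values given above. No default is assumed for `min_size`, since none is recommended here.

```python
def surveyor_command(bam, workdir, reference,
                     threads=None, samtools=None, bwa=None, simple_rep=None):
    """Build the argument list for the surveyor.py step."""
    cmd = ["python", "surveyor.py", bam, workdir, reference]
    optional = (("--threads", threads), ("--samtools", samtools),
                ("--bwa", bwa), ("--simple-rep", simple_rep))
    # Only append the optional flags that were actually supplied.
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


def filter_command(workdir, min_size, simple_repeats,
                   alpha=0.01, score_cutoff=0.33):
    """Build the argument list for the ./filter step, in positional order."""
    return ["./filter", workdir, str(alpha), str(score_cutoff),
            str(min_size), simple_repeats]
```

Each list can be passed to `subprocess.check_call` from the SurVIndel source directory, running the surveyor step first and the filter step once it has finished successfully.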
