SRTk is a collection of scripts to find the split-read alignments of sequences using an inexact algorithm. SRTk is under active development and we hope to release it as part of a larger package in the coming months.
SRTk should work on any standard 64-bit Linux environment with:
- GCC
- Python version >= 2.7
- Cython
The following two Python libraries should also be installed on the system:
- pysam
- numpy
SRTk uses LASTZ (http://www.bx.psu.edu/~rsharris/lastz/newer) >= 1.03.66 to align the sequences to the reference genome.
Python and GCC are normally pre-installed on most Linux systems. If not, please install them. Cython is also required and might need to be installed separately on some systems.
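The prerequisites listed above can be checked before building. The following is a minimal sketch, not part of SRTk: the function name `missing_prerequisites` is hypothetical, it assumes a Python 3 interpreter (for `importlib.util.find_spec`), and it assumes the LASTZ binary is on the PATH under the name `lastz_32`, which is the default of the `--lastz` option. It does not check version numbers.

```python
import importlib.util
import shutil


def missing_prerequisites(commands=("gcc", "lastz_32"),
                          modules=("Cython", "pysam", "numpy")):
    """Return (missing_commands, missing_modules) for the given names.

    Hypothetical helper: a command counts as present if it is found on
    the PATH, a module if Python can locate it without importing it.
    """
    missing_commands = [c for c in commands if shutil.which(c) is None]
    missing_modules = [m for m in modules
                       if importlib.util.find_spec(m) is None]
    return missing_commands, missing_modules


if __name__ == "__main__":
    cmds, mods = missing_prerequisites()
    if cmds:
        print("Commands not found on PATH: " + ", ".join(cmds))
    if mods:
        print("Python modules not found: " + ", ".join(mods))
    if not cmds and not mods:
        print("All listed prerequisites were found.")
```

If LASTZ is installed under a different name or outside the PATH, pass that name in `commands` or rely on the `--lastz` option of sr_sam instead.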
Please do the following in the top level directory of the distribution:
make
This should create a bin folder and copy all the required binaries to it.
Please make sure that the following files are in the bin folder:
align.sh
align_reads_with_lz
combine_alignments
combine_alignments.so
find_best_split
find_best_split.so
sr_sam
sr_sam.so
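The file list above can also be verified with a short script. This is only a sketch: the helper `missing_bin_files` is hypothetical, and it assumes it is run from the top level directory of the distribution so that the folder is reachable as `bin`.

```python
import os

# File names taken from the list above.
EXPECTED_FILES = [
    "align.sh",
    "align_reads_with_lz",
    "combine_alignments",
    "combine_alignments.so",
    "find_best_split",
    "find_best_split.so",
    "sr_sam",
    "sr_sam.so",
]


def missing_bin_files(bin_dir="bin", expected=EXPECTED_FILES):
    """Return the expected file names that are not present in bin_dir."""
    return [name for name in expected
            if not os.path.isfile(os.path.join(bin_dir, name))]


if __name__ == "__main__":
    missing = missing_bin_files()
    if missing:
        print("Missing from bin: " + ", ".join(missing))
    else:
        print("All expected files are present in bin.")
```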
SRTk is a collection of scripts to output a SAM file with the split-read alignments of sequences in a sample. The script of interest is called sr_sam.
Find the best split alignments for reads that carry signatures of SVs.
usage:
sr_sam [options] reference.fa [splitters.bam] unmapped.fq.gz
where the options are:
-h,--help : print usage and quit
-d,--debug : print debug information
-m,--maxsplits : allow up to maxsplits primary and supplemental
alignments [2]
-c,--coverage : require at least this fraction of the read to be aligned
in the primary alignment [0.5]
-l,--lastz : the path to the LASTZ (32 bit version) binary [lastz_32]
-t,--threads : number of threads to use [1]
-1,--onlylz : only use LASTZ. default is to use both BWA and LASTZ
-v,--version : print version of the code and exit
1. reference.fa is the reference sequence in fasta format.
2. splitters.bam is an indexed BAM file with the clipped reads
3. unmapped.fq.gz is a gzipped FASTQ file of sequences that did not align
to the reference
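When sr_sam is called from another script, the command line described above can be assembled programmatically. The sketch below only builds the argument list and does not run anything. The function `build_sr_sam_command` and the path `bin/sr_sam` are hypothetical, and it assumes that each option with a default in brackets takes its value as the next argument, which the usage text suggests but does not state.

```python
def build_sr_sam_command(reference, unmapped, splitters=None,
                         maxsplits=None, coverage=None, lastz=None,
                         threads=None, only_lastz=False, debug=False,
                         sr_sam="bin/sr_sam"):
    """Build an sr_sam argument list from the documented options.

    Options left as None are omitted so that sr_sam uses its defaults.
    """
    cmd = [sr_sam]
    if debug:
        cmd.append("-d")
    if maxsplits is not None:
        cmd += ["-m", str(maxsplits)]
    if coverage is not None:
        cmd += ["-c", str(coverage)]
    if lastz is not None:
        cmd += ["-l", lastz]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if only_lastz:
        cmd.append("-1")
    # Positional arguments follow the order in the usage line;
    # splitters.bam is shown there as optional.
    cmd.append(reference)
    if splitters is not None:
        cmd.append(splitters)
    cmd.append(unmapped)
    return cmd


if __name__ == "__main__":
    print(" ".join(build_sr_sam_command(
        "reference.fa", "unmapped.fq.gz",
        splitters="splitters.bam", threads=4)))
```

The resulting list can be passed to `subprocess.run` once the options have been checked against `sr_sam --help` on the installed version.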
This script collects all the split read alignments from BWA or some other aligner. It also uses LASTZ to align the unmapped reads to the reference, and outputs the best set of split alignments that cover the query. The output can then be given as input to LUMPY to find the SVs.
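To make the idea of alignments that "cover the query" concrete, the sketch below computes how much of a read is covered by a set of alignments and applies a threshold like the one described for --coverage. It is an illustration only and not SRTk's actual selection algorithm: the function names are hypothetical, and alignments are assumed to be given as half-open (start, end) intervals in read coordinates.

```python
def covered_fraction(read_length, intervals):
    """Fraction of the read covered by the union of the intervals.

    Each interval is a half-open (start, end) pair on the read.
    """
    if read_length <= 0:
        return 0.0
    covered = 0
    current_end = 0
    for start, end in sorted(intervals):
        # Count only the bases not already covered by earlier intervals.
        start = max(start, current_end, 0)
        end = min(end, read_length)
        if end > start:
            covered += end - start
            current_end = end
    return covered / read_length


def passes_primary_coverage(read_length, primary, min_fraction=0.5):
    """Check that the primary alignment spans at least min_fraction
    of the read, mirroring the description of --coverage."""
    start, end = primary
    return (end - start) / read_length >= min_fraction


if __name__ == "__main__":
    # A 100 bp read split into two overlapping alignments.
    pieces = [(0, 60), (50, 100)]
    print(covered_fraction(100, pieces))
    print(passes_primary_coverage(100, pieces[0]))
```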
A test dataset is provided with the distribution in the tests folder.
Run
make
and that should run some simple tests to make sure that the program is working as expected.