A toolkit for efficiently processing optical maps and related data (BNX molecules and CMAP maps)
Tools may eventually include, in order:
Existing features:
- .bnx and .cmap format parsers
- in silico digestion -> cmap
- alignment-based labeling (BAM) -> cmap
- in silico optical mapping (simulation) with a configurable error profile
- feature-based molecule/cmap alignment with DTW refinement
Future features:
- constructing consensus maps from pairwise molecule alignments
- structural variant prediction from alignments
All code is distributed under the MIT license.
Some BNX parsing and associated data structures were borrowed from https://github.com/yanlinlin82/bntools but have been significantly modified.
Requirements:
- zlib (https://zlib.net/)
- htslib (https://github.com/samtools/htslib)
Installation:
git clone https://github.com/txje/rekit
cd rekit/src
git clone https://github.com/attractivechaos/klib
cd ..
make
Usage: rekit [command] [options]
Commands:
align: align BNX molecules to reference CMAP
simulate: simulate molecules
digest: in silico digestion
label: produce alignment-based reference CMAP
Options:
align -bcqhtmd
simulate -frx --break-rate --fn --fp --min-frag --stretch-mean --stretch-std --source-output
digest -fr
label -a --coverage-threshold
-b: bnx: A single BNX file containing molecules
-c: cmap: A single CMAP file
-f: fasta: Reference sequence (FASTA) to simulate from or digest
-a: bam: BAM alignment file
-r: cutseq: Recognition/label site sequence
-q: Size of q-gram/k-mer to hash (default: 4)
-h: Number of hash functions to apply
-t: Minimum number of q-gram/cross-ratio anchors in a chain (default: 1)
-m: max_qgram_hits: Maximum occurrences of a q-gram before it is considered repetitive and ignored
-d: DTW score threshold to report alignment (default: 0.001)
-x: Simulated molecule coverage
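For readers unfamiliar with DTW, the following is a minimal Python sketch of classic dynamic time warping between two lists of fragment lengths. It is illustrative only and is not rekit's implementation: the cost function (absolute length difference), the unnormalized score, the use of fragment lengths as input, and the meaning given to the path characters ('.' advances both sequences, 'I' advances only the query, 'D' advances only the reference) are all assumptions.

```python
def dtw_align(query, ref):
    """Classic DTW over two non-empty lists of fragment lengths.

    Returns (score, path). The path alphabet is an assumed convention:
    '.' = both advance, 'I' = query only, 'D' = reference only.
    """
    n, m = len(query), len(ref)
    inf = float("inf")
    # cost[i][j] = best cost of aligning query[:i] with ref[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(query[i - 1] - ref[j - 1])  # assumed local cost
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])

    # Trace back from the end, preferring the diagonal on ties.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        steps = [
            (cost[i - 1][j - 1], ".", i - 1, j - 1),
            (cost[i - 1][j], "I", i - 1, j),
            (cost[i][j - 1], "D", i, j - 1),
        ]
        _, op, i, j = min(steps, key=lambda s: s[0])
        path.append(op)
    return cost[n][m], "".join(reversed(path))


score, path = dtw_align([10, 20, 30], [10, 20, 20, 30])
print(score, path)
```

In this sketch a lower score means a closer match. How rekit scales its score relative to the -d threshold is not specified here.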
simulate options (defaults based on empirical Saphyr data):
--break-rate: Probability of genome fragmentation per locus (default: 0.000005)
--fn: Probability of missed label at true restriction site (default: 0.09893)
--fp: Probability of false-positive label (default: 0.07558)
--stretch-mean: Fragment stretch mean (default: 0.991385)
--stretch-std: Fragment stretch standard deviation (default: 0.033733)
--min-frag: Minimum detectable fragment size (default: 500)
-s, --source-output: Output the reference positions of the simulated molecules to the given file
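To make the roles of these parameters concrete, here is a rough Python sketch of one way such an error model could be applied to the true label positions of a single molecule. It is not taken from rekit's source. In particular, the order of the steps, drawing false positives at a rate proportional to the number of true sites, placing them uniformly, and drawing one stretch factor per fragment are all assumptions. Fragmentation (--break-rate) is left out.

```python
import random


def simulate_labels(true_sites, length, rng,
                    fn=0.09893, fp=0.07558,
                    stretch_mean=0.991385, stretch_std=0.033733,
                    min_frag=500):
    """Return observed label positions for one molecule (hypothetical model)."""
    # Missed labels: drop each true site with probability fn.
    kept = [p for p in true_sites if rng.random() >= fn]

    # False positives (assumption): one Bernoulli(fp) trial per true site,
    # each success adding a label at a uniformly random position.
    n_false = sum(rng.random() < fp for _ in true_sites)
    kept.extend(rng.uniform(0, length) for _ in range(n_false))
    kept.sort()

    observed = []
    prev_true = 0.0      # previous label position before stretching
    prev_obs = 0.0       # previous label position after stretching
    last_reported = None
    for pos in kept:
        # Stretch each fragment independently (assumption).
        factor = max(0.0, rng.gauss(stretch_mean, stretch_std))
        prev_obs += (pos - prev_true) * factor
        prev_true = pos
        # Labels closer than min_frag to the last reported one are not resolved.
        if last_reported is None or prev_obs - last_reported >= min_frag:
            observed.append(prev_obs)
            last_reported = prev_obs
    return observed


rng = random.Random(0)
print(simulate_labels([1000, 1200, 5000, 9000], 10000, rng))
```

Setting fn and fp to 0, the stretch mean to 1 and the stretch standard deviation to 0 leaves only the min_frag filter active, which is a convenient way to check each effect in isolation.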
label options:
--coverage-threshold: Read coverage required (in ~300bp window) to call a label site (default: 10)
To simulate 100x coverage from a reference genome (fasta) using the DLE-1 recognition site:
rekit simulate -f <fasta> -r CTTAAG -x 100 -s <output_truth> > <output_bnx>
Simulate 10x coverage from the human reference genome with DLE-1 (should take <1 minute):
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.39_GRCh38.p13/GCF_000001405.39_GRCh38.p13_genomic.fna.gz
rekit simulate -f GCF_000001405.39_GRCh38.p13_genomic.fna.gz -r CTTAAG -x 10 -s GRCh38_rekit_10x_truth.tsv > GRCh38_rekit_10x.bnx
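The -r argument in the commands above is the recognition sequence whose occurrences become label sites. As a small illustration of what in silico digestion does conceptually, the Python sketch below scans a sequence for the motif on both strands. The function names are hypothetical, and reporting the 0-based start of each match (rather than a nick or cut offset within the motif) is an assumption, not a description of rekit's output.

```python
def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_label_sites(sequence, cutseq):
    """Return sorted 0-based start positions of cutseq on either strand."""
    sequence = sequence.upper()
    cutseq = cutseq.upper()
    # A palindromic motif such as CTTAAG equals its own reverse complement,
    # so the set holds a single entry in that case.
    motifs = {cutseq, reverse_complement(cutseq)}
    sites = set()
    for motif in motifs:
        start = sequence.find(motif)
        while start != -1:
            sites.add(start)
            # Step by one so overlapping matches are also found.
            start = sequence.find(motif, start + 1)
    return sorted(sites)


print(find_label_sites("AACTTAAGGGCTTAAGT", "CTTAAG"))  # [2, 10]
```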
The align command writes tab-delimited text to stdout with the following fields:
- Query (molecule) ID
- Reference map ID
- Reversed? (query)
- Query labels start index
- Query labels end index
- Query labels length
- Query label start position
- Query label end position
- Query total length
- Ref labels start index
- Ref labels end index
- Ref labels length
- Ref label start position
- Ref label end position
- Ref total length
- DTW alignment score
- DTW path string {'.', 'D', 'I'}
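A short Python sketch for reading this output follows. The field names are hypothetical labels chosen to mirror the list above, values are kept as raw strings because the exact numeric formats are not specified here, and skipping rows that do not have exactly 17 columns is an assumption.

```python
import csv
from collections import Counter

# Hypothetical names for the 17 columns, in the order listed above.
ALIGN_FIELDS = [
    "query_id", "ref_id", "reversed",
    "query_labels_start_idx", "query_labels_end_idx", "query_labels_length",
    "query_label_start_pos", "query_label_end_pos", "query_total_length",
    "ref_labels_start_idx", "ref_labels_end_idx", "ref_labels_length",
    "ref_label_start_pos", "ref_label_end_pos", "ref_total_length",
    "dtw_score", "dtw_path",
]


def read_alignments(lines):
    """Yield one dict per alignment row from an iterable of text lines."""
    for row in csv.reader(lines, delimiter="\t"):
        if len(row) != len(ALIGN_FIELDS):
            continue  # skip anything that is not a full alignment row
        yield dict(zip(ALIGN_FIELDS, row))


def path_summary(path):
    """Count how often each of '.', 'D' and 'I' occurs in a DTW path string."""
    return Counter(path)


example = "\t".join(["mol1", "chr1", "0", "2", "6", "5", "1500", "42000",
                     "50000", "10", "15", "6", "100000", "141000",
                     "2000000", "0.0004", "..D.."])
for aln in read_alignments([example]):
    print(aln["query_id"], aln["ref_id"], path_summary(aln["dtw_path"]))
```

To process real output, pass an open file handle (or sys.stdin) to read_alignments in place of the example list.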