Strobemers

A repository for generating and evaluating strobemers. A preprint of strobemers is found here.

This repository

The repository consists of

a C++ library
a python library
a tool StrobeMap implemented in both C++ and Python.

The C++ library strobemers_cpp/index.[cpp/hpp] contains functions for creating randstrobes (order 2 and 3), hybridstrobes (order 2 and 3) and minstrobes (order 2). The Python library indexing.py contains functions and generators for creating all strobemers of order 2 and 3.

The tool StrobeMap is a program with roughly the same interface as MUMmer. StrobeMap takes a reference file and a query file in FASTA or FASTQ format. It produces NAMs (Non-overlapping Approximate Matches) between the queries and the references and outputs them in a format similar to nucmer/MUMmer. See the preprint for the definition of NAMs.

Other implementations of strobemers

Other strobemer implementations are found here

The construction time is dependent on the implementation. The times reported in the preprint are for the C++/Python implementations in this repository.

C++ functions

You can copy the index.cpp and index.hpp files in the strobemers_cpp folder in this repository if you want to use randstrobes (order 2 and 3), hybridstrobes (order 2), or minstrobes (order 2) in your project. The implementation of these functions uses bitpacking and some other clever tricks (inspired by this repo) to be fast. Because of bitpacking, no single strobe can be longer than 32 nucleotides, which means that the maximum strobemer length allowed in this implementation is 3*32 = 96 for strobemers of order 3, and 2*32 = 64 for order 2. This should be large enough for most applications.
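To see where a limit of 32 can come from, here is a minimal Python sketch of bitpacking. It assumes the common 2-bit encoding of A, C, G, T into a 64-bit word; the names ENCODING, MAX_K and pack_kmer are hypothetical and this is not the code in index.cpp.

```python
# Hypothetical illustration of 2-bit packing; not the repository's implementation.
ENCODING = {"A": 0, "C": 1, "G": 2, "T": 3}
MAX_K = 32  # 32 nucleotides * 2 bits each = 64 bits


def pack_kmer(kmer: str) -> int:
    """Pack a DNA string into an integer using 2 bits per nucleotide."""
    if len(kmer) > MAX_K:
        raise ValueError(f"length {len(kmer)} does not fit in 64 bits (max {MAX_K})")
    packed = 0
    for base in kmer:
        # Shift the previous bases left and append the 2-bit code of this base.
        packed = (packed << 2) | ENCODING[base]
    return packed


print(pack_kmer("ACGT"))                  # 27, i.e. 0b00011011
print(pack_kmer("T" * 32) == 2**64 - 1)   # True: all 64 bits are used
```

A strobe of 33 nucleotides would need 66 bits, which no longer fits in one 64-bit word under this encoding.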

The functions in the index.cpp file can be used as follows:

#include <cstdint>
#include <string>
#include <tuple>
#include <vector>
#include "index.hpp"

typedef std::vector< std::tuple<uint64_t, unsigned int, unsigned int, unsigned int, unsigned int>> strobes_vector;

int main()
{
    strobes_vector randstrobes3; // (kmer hash value, seq_id, strobe1_pos, strobe2_pos, strobe3_pos)
    std::string seq = "ACGCGTACGAATCACGCCGGGTGTGTGTGATCGGGGCTATCAGCTACGTACTATGCTAGCTACGGACGGCGATTTTTTTTCATATCGTACGCTAGCTAGCTAGCTGCGATCGATTCG";
    // The parameter types below are assumed; check the declarations in index.hpp.
    int n = 3;      // number of strobes
    int k = 15;     // strobe length
    int w_min = 16; // w_min offset
    int w_max = 30; // w_max offset
    unsigned int seq_id = 0; // using integers for compactness, you can store a vector with accessions v = [acc_chr1, acc_chr2,...] then seq_id = 0 means v[0].

    randstrobes3 = seq_to_randstrobes3(n, k, w_min, w_max, seq, seq_id);
    for (auto &t : randstrobes3) // iterate over the strobemer tuples
    {
        uint64_t strobemer_hash = std::get<0>(t);
        unsigned int strobe1_pos = std::get<2>(t);
        unsigned int strobe2_pos = std::get<3>(t);
        unsigned int strobe3_pos = std::get<4>(t);
        // if you want the actual strobemer sequences:
        std::string randstrobe = seq.substr(strobe1_pos, k) + seq.substr(strobe2_pos, k) + seq.substr(strobe3_pos, k);
    }
    return 0;
}

If you are using any of seq_to_randstrobes2, seq_to_hybridstrobes2, or seq_to_minstrobes2, they return the same vector of tuples but with the position of strobe 2 copied twice, i.e., (kmer hash value, seq_id, strobe1_pos, strobe2_pos, strobe2_pos). There is no reason for this other than that it let me call the functions in the same way.

My benchmarks indicate that randstrobes are roughly as fast to generate as hybridstrobes and minstrobes, and that randstrobes are unexpectedly fast in this implementation in general: about 1.5-3 times slower than generating k-mers for randstrobes with (n=2, k=15, w_min=16, w_max=70). Most of the time is spent pushing the tuples to the vector, not computing the strobemers. A more detailed investigation will follow.

Notes for sequence mapping

The preprint describes shrinking the windows at the ends of sequences to ensure that a similar number of strobemers and k-mers are created. For read mapping, for example, there is little to no need to shrink windows. If we modify the windows at the ends, the windows used to extract a strobemer from the read and from the reference will differ, so the strobemers at the ends are not guaranteed to match the reference, as first described in this issue. Therefore, in my implementation, there will be n - (k + w_min) +1 strobemers of order 2 generated from a sequence of length n (here n denotes the sequence length, not the number of strobes), and n - (k + w_max + w_min) +1 strobemers of order 3. In other words, we only slide the last strobe's window outside the sequence. Once it is fully outside the sequence, we stop (illustrated in approach B for order 2 here).
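As a quick worked example of the two counts above, the following Python sketch evaluates the formulas directly. The function name expected_strobemer_count is hypothetical, and clamping the result at zero for sequences that are too short is my assumption.

```python
def expected_strobemer_count(seq_len: int, k: int, w_min: int, w_max: int, order: int) -> int:
    """Number of strobemers per the formulas stated in this section."""
    if order == 2:
        count = seq_len - (k + w_min) + 1
    elif order == 3:
        count = seq_len - (k + w_max + w_min) + 1
    else:
        raise ValueError("only order 2 and 3 are covered by the formulas")
    # Assumption: a sequence that is too short yields no strobemers.
    return max(count, 0)


# A sequence of length 100 with k=15, w_min=16, w_max=30:
print(expected_strobemer_count(100, 15, 16, 30, order=2))  # 100 - 31 + 1 = 70
print(expected_strobemer_count(100, 15, 16, 30, order=3))  # 100 - 61 + 1 = 40
```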

Python functions

The indexing.py module located in the modules folder contains functions for generating k-mers, spaced k-mers, minimizers, and strobemers (minstrobes, hybridstrobes, and randstrobes) of order 2 and 3. For randstrobes, there are two ways to create them. The first way is with the function randstrobes, which takes a string, k-mer size, and upper and lower window limits and returns a dictionary with positions of the strobes as keys and the hash value of the randstrobe sequence (strings) as values. For example

from collections import defaultdict
from modules import indexing

# seq, k_size, w_min and w_max are assumed to be defined by the caller
all_mers = defaultdict(list)
for (p1,p2,p3), h in indexing.randstrobes(seq, k_size, w_min, w_max, order = 3).items():
    # all_mers is a dictionary with hash values as keys and 
    # a list with position-tuples of where the strobemer is sampled from
    all_mers[h].append( (p1,p2,p3) )

Functions minstrobes and hybridstrobes have the same interface.

The second way is to call randstrobes_iter, which is a generator. Like randstrobes, randstrobes_iter takes a string, k-mer size, and upper and lower window size, but it yields randstrobes from the sequence one at a time and therefore needs less memory than the randstrobes function, which stores and returns all the strobes in a dictionary. To generate randstrobes of order 2, randstrobes_iter can be used as follows:

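from collections import defaultdict
from modules import indexing

# seq, k_size, w_min and w_max are assumed to be defined by the caller
all_mers = defaultdict(list)
for (p1,p2), s in indexing.randstrobes_iter(seq, k_size, w_min, w_max, order = 2, buffer_size = 1000000):
    all_mers[s].append( (p1,p2) )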

Functions minstrobes_iter and hybridstrobes_iter have the same interface.
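The difference between the dictionary-returning and generator interfaces can be sketched with plain k-mers. The functions toy_kmers and toy_kmers_iter below are hypothetical stand-ins written for this illustration and are not part of indexing.py.

```python
from collections import defaultdict


def toy_kmers(seq, k):
    """Eager: build and return every position -> hash pair at once."""
    return {i: hash(seq[i:i + k]) for i in range(len(seq) - k + 1)}


def toy_kmers_iter(seq, k):
    """Lazy: yield one (position, hash) pair at a time."""
    for i in range(len(seq) - k + 1):
        yield i, hash(seq[i:i + k])


seq = "ACGTACGTAC"
index = defaultdict(list)
# Only one (position, hash) pair is held by the generator at any time.
# Note: the built-in hash() of a str varies between Python processes
# unless PYTHONHASHSEED is set, so print positions rather than hash values.
for pos, h in toy_kmers_iter(seq, 4):
    index[h].append(pos)

print(sorted(index.values()))  # [[0, 4], [1, 5], [2, 6], [3]]
```

Both versions produce the same pairs; the generator simply avoids materializing all of them before the caller starts consuming them.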

StrobeMap (C++)

Installation

You can acquire precompiled binaries for Linux and macOS from here. For example, for Linux, simply do

wget https://github.com/ksahlin/strobemers/raw/main/strobemers_cpp/binaries/Linux/StrobeMap-0.0.2
chmod +x StrobeMap-0.0.2
./StrobeMap-0.0.2  # test program

If you want to compile from source, you need a recent g++ and zlib installed. Then do the following:

git clone https://github.com/ksahlin/strobemers
cd strobemers/strobemers_cpp/

# Needs a newer g++ version. Tested with version 8 and upwards.

g++ -std=c++14 main.cpp index.cpp -lz -fopenmp -o StrobeMap -O3 -mavx2

If zlib is not already installed on your system, it can be installed through, e.g., conda by

conda install -c anaconda zlib

If you don't have conda, download and install it from here.

Usage

$ ./StrobeMap 

StrobeMap VERSION 0.0.2

StrobeMap [options] <references.fasta> <queries.fast[a/q]>
options:
  -n INT number of strobes [2]
  -k INT strobe length, limited to 32 [20]
  -v INT strobe w_min offset [k+1]
  -w INT strobe w_max offset [70]
  -t INT number of threads [3]
  -o name of output tsv-file [output.tsv]
  -c Choice of protocol to use; kmers, minstrobes, hybridstrobes, randstrobes [randstrobes]. 
  -s Split output into one file per thread and forward/reverse complement mappings. 
     This option is used to generate format compatible with uLTRA long-read RNA aligner and requires 
     option -o to be specified as a folder path to uLTRA output directory, e.g., -o /my/path/to/uLTRA_output/

# randstrobes (3,30,31,60)
StrobeMap -k 30 -n 3 -v 31 -w 60 -c randstrobes -o mapped.tsv  ref.fa query.fa

Common errors when installing from source

If you have zlib installed, and the zlib.h file is in folder /path/to/zlib/include and the libz.so file in /path/to/zlib/lib but you get

main.cpp:12:10: fatal error: zlib.h: No such file or directory
 #include <zlib.h>
          ^~~~~~~~
compilation terminated.

add -I/path/to/zlib/include -L/path/to/zlib/lib to the compile command, that is

g++ -std=c++14 -I/path/to/zlib/include -L/path/to/zlib/lib main.cpp index.cpp -lz -fopenmp -o StrobeMap -O3 -mavx2

StrobeMap (Python)

StrobeMap implements order 2 and 3 hybridstrobes (default), randstrobes, minstrobes, as well as kmers. The tool produces NAMs (Non-overlapping Approximate Matches; see explanation in preprint) for both strobemers and kmers. Test data is found in the folder data in this repository. Here are some example uses:

# Generate hybridstrobe matches (hybridstrobe parametrization (2,15,20,70)) 
# between ONT SIRV reads and the true reference sequences

./StrobeMap --queries data/sirv_transcripts.fasta \
           --references data/ONT_sirv_cDNA_seqs.fasta \
           --outfolder strobemer_output/  --k 15 \
           --strobe_w_min_offset 20 --strobe_w_max_offset 70


# Generate kmer matches (k=30) 
# between ONT SIRV reads and the true reference sequences

./StrobeMap --queries data/sirv_transcripts.fasta \
           --references data/ONT_sirv_cDNA_seqs.fasta \
           --outfolder kmer_output/  --k 30 --kmer_index

# Reads vs reads matching using randstrobes

./StrobeMap --queries data/ONT_sirv_cDNA_seqs.fasta \
           --references data/ONT_sirv_cDNA_seqs.fasta \
           --outfolder strobemer_output/ --k 15 \
           --strobe_w_min_offset 20 --strobe_w_max_offset 70 \
           --randstrobe_index

Minstrobes take the same parameters as hybridstrobes and randstrobes but are invoked with the parameter --minstrobe_index.

Output

The output is a file matches.tsv in the output folder. You can set a custom outfile name with the parameter --prefix. The output is a tab-separated file in the same format as MUMmer, with identical fields except the last one, which is the approximate match length on the reference sequence instead of what MUMmer produces:

>query_accession
ref_id  ref_pos query_pos   match_length_on_reference
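A minimal Python sketch for reading this format is given below. The function name parse_matches is hypothetical, and it assumes that reference IDs contain no whitespace and that every record follows a header line starting with ">".

```python
def parse_matches(lines):
    """Group match records by query accession.

    Returns a dict mapping each query accession to a list of
    (ref_id, ref_pos, query_pos, match_length_on_reference) tuples.
    """
    matches = {}
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:]
            matches.setdefault(current, [])
        else:
            if current is None:
                raise ValueError("match record found before any query header")
            ref_id, ref_pos, query_pos, length = line.split()
            matches[current].append((ref_id, int(ref_pos), int(query_pos), int(length)))
    return matches


example = (
    ">56:954|a23755a1-d138-489e-8efb-f119e679daf4\n"
    "SIRV509\t3\t3\t515\n"
    "SIRV509\t520\t529\t214\n"
)
result = parse_matches(example.splitlines())
print(result)
```

To read a real output file, pass an open file handle instead of the example lines, e.g. parse_matches(open("matches.tsv")).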

A small example output from aligning SIRV reads to transcripts (from the commands above) follows; it also highlights the strength of strobemers compared to k-mers. While k-mers can give a more nuanced differentiation (compare the read hits to SIRV606 and SIRV616), both sequences are good candidates for downstream processing. In this small example, the strobemers produce fewer hits, and thus less output to post-process, e.g., for downstream clustering/alignment/mapping. Note that randstrobe hit positions are currently not deterministic because the hash seed is set at each new Python instantiation. I will fix the hash seed in future implementations.

Randstrobes (2,15,20,70)

>41:650|d00e6247-9de6-485c-9b44-806023c51f13
SIRV606 35      92      487
SIRV616 35      92      473
>56:954|a23755a1-d138-489e-8efb-f119e679daf4
SIRV509 3       3       515
SIRV509 520     529     214
SIRV509 762     767     121
>106:777|0f79c12f-efed-4548-8fcc-49657f97a126
SIRV404 53      131     535

kmers (k=30)

>41:650|d00e6247-9de6-485c-9b44-806023c51f13
SIRV606 33      90      46
SIRV606 92      150     125
SIRV606 219     275     81
SIRV606 349     408     70
SIRV606 420     479     47
SIRV606 481     540     42
SIRV616 33      90      46
SIRV616 92      150     125
SIRV616 219     275     81
SIRV616 349     408     60
SIRV616 409     482     44
SIRV616 467     540     42
>56:954|a23755a1-d138-489e-8efb-f119e679daf4
SIRV509 68      72      141
SIRV509 230     233     100
SIRV509 331     335     105
SIRV509 435     442     40
SIRV509 475     483     36
SIRV509 579     585     41
SIRV509 621     627     46
SIRV509 695     701     44
SIRV509 812     815     53
>106:777|0f79c12f-efed-4548-8fcc-49657f97a126
SIRV404 53      131     58
SIRV404 128     208     127
SIRV404 283     364     30
SIRV404 422     494     142

CREDITS

Kristoffer Sahlin, Strobemers: an alternative to k-mers for sequence comparison, bioRxiv 2021.01.28.428549; doi: https://doi.org/10.1101/2021.01.28.428549

Preprint found here
