BIGtigr / edyeet

base-accurate DNA sequence alignments using edlib and mashmap2

edyeet

edyeet is a fork of MashMap that implements base-level alignment using edlib. It completes an alignment module in MashMap and extends it to enable multithreaded operation. A single command-line interface simplifies usage. The PAF output format is harmonized with that of minimap2 and has been validated as input to seqwish.

process

Each query sequence is broken into pieces defined by -s[N], --segment-length=[N]. These segments are then mapped using MashMap's sliding minhash mapping algorithm and subsequent filtering steps. To reduce memory, a temporary file is used to store initial mappings. Each mapping location is then used as a target for alignment using edlib.
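The segmentation step can be pictured with a minimal Python sketch. This is not edyeet's implementation (edyeet is written in C++); the function name `split_into_segments` is hypothetical, and the handling of a trailing piece shorter than the segment length is an assumption, since the text above does not say how edyeet treats it.

```python
from typing import Iterator, Tuple


def split_into_segments(seq: str, segment_length: int) -> Iterator[Tuple[int, str]]:
    """Yield (offset, piece) pairs covering seq in consecutive windows."""
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    for start in range(0, len(seq), segment_length):
        # Assumption: a trailing piece shorter than segment_length is kept.
        # edyeet itself may drop or re-anchor the tail; the README does not say.
        yield start, seq[start:start + segment_length]


if __name__ == "__main__":
    # Toy query; real segment lengths are far larger than this.
    for offset, piece in split_into_segments("ACGTACGTAC", 4):
        print(offset, piece)
```

Each piece would then be mapped independently, which is why the segment length bounds the length of the resulting alignments.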

The resulting alignments always contain extended CIGARs in the cg:Z:* tag. Approximate mapping (equivalent to MashMap) can be obtained with -m, --approx-map.
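For downstream scripts, the extended CIGAR can be read straight from the cg:Z: tag of each PAF record. The sketch below assumes the standard PAF layout (12 mandatory tab-separated columns followed by optional tags) and the extended CIGAR operators `=` and `X`. The helper names are hypothetical, and the identity it computes (matches over all alignment columns) is one common definition, not necessarily the one edyeet applies internally.

```python
import re
from typing import Dict, Optional

# One CIGAR operation: a run length followed by an operator character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def extract_cigar(paf_line: str) -> Optional[str]:
    """Return the CIGAR string from the cg:Z: tag, or None if it is absent."""
    fields = paf_line.rstrip("\n").split("\t")
    for tag in fields[12:]:  # optional tags follow the 12 mandatory columns
        if tag.startswith("cg:Z:"):
            return tag[len("cg:Z:"):]
    return None


def cigar_op_totals(cigar: str) -> Dict[str, int]:
    """Sum the run lengths of each CIGAR operator."""
    totals: Dict[str, int] = {}
    for length, op in CIGAR_OP.findall(cigar):
        totals[op] = totals.get(op, 0) + int(length)
    return totals


def cigar_identity(cigar: str) -> float:
    """Matches divided by matches + mismatches + inserted + deleted bases."""
    totals = cigar_op_totals(cigar)
    matches = totals.get("=", 0)
    columns = matches + totals.get("X", 0) + totals.get("I", 0) + totals.get("D", 0)
    return matches / columns if columns else 0.0


if __name__ == "__main__":
    # Made-up record for illustration only.
    line = "q1\t100\t0\t10\t+\tt1\t200\t5\t15\t9\t10\t60\tcg:Z:5=1X4="
    cigar = extract_cigar(line)
    print(cigar, cigar_identity(cigar))
```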

Mapping merging is disabled by default, as aligning merged approximate mappings with edlib under reasonable identity bounds can generate very long runtimes. However, merging can be useful in some settings and is enabled with -M, --merge-mappings.

Sketching, mapping, and alignment are all run in parallel using a configurable number of threads. The number of threads must be set manually, using -t, and defaults to 1.

usage

edyeet has been developed to accelerate the alignment step in variation graph induction (the first step in the seqwish / smoothxg pipeline). Suitable default settings are provided for this purpose.

Four parameters shape the length, number, and identity of the resulting mappings:

  • -s[N], --segment-length=[N] is the length of the mapped and aligned segment
  • -p[%], --map-pct-id=[%] is the percentage identity minimum in the mapping step
  • -n[N], --n-secondary=[N] is the maximum number of mappings (and alignments) to report for each segment
  • -a[%], --align-pct-id=[%] defines the minimum percentage identity allowed in the alignment step

Together, these settings allow us to precisely define an alignment space to consider. During all-to-all mapping, -X can additionally help us by removing self mappings from the reported set.
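To illustrate what removing self mappings means for the output, here is a small Python filter over PAF lines. It is a post-hoc approximation rather than a description of how -X works inside edyeet: it assumes a self mapping is a record whose query name (column 1) equals its target name (column 6), and the function name `drop_self_mappings` is hypothetical.

```python
from typing import Iterable, Iterator


def drop_self_mappings(paf_lines: Iterable[str]) -> Iterator[str]:
    """Yield PAF lines whose query and target sequence names differ."""
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record; skip it
        query_name, target_name = fields[0], fields[5]
        # Assumption: "self mapping" means query and target share a name.
        if query_name == target_name:
            continue
        yield line


if __name__ == "__main__":
    # Made-up records for illustration only.
    records = [
        "seqA\t100\t0\t100\t+\tseqA\t100\t0\t100\t100\t100\t60",
        "seqA\t100\t0\t100\t+\tseqB\t120\t10\t110\t95\t100\t60",
    ]
    for kept in drop_self_mappings(records):
        print(kept)
```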

examples

Map a set of query sequences against a reference genome:

edyeet reference.fa query.fa >aln.paf

Set a longer segment length to reduce spurious alignments:

edyeet -s 50000 reference.fa query.fa >aln.paf

Map a set of sequences against itself, excluding self mappings:

edyeet -X query.fa query.fa >aln.paf
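When edyeet is driven from a script, the commands above can be assembled programmatically. The following Python sketch uses only the options named in this README (-s, -p, -n, -a, -t, -X). The helper `build_edyeet_command` is hypothetical, and it assumes that every option accepts a space-separated value in the same way as -s in the examples above.

```python
from typing import List, Optional


def build_edyeet_command(
    reference: str,
    query: str,
    segment_length: Optional[int] = None,
    map_pct_id: Optional[float] = None,
    n_secondary: Optional[int] = None,
    align_pct_id: Optional[float] = None,
    threads: Optional[int] = None,
    skip_self: bool = False,
) -> List[str]:
    """Build an argument list; unset options are left to edyeet's defaults."""
    cmd = ["edyeet"]
    options = [
        ("-s", segment_length),
        ("-p", map_pct_id),
        ("-n", n_secondary),
        ("-a", align_pct_id),
        ("-t", threads),
    ]
    for flag, value in options:
        if value is not None:
            cmd += [flag, str(value)]
    if skip_self:
        cmd.append("-X")
    # Positional order follows the examples: reference first, then query.
    cmd += [reference, query]
    return cmd


if __name__ == "__main__":
    cmd = build_edyeet_command("reference.fa", "query.fa", segment_length=50000, threads=4)
    print(" ".join(cmd))
    # To run it and capture the PAF output (requires edyeet on PATH):
    #   import subprocess
    #   with open("aln.paf", "w") as out:
    #       subprocess.run(cmd, stdout=out, check=True)
```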

installation

Follow INSTALL.txt to compile and install edyeet.
