zhangrengang/minimap2

Getting Started

git clone https://github.com/lh3/minimap2
cd minimap2 && make
# long reads against a reference genome
./minimap2 -a test/MT-human.fa test/MT-orang.fa > test.sam
# create an index first and then map
./minimap2 -d MT-human.mmi test/MT-human.fa
./minimap2 -a MT-human.mmi test/MT-orang.fa > test.sam
# long-read overlap (no test data)
./minimap2 -x ava-pb your-reads.fa your-reads.fa > overlaps.paf
# spliced alignment (no test data)
./minimap2 -ax splice ref.fa rna-seq-reads.fa > spliced.sam
# man page for detailed command line options
man ./minimap2.1

Getting Started
Users' Guide
Developers' Guide
Limitations

Users' Guide

Minimap2 is a versatile sequence alignment program that aligns DNA or mRNA sequences against a large reference database. Typical use cases include: (1) mapping PacBio or Oxford Nanopore genomic reads to the human genome; (2) finding overlaps between long reads with error rate up to ~15%; (3) splice-aware alignment of PacBio Iso-Seq or Nanopore cDNA or Direct RNA reads against a reference genome; (4) aligning Illumina single- or paired-end reads; (5) assembly-to-assembly alignment; (6) full-genome alignment between two closely related species with divergence below ~15%.

For ~10kb noisy read sequences, minimap2 is tens of times faster than mainstream long-read mappers such as BLASR, BWA-MEM, NGMLR and GMAP. It is more accurate on simulated long reads and produces biologically meaningful alignments ready for downstream analyses. For >100bp Illumina short reads, minimap2 is three times as fast as BWA-MEM and Bowtie2, and as accurate on simulated data. Detailed evaluations are available from the minimap2 preprint.

Installation

Minimap2 only works on x86-64 CPUs. You can acquire precompiled binaries from the release page with:

wget --no-check-certificate -O- https://github.com/lh3/minimap2/releases/download/v2.2/minimap2-2.2_x64-linux.tar.bz2 \
  | tar -jxvf -
./minimap2-2.2_x64-linux/minimap2

If you want to compile from the source, you need to have a C compiler, GNU make and zlib development files installed. Then type make in the source code directory to compile. If you see compilation errors, try make sse2only=1 to disable SSE4 code, which will make minimap2 slightly slower.

General usage

Without any options, minimap2 takes a reference database and a query sequence file as input and produces approximate mappings, without base-level alignment (i.e. no CIGAR), in the PAF format:

minimap2 ref.fa query.fq > approx-mapping.paf

You can ask minimap2 to generate CIGAR at the cg tag of PAF with:

minimap2 -c ref.fa query.fq > alignment.paf

or to output alignments in the SAM format:

minimap2 -a ref.fa query.fq > alignment.sam
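
For downstream scripting, the PAF output of the first two commands can be read with a few lines of Python. The following is a minimal sketch, assuming the standard 12 mandatory tab-separated PAF columns followed by optional SAM-style `key:type:value` tags. The names `PafRecord` and `parse_paf_line` are hypothetical helpers and not part of minimap2, and the example line is made up for illustration.

```python
from typing import Dict, NamedTuple


class PafRecord(NamedTuple):
    qname: str      # query sequence name
    qlen: int       # query sequence length
    qstart: int     # query start (0-based)
    qend: int       # query end
    strand: str     # "+" or "-"
    tname: str      # target sequence name
    tlen: int       # target sequence length
    tstart: int     # target start (0-based)
    tend: int       # target end
    matches: int    # number of matching bases
    block_len: int  # alignment block length
    mapq: int       # mapping quality
    tags: Dict[str, str]  # optional tags, values kept as strings


def parse_paf_line(line: str) -> PafRecord:
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 12:
        raise ValueError("a PAF line needs at least 12 columns")
    tags = {}
    for field in cols[12:]:
        # Tags look like cg:Z:980M; split only twice so the value may contain ':'
        key, _type, value = field.split(":", 2)
        tags[key] = value
    return PafRecord(
        cols[0], int(cols[1]), int(cols[2]), int(cols[3]), cols[4],
        cols[5], int(cols[6]), int(cols[7]), int(cols[8]),
        int(cols[9]), int(cols[10]), int(cols[11]), tags,
    )


# Made-up example line, as minimap2 -c might write it
example = "read1\t1000\t10\t990\t+\tchr1\t50000\t2010\t2990\t950\t980\t60\tcg:Z:980M"
record = parse_paf_line(example)
print(record.tname, record.tstart, record.tend, record.tags.get("cg"))
```

Without -c the `cg` tag is simply absent, so `record.tags.get("cg")` returns `None` for approximate mappings.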

Minimap2 seamlessly works with gzip'd FASTA and FASTQ formats as input. You don't need to convert between FASTA and FASTQ or decompress gzip'd files first.

For the human reference genome, minimap2 takes a few minutes to generate a minimizer index for the reference before mapping. To reduce indexing time, you can optionally save the index with option -d and replace the reference sequence file with the index file on the minimap2 command line:

minimap2 -d ref.mmi ref.fa                     # indexing
minimap2 -a ref.mmi reads.fq > alignment.sam   # alignment

Note that once you build the index, indexing parameters such as -k, -w, -H and -I can't be changed during mapping. If you run minimap2 on different data types, you will probably need to keep multiple indexes generated with different parameters. This makes minimap2 different from BWA, which always uses the same index regardless of query data types.
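
One way to keep several indexes apart is to encode the indexing parameters in the index file name. The sketch below only builds the command lines and does not run them. The helper `index_command`, the file naming scheme and the parameter values are all hypothetical choices for illustration; only the options -k, -w, -H and -d come from this document.

```python
def index_command(ref, k, w, hpc=False, exe="minimap2"):
    """Return an argv list that would build an index named after its parameters."""
    # Hypothetical naming scheme: ref.fa.k15w10.mmi, ref.fa.k19w10.hpc.mmi, ...
    out = f"{ref}.k{k}w{w}{'.hpc' if hpc else ''}.mmi"
    cmd = [exe, "-k", str(k), "-w", str(w)]
    if hpc:
        cmd.append("-H")  # homopolymer-compressed minimizers
    cmd += ["-d", out, ref]
    return cmd


# Illustrative parameter values only; choose them for your own data types.
for args in [(15, 10, False), (19, 10, True)]:
    print(" ".join(index_command("ref.fa", *args)))
```

The resulting lists can be passed to `subprocess.run` once the parameters have been checked against the man page.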

Use cases

Minimap2 uses the same base algorithm for all applications. However, due to the different data types it supports (e.g. short vs long reads; DNA vs mRNA reads), minimap2 needs to be tuned for optimal performance and accuracy. It is usually recommended to choose a preset with option -x, which sets multiple parameters at the same time. The default setting is the same as map-ont.

Map long noisy genomic reads

minimap2 -ax map-pb  ref.fa pacbio-reads.fq > aln.sam   # for PacBio subreads
minimap2 -ax map-ont ref.fa ont-reads.fq > aln.sam      # for Oxford Nanopore reads

The difference between map-pb and map-ont is that map-pb uses homopolymer-compressed (HPC) minimizers as seeds, while map-ont uses ordinary minimizers as seeds. Empirical evaluation suggests HPC minimizers improve performance and sensitivity when aligning PacBio reads, but hurt when aligning Nanopore reads.
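
To illustrate what homopolymer compression means, the sketch below collapses every run of identical bases into a single base. This is a plain string illustration under the assumption that compression happens before k-mers are taken; minimap2 applies the idea inside its own minimizer code, and the function name `hpc` is hypothetical.

```python
from itertools import groupby


def hpc(seq: str) -> str:
    """Collapse each run of identical bases into one base."""
    return "".join(base for base, _run in groupby(seq))


# Two sequences that differ only in the length of a homopolymer run
# become identical after compression.
print(hpc("GATTTACCA"))  # GATACA
print(hpc("GATTACCA"))   # GATACA
```

Seeds taken from the compressed sequence are therefore unaffected by errors in homopolymer run length.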

Map long mRNA/cDNA reads

minimap2 -ax splice ref.fa spliced.fq > aln.sam      # strand unknown
minimap2 -ax splice -uf ref.fa spliced.fq > aln.sam  # assuming transcript strand

This command line has been tested on PacBio Iso-Seq reads and Nanopore 2D cDNA reads, and has been shown by others to work with Nanopore 1D Direct RNA reads. Like typical RNA-seq mappers, minimap2 represents an intron with the N CIGAR operator. For spliced reads, minimap2 will try to infer the strand relative to the transcript and may write the strand to the ts SAM/PAF tag.
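
Because introns appear as N operators, they can be recovered from the CIGAR string and the alignment start. The following is a minimal sketch, assuming a 0-based reference start and standard SAM CIGAR operators; the helper `introns` is hypothetical and the CIGAR string is made up.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operators that advance the reference position


def introns(ref_start: int, cigar: str):
    """Return (start, end) reference intervals, half-open, for each N operator."""
    pos = ref_start
    found = []
    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "N":
            found.append((pos, pos + n))
        if op in REF_CONSUMING:
            pos += n
    return found


print(introns(100, "50M1000N30M2D20M500N40M"))
# [(150, 1150), (1202, 1702)]
```

Insertions and soft clips do not move the reference position, which is why only the operators in `REF_CONSUMING` are added to `pos`.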

Find overlaps between long reads

minimap2 -x ava-pb  reads.fq reads.fq > ovlp.paf    # PacBio read overlap
minimap2 -x ava-ont reads.fq reads.fq > ovlp.paf    # Oxford Nanopore read overlap

Similarly, ava-pb uses HPC minimizers while ava-ont uses ordinary minimizers. It is usually not recommended to perform base-level alignment in the overlapping mode because it is slow and may produce false positive overlaps. However, if performance is not a concern, you may try to add -a or -c anyway.

Map short accurate genomic reads

minimap2 -ax sr ref.fa reads-se.fq > aln.sam           # single-end alignment
minimap2 -ax sr ref.fa read1.fq read2.fq > aln.sam     # paired-end alignment
minimap2 -ax sr ref.fa reads-interleaved.fq > aln.sam  # paired-end alignment

When two read files are specified, minimap2 reads from each file in turn and merges them into an interleaved stream internally. Two reads are considered to be paired if they are adjacent in the input stream and have the same name (with the /[0-9] suffix trimmed if present). Single- and paired-end reads can be mixed.
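
The pairing rule can be sketched on a list of read names. This is an illustration of the rule as described above, not minimap2's actual implementation; `base_name` and `group_pairs` are hypothetical helpers.

```python
import re

_SUFFIX = re.compile(r"/[0-9]$")


def base_name(name: str) -> str:
    """Trim a trailing /[0-9] suffix if present."""
    return _SUFFIX.sub("", name)


def group_pairs(names):
    """Group adjacent reads with the same base name into pairs."""
    groups = []
    i = 0
    while i < len(names):
        if i + 1 < len(names) and base_name(names[i]) == base_name(names[i + 1]):
            groups.append((names[i], names[i + 1]))
            i += 2
        else:
            groups.append((names[i],))  # treated as single-end
            i += 1
    return groups


print(group_pairs(["r1/1", "r1/2", "r2", "r3/1", "r3/2"]))
# [('r1/1', 'r1/2'), ('r2',), ('r3/1', 'r3/2')]
```

The unpaired read `r2` in the middle of the stream shows how single- and paired-end reads can be mixed.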

Minimap2 does not work well with short spliced reads. There are many capable RNA-seq mappers for short reads.

Full genome/assembly alignment

minimap2 -ax asm5 ref.fa asm.fa > aln.sam       # assembly to assembly/ref alignment

For cross-species full-genome alignment, the scoring system needs to be tuned according to the sequence divergence.

Algorithm overview

In the following, minimap2 command line options are written with a leading dash, with default values in square brackets. The description may help you tune minimap2 parameters.

1. Read -I [=4G] reference bases, extract (-k,-w)-minimizers and index them in a hash table.
2. Read -K [=200M] query bases. For each query sequence, do steps 3 through 7:
3. For each (-k,-w)-minimizer on the query, check against the reference index. If a reference minimizer is not among the top -f [=2e-4] fraction of most frequent minimizers, collect its occurrences in the reference, which are called seeds.
4. Sort seeds by position in the reference. Chain them with dynamic programming. Each chain represents a potential mapping. For read overlapping, report all chains and then go to step 8. For reference mapping, do steps 5 through 7:
5. Let P be the set of primary mappings, which is an empty set initially. For each chain from the best to the worst according to their chaining scores: if on the query, the chain overlaps with a chain in P by --mask-level [=0.5] or higher fraction of the shorter chain, mark the chain as secondary to the chain in P; otherwise, add the chain to P.
6. Retain all primary mappings. Also retain up to -N [=5] top secondary mappings if their chaining scores are higher than -p [=0.8] of their corresponding primary mappings.
7. If alignment is requested, filter out an internal seed if it potentially leads to both a long insertion and a long deletion. Extend from the left-most seed. Perform global alignments between internal seeds. Split the chain if the cumulative score along the global alignment drops by -z [=400], disregarding long gaps. Extend from the right-most seed. Output chains and their alignments.
8. If there are more query sequences in the input, go to step 2 until no more queries are left.
9. If there are more reference sequences, reopen the query file from the start and go to step 1; otherwise stop.
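
Three of the steps above (minimizer extraction, seed collection and primary/secondary marking) can be sketched in a few lines of Python. This is a toy illustration under loud simplifications: k-mers are ordered lexicographically instead of by a hash, only the forward strand is considered, no frequency filter is applied, and chains are given as plain (score, query_start, query_end) tuples. All function names are hypothetical and none of this is minimap2's actual code.

```python
from collections import defaultdict


def minimizers(seq, k, w):
    """Pick the smallest k-mer in every window of w consecutive k-mers."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        # Ties are broken by position; lexicographic order is a simplification.
        best = min(range(start, start + w), key=lambda i: (kmers[i], i))
        picked.add((best, kmers[best]))
    return sorted(picked)


def build_index(ref, k, w):
    """Map each reference minimizer to its positions (cf. step 1)."""
    index = defaultdict(list)
    for pos, kmer in minimizers(ref, k, w):
        index[kmer].append(pos)
    return index


def collect_seeds(index, query, k, w):
    """Return (ref_pos, query_pos) seeds sorted by reference position (cf. steps 3 and 4)."""
    seeds = [(rpos, qpos)
             for qpos, kmer in minimizers(query, k, w)
             for rpos in index.get(kmer, [])]
    return sorted(seeds)


def mark_primary(chains, mask_level=0.5):
    """Mark each (score, qstart, qend) chain as primary or secondary (cf. step 5).

    Returns (chain, parent) tuples; parent is None for primary chains.
    """
    primary, result = [], []
    for chain in sorted(chains, key=lambda c: -c[0]):
        _score, qs, qe = chain
        parent = None
        for p in primary:
            overlap = min(qe, p[2]) - max(qs, p[1])
            shorter = min(qe - qs, p[2] - p[1])
            if overlap > 0 and overlap >= mask_level * shorter:
                parent = p
                break
        if parent is None:
            primary.append(chain)
        result.append((chain, parent))
    return result


index = build_index("ACGTACGT", k=3, w=2)
print(collect_seeds(index, "GTACG", k=3, w=2))
# [(0, 2), (2, 0), (4, 2)]
print(mark_primary([(100, 0, 1000), (80, 100, 900), (50, 950, 1500)]))
```

In the seed example, (2, 0) and (4, 2) lie on the same diagonal and correspond to the true placement of the query at reference position 2, while (0, 2) comes from the repeated ACG. In the chain example, the second chain lies entirely within the first on the query and becomes secondary, whereas the third overlaps the first by only 50 bases and stays primary.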

Cite minimap2

If you use minimap2 in your work, please consider citing:

Li, H. (2017). Minimap2: fast pairwise alignment for long nucleotide sequences. arXiv:1708.01492

Developers' Guide

Minimap2 is not only a command line tool, but also a programming library. It provides C APIs to build/load an index and to align sequences against the index. File example.c demonstrates typical uses of the C APIs. Header file minimap.h gives more detailed API documentation. Minimap2 aims to keep the APIs in this header stable. File mmpriv.h contains additional private APIs which may change frequently.

This repository also provides Python bindings to a subset of C APIs. File python/README.rst gives the full documentation; python/minimap2.py shows an example. This Python extension, mappy, is also available from PyPI via pip install mappy or from BioConda via conda install -c bioconda mappy.
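
A minimal sketch of using mappy is shown below. It assumes the interface documented for recent mappy releases (`mp.Aligner`, `Aligner.map`, `mp.fastx_read` and the hit attributes `ctg`, `r_st`, `r_en`, `strand`, `mapq`), which may differ from the version in this repository; check python/README.rst first. The helpers `map_reads` and `load_and_map` are hypothetical, and mappy is imported lazily so that `map_reads` can be tried with any object that provides a compatible `map` method.

```python
def map_reads(aligner, reads):
    """Yield one tuple per hit for each (name, sequence) pair in reads."""
    for name, seq in reads:
        for hit in aligner.map(seq):
            yield (name, hit.ctg, hit.r_st, hit.r_en, hit.strand, hit.mapq)


def load_and_map(ref_path, reads_path, preset="map-ont"):
    """Build or load an index with mappy and map all reads in a FASTA/FASTQ file."""
    import mappy as mp  # assumption: installed via pip or BioConda

    aligner = mp.Aligner(ref_path, preset=preset)
    if not aligner:
        raise RuntimeError("failed to load or build the index")
    reads = ((name, seq) for name, seq, _qual in mp.fastx_read(reads_path))
    return list(map_reads(aligner, reads))
```

With mappy installed, `load_and_map("test/MT-human.fa", "test/MT-orang.fa")` would map the test sequence shipped with this repository; the preset should be chosen to match your data.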

Limitations

Minimap2 may produce suboptimal alignments through long low-complexity regions where seed positions may be suboptimal. This should not be a big concern because even the optimal alignment may be wrong in such regions.
Minimap2 requires SSE2 instructions to compile. It is possible to add non-SSE2 support, but it would make minimap2 several times slower.

In general, minimap2 is a young project with most code written since June 2017. It may have bugs and room for improvement. Bug reports and suggestions are warmly welcome.
