fasta c-plus-plus bioinformatics-tool contigs multifasta

unmulti - extract individual sequeunces from a fasta file

unmulti is a tool for splitting fasta files containing several sequences into many files containing just one sequence, or for extracting a list of sequences from the same file. The tool supports input compression in .gz, .bz2, or .xz formats, and output compression in .gz format.

Installation

Prebuilt binaries

Prebuilt binaries for generix linux_x86-64 are available from the releases page.

Compiling from source

Requirements

c++11 compliant compiler.
cmake v2.8.2 or newer.
git.
zlib
Internet access.

How-to

Clone the repository, enter the directory and run

mkdir build
cd build
cmake ..
make

this will download the necessary dependencies and compile the unmulti executable in build/bin/.

Usage

unmulti -f <input multifasta> -o <output directory (default: working directory)

Example

Split a multifasta

Running unmulti on an input file in.fasta with the following contents

>seq_1
AAACGT
>seq_2
GGGTAC

will produce files 0.fasta and 1.fasta in the output directory with contents

>seq_1
AAACGT

and

>seq_2
GGGTAC

Extract specific sequence(s)

Running unmulti -f in.fasta --extract seq_2 on the example input above will extract only the sequence starting at >seq_2. Multiple sequences can be supplied by delimiting them with ,. Running unmulti -f in.fasta --extract seq_2,seq_1 will extract both sequences from the example input.

Other options

The input file can be supplied compressed in the zlib/libbz2/liblzma format depending on what was supported on the machine that unmulti was compiled on. Adding the --compress toggle will compress the output files using zlib.

Adding the -t number_to_sequence.tsv argument will write a table linking the output filenames to their sequence names to the supplied argument. In the example above, running unmulti -f in.fasta -t number_to_sequence.tsv would produce the following file

0	seq_1
1	seq_2

If your sequeunces begin with some other character than '>', the --seq-start option can be used to change the character. For example, running unmulti -f in.fasta --seq-start @ would make unmulti compatible with a file in the following format

@read_1
CGCCTAC
+
GGFGGCD
@read_2
TGAGCCA
+
FFGFG=G

Accepted flags/parameters

unmulti accepts the following flags/parameters:

-f            Input multifasta.
-o            Output directory (default: working directory)
-t            Write a table linking the output filenames to sequence names to the argument filename.
--compress    Compress the output files with zlib (default: false)
--extract     Extract only the named sequence(s). Multiple sequences should be delimited by ','.
--seq-start   Sequence begin character (default: '>')

License

The source code from this project is subject to the terms of the MIT license. A copy of the MIT license is supplied with the project, or can be obtained at https://opensource.org/licenses/MIT.

About

Split a multifasta file into many fasta files containing a single sequence, or extract a list of sequences/contigs. Supports compressed input and output files.

fasta c-plus-plus bioinformatics-tool contigs multifasta

MIT License

Languages

Language:C++ 64.4%Language:CMake 35.6%