UST

UST is a bioinformatics tool for constructing a spectrum-preserving string set (SPSS) representation from sets of k-mers.

Note: This software has been subsumed by ESSCompress. To use UST, download ESSCompress and follow the UST instructions in the README.

Requirements

GCC >= 4.8 or a C++11 capable compiler

Quick start

To install, compile from source:

git clone https://github.com/medvedevgroup/UST
cd UST
make

After compiling, use

./ust -i [unitigs.fa] -k [kmer_size]

e.g.

./ust -i examples/k11.unitigs.fa -k 11

The important parameters are:

  • k [int] : The k-mer size that was used to generate the input, i.e. the length of the nodes of the node-centric de Bruijn graph.
  • i [input-file] : Unitigs file produced by BCALM2 in FASTA format.
  • a [0 or 1] : Default is 0. A value of 1 tells UST to preserve abundance. Use this option when the input file was generated with the -all-abundance-counts option of BCALM2.

The output is a FASTA file with the extension "ust.fa" in the working folder, which is the SPSS representation of the input. If the program is run with the option -a 1, an additional count file with the extension "ust.counts" is also generated.
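
As a sanity check on the output, one can confirm that the SPSS file contains the same k-mers as the unitigs it was built from. The following is a minimal Python sketch of such a check and is not part of UST. It assumes that k-mers are compared in canonical form (a k-mer and its reverse complement count as the same k-mer) and that k-mers containing characters other than A, C, G, T are skipped. The function names are hypothetical. It compares k-mer sets only and does not verify that each k-mer occurs exactly once in the SPSS.

```python
from typing import Iterable, List, Set

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_fasta(lines: Iterable[str]) -> List[str]:
    """Collect sequences from FASTA lines, joining wrapped sequence lines."""
    sequences: List[str] = []
    current: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line closes the previous record, if any.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line.upper())
    if current:
        sequences.append("".join(current))
    return sequences


def canonical_kmers(sequences: Iterable[str], k: int) -> Set[str]:
    """Return the set of canonical k-mers found in the given sequences."""
    kmers: Set[str] = set()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Assumption: windows with non-ACGT characters are ignored.
            if set(kmer) <= {"A", "C", "G", "T"}:
                kmers.add(min(kmer, reverse_complement(kmer)))
    return kmers


def same_spectrum(unitigs: Iterable[str], spss: Iterable[str], k: int) -> bool:
    """True if both sequence collections contain the same canonical k-mers."""
    return canonical_kmers(unitigs, k) == canonical_kmers(spss, k)
```

To use it on real files, pass an open file handle for each FASTA file to read_fasta and give the resulting lists to same_spectrum together with the same k that was passed to ust.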

Detailed Usage

In order to build an SPSS representation for your k-mer set, you must first run BCALM2 on your set of k-mers. BCALM2 will construct a set of unitigs. Those unitigs are then fed as input to ust, which outputs a FASTA file with the SPSS representation. Note that the k parameter to ust must match the -kmer-size used when running BCALM2.
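
The two steps can be scripted so that the same k is always passed to both tools. The sketch below is one possible way to assemble the command lines in Python. It only builds the argument lists and does not run anything. The function name, the default binary names "bcalm" and "./ust", and the use of BCALM2's -in option for the input file are assumptions that should be checked against your installation and the BCALM2 documentation. The unitigs file name is supplied by the caller, because the name BCALM2 gives its output is not specified here.

```python
from typing import List


def build_pipeline_commands(
    kmers_fasta: str,
    unitigs_fasta: str,
    k: int,
    keep_abundance: bool = False,
    bcalm_bin: str = "bcalm",   # assumption: BCALM2 binary is on PATH
    ust_bin: str = "./ust",     # as in the quick start example
) -> List[List[str]]:
    """Return [bcalm_command, ust_command], both using the same k."""
    if k < 1:
        raise ValueError("k must be a positive integer")

    # Assumption: "-in" is the BCALM2 option that names the input file.
    bcalm_cmd = [bcalm_bin, "-in", kmers_fasta, "-kmer-size", str(k)]
    ust_cmd = [ust_bin, "-i", unitigs_fasta, "-k", str(k)]

    if keep_abundance:
        # Abundance must be requested in both tools to be preserved.
        bcalm_cmd.append("-all-abundance-counts")
        ust_cmd += ["-a", "1"]

    return [bcalm_cmd, ust_cmd]
```

Each returned list can be passed to subprocess.run(cmd, check=True) in order, once the file names have been adjusted to match what BCALM2 actually writes.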

If you would like to store the data on disk in compressed form (like UST-Compress in our paper), you can then install and run MFCompress on the output of UST as follows:

MFCompressC mykmers.ust.fa

If you would like to build a membership data structure based on UST, then

  • Install bwtdisk and dbgfm.
  • Change the two variables "DBGFM_DIRECTORY" and "BWTDISK_DIRECTORY" in the script ust-fm.sh to point to the locations where dbgfm and bwtdisk are installed. Alternatively, you can add the path to both tools in your environment PATH variable and then modify the script accordingly.
  • Run ust-fm.sh as follows: ust-fm.sh mykmers.ust.fa

Citation

If using UST in your research, please cite

@inproceedings{RahmanMedvedevRECOMB20,
  author    = {Amatur Rahman and Paul Medvedev},
  title     = {Representation of $k$-mer sets using spectrum-preserving string sets},
  booktitle = {Research in Computational Molecular Biology - 24th Annual International Conference, {RECOMB} 2020, Padua, Italy, May 10-13, 2020, Proceedings},
  series    = {Lecture Notes in Computer Science},
  volume    = {12074},
  pages     = {152--168},
  publisher = {Springer},
  year      = {2020}
}

Note that the general notion of an SPSS was independently introduced under the name of simplitigs. Therefore, if citing this general notion, please also cite:

About

License: GNU General Public License v3.0
