dmlicht / FMAssembler

Assembler using an overlap graph and fm index for fun and education!

Geek Repo:Geek Repo

Github PK Tool:Github PK Tool

FMAssembler

A tool to efficiently assemble large datasets of overlapping strings (like genomes!). This project is a python implementation of an overlap graph based assembler that uses the FM-Index compressed data structure.

David Lichtenberg

hello [at] davidlichtenberg [dot] com

to run assembler:

python assembler.py /path/to/file_containing_newline_seperated_reads.txt

to run index on arbitrary text:

python fmindex.py /path/to/file_containing_text_to_index.txt

Overview

  • assembler.py Takes a set of reads from user and passes them to index. Then uses the index to query for overlap edges to pass to the overlap graph

  • fmindex.py builds compressed index of a given text and allows a user to search for occurences of the text. This uses pysuffix library for Karkainnen suffix sorting.

  • overlap_graph.py maintains data structure of edges and travelsal algorithm to return contigs implied by reads.

  • t_rank.py contains the checkpoint data structure for t_rank lookup with sublinear memory footprint (sublinear memory footprint not yet implemented - see below)

  • genome_splitter.py takes a genome sequences and cuts it up to simulate lots of reads for testing

  • bw_compressor.py (incomplete) contains testing algorithms for compressing the text, decompressing the text, and searching compressed text. (see below for what is incomplete)

  • ukkonen.py (incomplete) toy implementation of ukkonen's algorithm for suffix sorting.

Notes

suffix_array.py and tools.py taken directly from pysuffix library: https://code.google.com/p/pysuffix/ I take no credit for anything in those two files (except the comments saying that they aren't mine :P)

TODO time est

  • Run Length Encoding Burrows Wheeler Transform 20-30 m
  • sort seperator chars in burrows wheeler transform (hackish fix) 10-20 m
  • backwards search star (allows for any char matching in backwards search) 1-2 h
  • contig graph .5 - 2 h
  • more efficient transitive edge reduction 2+ h
  • read error correction not sure
  • fm-index merge algorithm not sure
  • re-implement fm index in c (port using cython) to save memory not sure

About

Assembler using an overlap graph and fm index for fun and education!


Languages

Language:Python 99.6%Language:Shell 0.4%