Flint

This is the main repository of the Flint project for Amazon Web Services. Flint is a metagenomics profiling pipeline that is built on top of the Apache Spark framework, and is designed for fast real-time profiling of metagenomic samples against a large collection of reference genomes. Flint takes advantage of Spark's built-in parallelism and streaming engine architecture to quickly map reads against a large reference collection of bacterial genomes.

Our computational framework is primarily implemented using the MapReduce model and is deployed on a cluster launched with the Elastic MapReduce (EMR) service offered by Amazon Web Services (AWS). The cluster consists of multiple commodity worker machines (computational nodes). In the configuration we currently use, each worker machine has 15 GB of RAM, 8 vCPUs (each vCPU is a hyperthread of a single Intel Xeon core), and 100 GB of EBS disk storage. The worker nodes operate in parallel, each aligning the input DNA sequencing reads against its own partitioned shard of the reference database; once the alignment step is complete, each worker node acts as a regular Spark executor node.
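To make the shard-and-merge idea concrete, here is a minimal, pure-Python sketch. It is not Flint's code and uses neither Spark nor Bowtie2: the function names (`align_to_shard`, `profile_reads`), the toy sequences, and the exact-substring "alignment" are all assumptions made for illustration. Each shard is processed independently (the map step), and the per-shard hit counts are then summed (the reduce step).

```python
from collections import Counter


def align_to_shard(reads, shard):
    """Toy stand-in for an aligner (hypothetical, not Bowtie2).

    Counts, for each genome in this shard, how many reads occur as an
    exact substring of the genome sequence.
    """
    hits = Counter()
    for read in reads:
        for genome_id, sequence in shard.items():
            if read in sequence:
                hits[genome_id] += 1
    return hits


def profile_reads(reads, shards):
    """Map every shard to its partial hit counts, then reduce by summing."""
    # Map step: each shard could be handled by a different worker node.
    partial_counts = [align_to_shard(reads, shard) for shard in shards]
    # Reduce step: merge the per-shard counts into one profile.
    return sum(partial_counts, Counter())


# Illustrative data only: two shards of a tiny made-up reference collection.
shards = [
    {"genome_A": "ACGTACGTGG", "genome_B": "TTGACCAGTA"},
    {"genome_C": "GGGACGTTTT"},
]
reads = ["ACGT", "CCAG", "GGG"]

profile = profile_reads(reads, shards)
```

Because each shard is aligned without reference to the others, the map step can run on as many workers as there are shards; only the small per-genome counts need to be brought together at the end.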

The current database for running Flint is version 41 from Ensembl Bacteria, but we are currently working on the latest version of RefSeq, which should be available this summer.

Publications

Valdes, Stebliankin, Narasimhan (2019), Large Scale Microbiome Profiling in the Cloud, ISMB 2019, in review.

How To Get Started

Download the code and follow the instructions on how to create an EMR cluster, set up the streaming source, and start Flint.
Instructions, as well as a manual and reference, can be found at the project’s website.

Communication

If you find a bug, open an issue and provide detailed steps to reproduce it reliably.
If you have a feature request, open an issue.
If you would like to contribute, please submit a pull request.

Requirements

Flint is designed to run on Apache Spark, but the current implementation is tuned for Amazon's Elastic MapReduce (EMR) service. The basic requirements for an EMR cluster are:

The basic requirements for the worker nodes are:

Bowtie2

Bowtie2 is required for the alignment step and must be installed on every worker node of the Spark cluster. See the Bowtie2 manual for more information.
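A quick way to confirm that a worker node can see the aligner is to look it up on the `PATH`. The sketch below is only a sanity check, not part of Flint; the helper name `bowtie2_path` is hypothetical, and it assumes the executable is installed under its usual name, `bowtie2`.

```python
import shutil


def bowtie2_path(executable="bowtie2"):
    """Return the full path of the executable, or None if it is not on PATH."""
    return shutil.which(executable)


location = bowtie2_path()
if location is None:
    print("bowtie2 was not found on PATH; install it before starting Flint.")
else:
    print(f"bowtie2 found at {location}")
```

The check would need to be run on each worker node, since finding Bowtie2 on one machine says nothing about the others.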

Python Packages

The remaining requirements are Python packages that Flint needs for a successful run. Please refer to each package's documentation for installation instructions.
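Before launching a run, it can help to check which packages are importable on a node. The following is a sketch only: the helper `missing_packages` is hypothetical, and the names in `required` are placeholders rather than Flint's actual dependency list, which should be taken from the project's documentation.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the top-level package names that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


# Placeholder names for illustration; substitute Flint's real requirements.
required = ["json", "flint_hypothetical_dependency"]

for name in missing_packages(required):
    print(f"Missing Python package: {name}")
```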

Contact

Contact Camilo Valdes for pull requests, bug reports, good jokes and coffee recipes.

Maintainers

Camilo Valdes

Collaborators

License

The software in this repository is available under the MIT License. See the LICENSE file for more information.

About

Main repository of the Flint project for Spark and Amazon EMR.

https://camilo-v.github.io/flint/
