michael-golfi / flint

Main repository of the Flint project for Spark and Amazon EMR.

Home Page: https://camilo-v.github.io/flint/

Flint

This is the main repository of the Flint project for Amazon Web Services. Flint is a metagenomics profiling pipeline that is built on top of the Apache Spark framework, and is designed for fast real-time profiling of metagenomic samples against a large collection of reference genomes. Flint takes advantage of Spark's built-in parallelism and streaming engine architecture to quickly map reads against a large reference collection of bacterial genomes.

Our computational framework is implemented primarily with the MapReduce model and deployed on a cluster launched with the Elastic MapReduce (EMR) service offered by Amazon Web Services (AWS). The cluster consists of multiple commodity worker machines (computational nodes). In the configuration we currently use, each worker machine has 15 GB of RAM, 8 vCPUs (each vCPU is a hyperthread of an Intel Xeon core), and 100 GB of EBS disk storage. The worker nodes run in parallel, each aligning the input DNA sequencing reads against its own partitioned shard of the reference database; once the alignment step is complete, each worker node acts as a regular Spark executor node.
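The shard-and-reduce pattern described above can be illustrated with a minimal, pure-Python sketch. This is not Flint's code: the shard contents, the function names `align_to_shard` and `profile`, and the use of exact substring search in place of a real aligner are all assumptions made for illustration, and a thread pool stands in for the Spark worker nodes.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Toy reference database split into shards, one per hypothetical worker.
SHARDS = [
    {"genome_A": "ACGTACGTGGCA", "genome_B": "TTGACCATGCAA"},
    {"genome_C": "GGGCATTACGAT"},
]


def align_to_shard(shard, reads):
    """Map step: emit (genome_id, 1) for every read found in a shard genome.

    Exact substring search is a stand-in for a real aligner such as Bowtie2.
    """
    hits = []
    for read in reads:
        for genome_id, sequence in shard.items():
            if read in sequence:
                hits.append((genome_id, 1))
    return hits


def profile(reads, shards):
    """Align the reads against every shard in parallel, then reduce by genome."""
    counts = Counter()
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard_hits = pool.map(lambda shard: align_to_shard(shard, reads), shards)
        # Reduce step: sum the per-shard hits into one count per genome.
        for hits in per_shard_hits:
            for genome_id, n in hits:
                counts[genome_id] += n
    return dict(counts)


reads = ["ACGT", "CATG", "TTAC", "GGCA"]
abundance = profile(reads, SHARDS)
print(abundance)
```

Every worker sees all of the reads but only its own slice of the reference, so the per-shard results can be combined with a simple sum per genome.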

The current database for running Flint is version 41 from Ensembl Bacteria, but we are currently working on the latest version of RefSeq, which should be available this summer.

Publications

Valdes, Stebliankin, Narasimhan (2019), Large Scale Microbiome Profiling in the Cloud, ISMB 2019, in review.

How To Get Started

Communication

  • If you found a bug, open an issue and please provide detailed steps to reliably reproduce it.
  • If you have a feature request, open an issue.
  • If you would like to contribute, please submit a pull request.

Requirements

Flint is designed to run on Apache Spark, but the current implementation is tuned for Amazon's Elastic MapReduce (EMR) service. The basic requirements for an EMR cluster are:

The basic requirements for the worker nodes are:

Bowtie2

Bowtie2 is required for the alignment step and must be installed on every worker node of the Spark cluster. See the Bowtie2 manual for more information.
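As a quick sanity check on a worker node, a short script along these lines can confirm that the aligner is reachable. It is only a sketch: it assumes the executable is named `bowtie2` and is on the `PATH`, and the helper name `find_executable` is hypothetical rather than part of Flint.

```python
import shutil


def find_executable(name="bowtie2"):
    """Return the full path of `name` if it is on PATH, otherwise None."""
    return shutil.which(name)


path = find_executable()
if path is None:
    print("bowtie2 not found on PATH; install it before running the alignment step")
else:
    print(f"bowtie2 found at {path}")
```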

Python Packages

The remaining requirements are Python packages that Flint needs for a successful run. Please refer to each package's documentation for installation instructions.

Contact

Contact Camilo Valdes for pull requests, bug reports, good jokes and coffee recipes.

Maintainers

Collaborators

License

The software in this repository is available under the MIT License. See the LICENSE file for more information.

Languages

Python 71.2%, Shell 26.5%, R 2.2%