zhouyangyu / scalable-NGS

Scalable NGS pipeline implemented in Spark environment

Scalable NGS pipeline on Dataproc

This project aims to address the data-volume problem in NGS pipelines, especially for whole-genome sequencing (WGS) and whole-exome sequencing (WES). It implements analysis tools built on Apache Spark to align, process, and analyze large NGS datasets, and it inherits Spark's benefits: scalability, elasticity, and high availability.

  • Implemented in Google Cloud Dataproc environment
  • Uses SparkBWA, GATK 4 beta, Cromwell Execution Engine
  • Speeds up end-to-end data processing by at least 3x (using a cluster of 5 worker nodes)

It also:

  • Imports and stores files on Google storage
  • Uses common analysis tools that have excellent documentation and support communities
  • Needs your improvements!

Installation

You need:

  • a Google Cloud account (it comes with $300 credit) or an AWS account
  • a Hadoop cluster running YARN, with Apache Spark 2.2 installed
  • every cluster node initialized with cluster_init.sh (installs Maven, SparkBWA, and GATK)
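
On Dataproc, the requirements above could be met with a single cluster-creation command. The following is a minimal sketch, not the project's documented procedure. It assumes cluster_init.sh is run as a Dataproc initialization action, and the cluster name, region, bucket, and image version are placeholders chosen for illustration (image version 1.2 is assumed here because it shipped Spark 2.2; check which images are currently available). By default the script only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: stage cluster_init.sh in a bucket and create a Dataproc cluster
# that runs it on every node. All values below are assumptions for illustration.
set -eu

CLUSTER_NAME="${CLUSTER_NAME:-ngs-cluster}"   # hypothetical cluster name
REGION="${REGION:-us-central1}"               # assumed region
BUCKET="${BUCKET:-gs://my-ngs-bucket}"        # hypothetical bucket for the init script
NUM_WORKERS="${NUM_WORKERS:-5}"               # same worker count as the 3x figure above
IMAGE_VERSION="${IMAGE_VERSION:-1.2}"         # assumed: Dataproc 1.2 images include Spark 2.2
DRY_RUN="${DRY_RUN:-1}"                       # 1 = print commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

create_cluster() {
  # Upload the init script so each node can fetch it at startup.
  run gsutil cp cluster_init.sh "$BUCKET/cluster_init.sh"

  # Create the cluster; the initialization action runs on all nodes.
  run gcloud dataproc clusters create "$CLUSTER_NAME" \
    --region "$REGION" \
    --image-version "$IMAGE_VERSION" \
    --num-workers "$NUM_WORKERS" \
    --initialization-actions "$BUCKET/cluster_init.sh"
}

create_cluster
```

Run it as is to review the commands, then set DRY_RUN=0 (and your own BUCKET and REGION) to execute them. Running the init script as an initialization action avoids logging in to each node by hand.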

Todos

  • Implement workflow in WDL and Cromwell
  • Integrate ADAM
  • Better documentation

Languages

Language:Shell 100.0%