Hail

Hail is an open-source, scalable framework for exploring and analyzing genomic data. Starting from genetic data in VCF, BGEN, or PLINK format, Hail can, for example:

load variant and sample annotations from text tables, JSON, VCF, VEP, and locus interval files
generate variant annotations like call rate, Hardy-Weinberg equilibrium p-value, and population-specific allele count
generate sample annotations like mean depth, imputed sex, and transition/transversion (Ti/Tv) ratio
generate new annotations from existing ones as well as genotypes, and use these to filter samples, variants, and genotypes
find Mendelian violations in trios, prune variants in linkage disequilibrium, analyze genetic similarity between samples via the GRM and IBD matrix, and compute sample scores and variant loadings using PCA
perform variant, gene-burden and eQTL association analyses using linear, logistic, and linear mixed regression, and estimate heritability
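
To make a few of the annotations above concrete, here is a minimal plain-Python sketch of call rate, Ti/Tv ratio, and a Hardy-Weinberg test statistic on toy data. It does not use Hail and is not Hail's API: the function names and the genotype encoding (0, 1, or 2 alternate alleles, None for a missing call) are assumptions made for illustration. The chi-square statistic is a simple stand-in; Hail's own Hardy-Weinberg p-value may be computed differently (for instance, with an exact test).

```python
from collections import Counter

# Hypothetical helpers for illustration only; none of these are Hail functions.
# Genotypes are coded as the number of alternate alleles (0, 1, 2); None = missing.

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def call_rate(genotypes):
    """Fraction of genotype calls that are non-missing."""
    if not genotypes:
        return 0.0
    called = sum(g is not None for g in genotypes)
    return called / len(genotypes)


def ti_tv_ratio(snps):
    """Ratio of transitions to transversions over (ref, alt) single-base pairs."""
    snps = list(snps)
    ti = sum((ref, alt) in TRANSITIONS for ref, alt in snps)
    tv = sum(ref != alt and (ref, alt) not in TRANSITIONS for ref, alt in snps)
    return ti / tv if tv else float("inf")


def hwe_chi_square(genotypes):
    """Chi-square statistic for deviation from Hardy-Weinberg proportions
    at a biallelic site, ignoring missing calls."""
    counts = Counter(g for g in genotypes if g is not None)
    n = sum(counts.values())
    if n == 0:
        return 0.0
    p = (2 * counts[0] + counts[1]) / (2 * n)  # reference allele frequency
    q = 1.0 - p
    expected = {0: n * p * p, 1: 2 * n * p * q, 2: n * q * q}
    return sum((counts[k] - e) ** 2 / e for k, e in expected.items() if e > 0)


if __name__ == "__main__":
    site = [0, 1, None, 2]
    print("call rate:", call_rate(site))
    print("Ti/Tv:", ti_tv_ratio([("A", "G"), ("C", "T"), ("A", "C")]))
    print("HWE chi-square:", hwe_chi_square([0] * 50 + [2] * 50))
```

On a single machine this is a loop over one site; the point of Hail is to run this kind of per-variant and per-sample computation across very large datasets without writing the distribution logic by hand.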

This functionality and more is exposed through Python and backed by distributed algorithms built on Apache Spark, so Hail can efficiently analyze gigabyte-scale data on a laptop or terabyte-scale data on a cluster without the need to manually split up data or manage job failures. Users can script pipelines or explore data interactively in Jupyter notebooks that move between Hail's methods for genomics, PySpark's scalable SQL and machine learning algorithms, and pandas, scikit-learn, and Matplotlib for results that fit on one machine. Hail also provides a flexible domain language for expressing complex quality control and analysis pipelines in concise, readable code.

The Hail project began in Fall 2015 to empower the worldwide genetics community to harness the flood of genomes to discover the biology of human disease. Hail has been used for dozens of major studies and is the core analysis platform of large-scale genomics efforts such as gnomAD.

Want to get involved in open-source development of methods or infrastructure? Check out the GitHub repo, chat with us in the Gitter dev room, and view our talks at Spark Summit East and Spark Summit West. Or come join us full-time!

Getting Started

To get started using Hail on your data or public data:

follow the installation instructions in Getting Started
check out the Overview, Tutorials, and Python API
chat with the Hail team in the Hail Gitter room

We encourage use of the Discussion Forum for user and dev support, feature requests, and sharing your Hail-powered science. Follow Hail on Twitter @hailgenetics. Please report any suspected bugs to GitHub issues.

Hail Team

The Hail team is embedded in the Neale lab at the Stanley Center for Psychiatric Research of the Broad Institute of MIT and Harvard and the Analytic and Translational Genetics Unit of Massachusetts General Hospital.

Contact the Hail team at hail@broadinstitute.org.

Citing Hail

If you use Hail for published work, please cite the software:

Hail, https://github.com/hail-is/hail

and either the forthcoming manuscript describing Hail (if possible):

Cotton Seed, Alex Bloemendal, Jonathan M Bloom, Jacqueline I Goldstein, Daniel King, Timothy Poterba, Benjamin M. Neale. Hail: An Open-Source Framework for Scalable Genetic Data Analysis. In preparation.

or the following paper which includes a brief introduction to Hail in the online methods:

Andrea Ganna, Giulio Genovese, et al. Ultra-rare disruptive and damaging mutations influence educational attainment in the general population. Nature Neuroscience

About

Scalable genomic data analysis.

https://hail.is

MIT License
