mcgml / gwasvcf

Reading, querying and writing GWAS summary data in VCF format

Lifecycle: experimental

Complete GWAS summary datasets are now abundant. A large repository of curated, harmonised and QC'd datasets is available in the IEU GWAS database. These datasets can be queried directly via the API, through the ieugwasr R package, or through the ieugwaspy Python package. However, for faster querying in an HPC environment, it is advantageous to access the data files directly rather than through cloud systems.

We developed a format for storing and harmonising GWAS summary data, known as the GWAS VCF format. All the data in the IEU GWAS database are available for download in this format. This R package provides fast and convenient functions for querying and creating GWAS summary data in GWAS VCF format. The package includes:

  • a wrapper around the Bioconductor VariantAnnotation package, providing functions tailored to GWAS VCF for reading, querying, creating and writing GWAS VCF files
  • some LD-related functions that use a reference panel to extract proxies, create LD matrices and perform LD clumping
  • functions for harmonising a dataset against the reference genome and creating GWAS VCF files.

See also the gwasglue R package for methods to connect the VCF data to Mendelian randomization, colocalisation, fine mapping etc.

Installation

remotes::install_github("mrcieu/gwasvcf")

Usage

See vignettes here: https://mrcieu.github.io/gwasvcf.
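
As a rough orientation before reading the vignettes, the reading and querying functions described above might be used along the following lines. This is a minimal sketch, not a definitive workflow. It assumes that gwasvcf and VariantAnnotation are installed, that the example file data.vcf.gz is bundled under the package's extdata directory, and that the region coordinates shown fall within that file. The variable names are illustrative only.

```r
# Sketch only: assumes the gwasvcf and VariantAnnotation packages are installed.
library(gwasvcf)

# Assumed location of the bundled example file (see "Example data" under Notes).
vcffile <- system.file("extdata", "data.vcf.gz", package = "gwasvcf")

# Read the whole GWAS VCF file into a VCF object.
vcf <- VariantAnnotation::readVcf(vcffile)

# Keep only the variants inside a chromosome:start-end window.
# The coordinates are illustrative and assumed to overlap the example data.
region_vcf <- query_gwas(vcf, chrompos = "1:1097291-1099437")

# Convert the subset to a GRanges object to inspect the summary statistics.
region_gr <- vcf_to_granges(region_vcf)
print(region_gr)
```

The same query can be pointed at a downloaded GWAS VCF file by replacing vcffile with the path to that file.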

Citation

If you use GWAS VCF files, please reference the studies whose data you use, together with the following paper:

The variant call format provides efficient and robust storage of GWAS summary statistics. Matthew Lyon, Shea J Andrews, Ben Elsworth, Tom R Gaunt, Gibran Hemani, Edoardo Marcora. bioRxiv 2020.05.29.115824; doi: https://doi.org/10.1101/2020.05.29.115824

Reference datasets

Example GWAS VCF (GIANT 2010 BMI):

1000 genomes reference panels for LD for each superpopulation - used by default in OpenGWAS:

1000 genomes European reference panel for LD (legacy):

1000 genomes vcf harmonised against human genome reference:


Notes

Example data

data.vcf.gz and data.vcf.gz.tbi contain the first few rows of the Speliotes 2010 BMI GWAS.

The eur.bed/bim/fam files cover the same range as data.vcf.gz and are taken from http://fileserve.mrcieu.ac.uk/ld/data_maf0.01_rs_ref.tgz
