zerland / apply_metaCCA

Exploring metaCCA method on UK Biobank data

Applying metaCCA to IEU-GWAS database

Project folder on epi-franklin:/projects/XremovedX/.

Project background

A dominant approach to genetic association studies is to perform univariate tests between genotype-phenotype pairs, where each SNP is examined independently for association with the given phenotype.

However, analysing the results of multiple GWAS together in a joint analysis would not only increase statistical power, but may also reveal complex associations that are only detectable when several variants or traits are tested jointly.

The IEU-GWAS database holds a lot of data (28K traits) that can be investigated in this way, and many of those traits could have novel associations and correlations. We’re particularly interested in finding genetic correlations between phenotypes that may collectively contribute to a group of diseases. In other words, we are interested in finding genes with pleiotropic effects. Such genes are not well defined but are abundant, as the findings of many GWAS overlap. The term pleiotropy describes a scenario in which the same locus (SNP/gene) affects multiple traits, via two main mechanisms:

  • horizontal pleiotropy: may lead to a better understanding of biological processes that are common between traits
  • vertical pleiotropy: can inform on causality for intervention strategies for disease prevention.

diagram

metaCCA

Paper, R package vignette

Brief introduction

MetaCCA can be used to systematically identify potential pleiotropic genes using GWAS summary statistics by combining correlation signals among multiple traits.

  • metaCCA uses GWAS summary statistics (𝛽 and std.err)
  • Can combine single or multiple studies in one analysis
  • Can use multivariable representation of both genotype and phenotype
  • Based on CCA (canonical correlation analysis)
  • The result is the maximised canonical correlation coefficient R1
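The β and standard-error inputs need to be on a common scale before they are combined across traits. A minimal sketch, assuming the post-hoc standardisation β / (SE · √N), where N is the GWAS sample size (check the paper and vignette for the exact form the package expects). The function name and example numbers are hypothetical and are not taken from this repo:

```python
import math


def standardised_effect(beta: float, se: float, n: int) -> float:
    """Rescale a univariate GWAS effect as beta / (se * sqrt(n)).

    Assumption: this is the standardisation wanted for the summary
    statistics; it is not code from this repo.
    """
    if se <= 0:
        raise ValueError("standard error must be positive")
    if n <= 0:
        raise ValueError("sample size must be positive")
    return beta / (se * math.sqrt(n))


# Hypothetical SNP-trait pair: beta = 0.02, SE = 0.005, N = 10,000
example = standardised_effect(0.02, 0.005, 10_000)
```

With these made-up inputs the z-score is 4 and √N is 100, so the rescaled effect is 4 / 100.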

metaCCA provides two types of the multivariate association analysis:

  • Single-SNP–multi-trait analysis: 1 SNP → N traits

    One genetic variant tested for an association with a set of phenotypic variables.

  • Multi-SNP–multi-trait analysis: N SNPs [genes] → N traits

    A set of genetic variants tested for an association with a set of phenotypic variables.
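For the single-SNP case with two traits, the maximised correlation has a simple closed form, which makes a useful sanity check. This is a sketch under stated assumptions: the SNP is standardised (variance 1), `b1` and `b2` are standardised univariate effects of that SNP on the two traits, and `rho` is the correlation between the traits. It is plain CCA on summary quantities and leaves out any additional adjustments the metaCCA package makes; the function name and numbers are illustrative only:

```python
import math


def single_snp_two_trait_r(b1: float, b2: float, rho: float) -> float:
    """Canonical correlation between one standardised SNP and two traits.

    r^2 = b^T S^{-1} b, where b = (b1, b2) and S = [[1, rho], [rho, 1]].
    Expanding the 2x2 inverse gives
    r^2 = (b1^2 + b2^2 - 2*rho*b1*b2) / (1 - rho^2).
    """
    if not -1.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between -1 and 1")
    r_squared = (b1 * b1 + b2 * b2 - 2.0 * rho * b1 * b2) / (1.0 - rho * rho)
    # Inconsistent inputs can push r^2 above 1; this sketch does not correct that.
    return math.sqrt(r_squared)


# Hypothetical effects on two uncorrelated traits
r_uncorrelated = single_snp_two_trait_r(0.04, 0.03, 0.0)
# Same effects, but the traits are positively correlated
r_correlated = single_snp_two_trait_r(0.04, 0.03, 0.5)
```

When the two traits are positively correlated and the effects point the same way, part of the signal is shared, so the joint correlation is smaller than in the uncorrelated case.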

The method

metaCCA operates on three pieces of the full data covariance matrix:

  • S_XX of genotype-genotype correlations
  • S_XY of univariate genotype-phenotype association results
  • S_YY of phenotype-phenotype correlations.

S_XX is estimated from a reference database matching the study population, e.g. the 1000 Genomes. S_YY is estimated from S_XY.
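To illustrate how S_YY can be derived from S_XY alone, here is a minimal sketch. It assumes the estimator is the Pearson correlation between trait columns of S_XY, computed across SNPs; that is my reading of the method, so check it against the paper and vignette. The function names and the toy matrix are hypothetical:

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("a constant column has no defined correlation")
    return cov / math.sqrt(var_x * var_y)


def estimate_s_yy(s_xy):
    """Estimate trait-trait correlations from an S_XY matrix.

    s_xy: list of rows, one per SNP; each row holds one standardised
    effect per trait. Returns a traits x traits correlation matrix.
    """
    columns = list(zip(*s_xy))  # one tuple of effects per trait
    k = len(columns)
    return [
        [1.0 if i == j else pearson(columns[i], columns[j]) for j in range(k)]
        for i in range(k)
    ]


# Toy S_XY: 3 SNPs x 3 traits (made-up numbers)
toy_s_xy = [
    [0.01, 0.02, 0.03],
    [0.02, 0.04, 0.02],
    [0.03, 0.06, 0.01],
]
toy_s_yy = estimate_s_yy(toy_s_xy)
```

In practice the estimate is only as good as the number of SNPs behind it, so a real S_XY would have many more rows than this toy matrix.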

diagram

Workflow

The analysis contains several stages:

  1. Traits/data selection
  2. Input data processing/cleaning (both GWAS and reference)
  3. Input matrix generation (S_XY)
  4. Reference matrix generation (S_XX). NB: there may be some overlap/dependence between steps 3 and 4
  5. Run metaCCA script (submit on BC3)
  6. Output processing
  7. Output annotation with GWAS catalog
  8. Visualisation

Case studies

While exploring metaCCA, I have done several case studies to investigate its properties. Each case study is described in a separate README.

  1. UK Biobank only (easiest working case) here
  2. UK Biobank + GIANT here
  3. UK Biobank (IEU) + UK Biobank (Neale Lab) here

Scripts in this repo

Workflow-related

├── main_workflow
│   ├── select_traits/biobank_traits_parser.Rmd
│   ├── parse_gwas_vcf.sh *OR*
│   ├── parse_gwas_vcf_snakemake/
│   ├── 0_standardise_nealelab_data.Rmd
│   ├── 1_prepare_data_XX_by_chr.Rmd
│   ├── 2_prepare_data_XY.Rmd
│   ├── 3_run_metaCCA_analysis.R
│   ├── 3_runmetaCCA_testing_manually.Rmd
│   ├── 4_review_results_gwascat.Rmd
│   ├── 5_visualise.Rmd
│   ├── python_LDproxies

Exploratory scripts

├── exploratory_analysis
│   ├── compare_effect_size.Rmd
│   ├── compare_r_and_r2_results.Rmd
│   ├── compare_results_UKBvsUKBGIANT.Rmd
│   └── manhattan_plot.Rmd

Data folders

The data folders are outside this repo, but the structure is as follows:

├── 1000GPdata			# raw reference data
├── annotation			# gene annotation files
├── genotype_matrix_1	# intermediate files from case study 1
├── genotype_matrix_2	# intermediate files from case study 2
├── genotype_matrix_3	# intermediate files from case study 3
├── gwas_catalog		# GWAS catalog raw and subset and annotation files
├── README.txt
├── results				# weekly file storage
├── snp_lists			# intermediate and common-to-all files
├── S_XX_matrices		# per chr LD matrices for each case study
└── S_XY_matrices		# XY matrices for each case study + standardised tsv (from rawVCF)

External data

1kg European reference panel for LD data_maf0.01_rs_ref.tgz: 9,003,401

Located here: data/1000GPdata/data_maf0.01_rs_ref

Data viz

Showing some examples of plots I've made over the course of the project

GWAS catalog annotation plots

diagram
diagram

UKB VS GIANT explorations

diagram

diagram

diagram

diagram

diagram
